Wednesday 6 July 2011

WARNING: the Domino 852* ID Vault does not play well with the 850* client code stream and there is no fix!

This is by way of a warning to anyone who uses ID Vault, has clients on the 850* code stream and is considering moving their Domino servers up to the 852 code stream - DON'T DO IT!

Last week we moved one of our production servers (300+ mail clients) up to Domino 852FP2, and the Admin server for the domain started to get very, very slow. Bandwidth vanished and everything across the 20+ servers suddenly got very groggy.

We quickly tracked the problem to ADMIN4.NSF: it was 2.4GB in size and had over 6 million documents in it. 5,996,891 of them were HTTP Password Change requests originating on the server we had just upgraded to 8.5.2FP2. A morning of testing showed that any Notes client on the 850* code stream attaching to the affected server was generating an HTTP Password Change request every time it started an NRPC session with the server. In some connection scenarios thousands of these requests were appearing every second. They then replicated from the affected server over to the Admin server in Ireland, where they were actioned and the NAB was updated. The NAB and ADMIN4 then replicated back to the affected server in the Czech Republic and out to all the other servers in the domain ... what fun!!!
A road warrior user reported a client directory replication of 26,000+ updates over the space of 12 hours.
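
If you want to check whether your own Admin Requests database is filling up the same way, the triage boils down to counting requests by type and originating server and seeing whether one pair dominates. A rough Python sketch, assuming you have already pulled the request documents out of ADMIN4.NSF into a list of records. The field names `action` and `server`, the function name and the sample data are all made up for illustration; they are not Domino item names.

```python
from collections import Counter


def flag_request_floods(requests, min_share=0.5):
    """Return (action, server, count, share) for every (action, server)
    pair that makes up at least min_share of all requests."""
    # Count how many requests each (action, server) pair produced.
    counts = Counter((r["action"], r["server"]) for r in requests)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (action, server, n, n / total)
        for (action, server), n in counts.most_common()
        if n / total >= min_share
    ]


# Made-up sample data: one upgraded server producing most of the requests.
sample = (
    [{"action": "HTTP Password Change", "server": "UpgradedServer"}] * 9
    + [{"action": "Rename Person", "server": "AdminServer"}] * 1
)

for action, server, n, share in flag_request_floods(sample):
    print(f"{action} from {server}: {n} requests ({share:.0%} of total)")
```

In our case nearly every document in the database was the same request type from the same server, so any threshold would have caught it.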

We got around this initially by changing the ACL on ADMIN4.NSF to deny access to the affected server. This stopped the HTTP requests appearing in ADMIN4 and allowed the network to return to normal. In the meantime I turned off the sync of the Internet password in the security policy for that server and pushed out the policy. I then relaxed the ACL and the HTTP requests disappeared. One quick agent run later, ADMIN4.NSF had all the rogue records deleted and had been compacted.
Now the change to "non-syncing the HTTP password", while outside the strictest interpretation of our security policy, was no "real" problem short term, as the password change intervals gave us a couple of weeks of grace before it became one.

I opened a PMR with IBM and informed them of the situation and the facts of the case. Their response was quick and unexpected: OK, upgrade all your clients to the 851 or 852 code stream ... that is the fix ... can we close the PMR?
WTF?
Sorry what?
Could you repeat that?
A potentially server- and domain-threatening bug, which IBM acknowledge is a problem, in code streams (850* and upwards) that are all within support, and there is and will be no fix, just the advice to upgrade all your 850* clients???

I rechecked the upgrade instructions and there was not a hint, link or suggestion that this could or would be a problem. Needless to say, had there been an "If you are running... etc" warning I would NOT have upgraded the fecking server until all the clients were on the 851 code stream or better. There are Tech Notes that mention it at other releases, but not the combination we had. The problem also went unnoticed in testing: there were no actual errors as such, and on a standalone server in a standalone domain, with no replication or heavy user load, the rogue HTTP requests were all created, actioned and resolved quickly and silently.



Now a quick check on the interweb showed that this has been a problem on and off for several 8* releases and that it has been addressed previously ... I am asking myself how the same bug gets back into the code stream with such seeming regularity. The answer to that I will leave you to make your own mind up about .. but I am thinking change control.

So "Upgrade all your clients or roll back the server" were the only options we were given in answer to our PMR, followed by a request for it to be closed. Well, in that situation, why the hell ask me if the PMR can be closed? For me there is still a problem: it is a bug and it has NOT been addressed, other than by landing us with unexpected, unplanned and unbudgeted upgrades to 200-odd clients before the end of the month. Even so, I am assured that the PMR will not result in a fix and that, old chum, is that!

I made sure that my reluctance to close the PMR was noted and that I thought my expectation of "support" for an active code stream was very markedly different from IBM's. I also stressed that it would have been helpful if this problem had been written in a nice large font in the upgrade notes.

IF YOU ARE USING ID-VAULT AND HAVE 850 CLIENTS DO NOT UPGRADE.

would have helped more than the suggested fix.

A colleague had encouraged me to open this PMR. My already somewhat jaundiced view of the support offered not only by IBM but by other big companies made me reluctant to do so, but the idea that "Well, things will never get fixed if you don't report them" convinced me that I should be a good upstanding net-citizen and report the problem.

I did .. and it gives me no satisfaction to report that my experience was a factor of annoyance many times worse than even this grumpy 30-year IT industry cynic could have imagined. Will I waste my time reporting an issue the next time we have one, or will I just not bother and find my own way to work around the problem?


Bah Humbug!

Anyway, rant aside.. please do be careful if you are planning an upgrade. While this will not crash your servers, it does leave ADMINP jobs taking 70+% of the CPUage, and your bandwidth will probably drop to 1989 speeds. And remember, IBM don't mention this in the upgrade notes, so it will take you by surprise.

** UPDATE ** I forgot to mention the replication conflicts in the NAB if you have hub and spoke servers. Since the NAB is being updated several times a second, you get lots and lots of replication conflicts when the remote servers can't keep up.

