SfB pool pairing is a measure by which SfB attains disaster recovery, either locally (inside the same DC) or, at minimum, to a second DC. I say a "measure" because the end result is not fully automatic; there is an established procedure you must run to complete the failover process. Overall, considering the vagaries of networks and host environments, pool pairing is a very viable option.
The problem that surfaced for me is that I have been heads down on two projects since 4Q 2016. Outside of some side trips for pre-sales, and minor involvement with long-term customers, two Enterprise Pool projects have held my attention for the last 12 months or so. And during that time frame, I swear I saw traffic that indicated the following issue/bug/problem/glaring oversight was fixed in a CU. Apparently, my myopic world of EE servers and pool pair worked just fine while the SE world floundered on.
Brand new set of SE servers, one per DC. Server 2012 R2 patched up to current, SfB installed from RTM and then updated with the May 2017 CU (on a tangential note: has anyone else noticed that there have been no on-premises updates for nine months?). Change the topology to create an SE-SE pool pair. The CMS was on the DC1 SE; it is, of course, now present on both SEs. All of the pool pair setup seemed to go normally.
At this point, we need to test failover/failback, document that procedure for our specific environment, and we have accomplished a set of deliverables. Except that it doesn't work. We will pause here while everyone brushes up on Invoke-CsManagementServerFailover and Invoke-CsPoolFailOver.
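For reference, in a planned (both-pools-healthy) scenario those two cmdlets are simple to invoke. A hedged sketch, run from the SfB Management Shell, using the lab FQDNs from this post:

```powershell
# Sketch only: planned failover with both pool members still reachable.

# Fail the front-end pool over to its paired backup pool:
Invoke-CsPoolFailOver -PoolFqdn "sfbse.tsoorad.net" -Verbose

# Move the CMS master to the other member of the pool pair; with the
# current master still up, the bare cmdlet is all you need:
Invoke-CsManagementServerFailover
```

The interesting part, covered below, is what extra parameters are required when the current CMS master is actually down.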
What we did and observed
First, we failed the sfbse.tsoorad.net server so as to simulate the DC failure. No datacenter, no CMS host. We need to come live in the second data center on the pool pair partner (sfbse2.tsoorad.net). Note that the Invoke-CsManagementServerFailover command line has gotten a bit longer than just the bare cmdlet. You can run just the cmdlet from the new target in a planned move and it will work... but with the existing CMS master down, the cmdlet needs some additional parameters.
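With the original master unreachable, the failover has to be told where the backup CMS store lives. A hedged sketch using the lab FQDNs from this post; on a Standard Edition server the CMS database sits in the local RTC SQL Express instance, so that instance name is assumed here:

```powershell
# Sketch only: CMS failover when the current master (sfbse.tsoorad.net)
# is down. Point the cmdlet at the backup CMS store on the surviving SE.
# "rtc" is assumed to be the local SQL instance on the SE server.
Invoke-CsManagementServerFailover -BackupSqlServerFqdn "sfbse2.tsoorad.net" `
    -BackupSqlInstanceName "rtc" -Force

# The user-facing pool failover then runs in disaster mode, since the
# source pool cannot be contacted:
Invoke-CsPoolFailOver -PoolFqdn "sfbse.tsoorad.net" -DisasterMode -Verbose
```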
We start the entire thing by doing a CMS failover. Everything goes as advertised. Replication occurs after the move as expected. Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus contains the anticipated values in the anticipated attributes. So far, so good.
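The verification step looked roughly like this (a sketch; check the active-master and file-transfer-agent values in the output against the host you just failed over to):

```powershell
# Confirm the CMS move completed and replication is healthy.
# The -CentralManagementStoreStatus switch reports which server the
# environment currently believes is the CMS master.
Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus

# And the per-replica view, which should show UpToDate = True for
# the surviving servers:
Get-CsManagementStoreReplicationStatus
```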
Screeching halt on trying to move the CMS back. Uh oh. Not having a CMS is a bad thing. Lucky me, I ALWAYS have a copy of the CMS handy. You should have some knowledge of Export-CsConfiguration and Export-CsLisConfiguration and what those files are good for; here is an example of what they are good for! While you are at it, you may want to refresh yourself on Export-CsRgsConfiguration and Import-CsRgsConfiguration, seeing as how we are talking about doing pool failover.
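A hedged sketch of the exports worth having on hand before any failover testing; the file paths here are my own invention:

```powershell
# Snapshot the CMS, the LIS (E9-1-1 location) data, and the Response
# Group configuration BEFORE testing. Paths are illustrative.
Export-CsConfiguration -FileName "C:\Backup\cms-backup.zip"
Export-CsLisConfiguration -FileName "C:\Backup\lis-backup.zip"
Export-CsRgsConfiguration -Source "ApplicationServer:sfbse.tsoorad.net" `
    -FileName "C:\Backup\rgs-backup.zip"
```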
Having resolved the outage that caused us to failover, we need/want to failback to the original CMS holder…
While we have a good starting point, that is where the goodness stops.
This is scenario #3 as shown here. Not good. Especially since the stated cause for scenario #3 is scenario #1 or #2, and we match neither. Oh joy.
Fast forward through a lot of verifying and google-fu. Rather than write up what fellow MVP Mark Vale has already done so well… Let’s just wait while you read all of his write up. You can access that right here.
It is entirely correct to observe that I did not have the exact same symptoms, but enough matched up that I tried his 19 (!) step fix. While my errors did not match the KB article, my symptoms and outcomes came close to matching the Mark Vale article. And his fix worked for me.
As a follow-up test, once I recovered, I decided to be Mr. Clever and redid the entire test. Guess what? Different results. Nice, consistent inconsistency. And then, in a fit of sheer madness, I did it again. And got yet a different set of outcomes. However, I will note that in each case, Mr. Vale's 19-step program worked. Sometimes I had to work through the steps more than once to resolve things.
As Mr. Vale opines, make sure you have a good copy of the CMS via the Export-CsConfiguration method. Saves your bacon, less filling, and works. If all else fails, you can go full DR mode and restore the CMS from scratch (albeit that is a process with its own details :) ). If you read the Vale process, what is being done there is a manual recovery of the CMS.
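The import side of that safety net, sketched under the assumption that you have a zip produced earlier by Export-CsConfiguration (the file path is illustrative):

```powershell
# Sketch only: re-seed a recovered CMS from an earlier
# Export-CsConfiguration snapshot.
Import-CsConfiguration -FileName "C:\Backup\cms-backup.zip"

# On a server that keeps a local replica of the CMS, -LocalStore
# targets that local copy instead of the central store:
Import-CsConfiguration -FileName "C:\Backup\cms-backup.zip" -LocalStore
```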
Also, note that doing the CMS failover while the existing master is still available is no issue. Things work as expected. The problem starts when you try an actual failure scenario: the original master never gets updated and the environment ends up in a quasi-split-brain posture. You may get lucky and have the new master be intact. At least one of my tests ended up with the new master being orphaned as well, at which point you need to be ready for serious surgery.
I think the next test will be inserting portions of the 19 steps into the failover scenario BEFORE trying to get back to the original master. And remember, your backup/recovery process is only as good as the last time you tested successfully.
I think this whole thing falls squarely into the "you must be kidding me" category. Alas, I am not kidding.