Background
You can start by reading this. This is a tested path forward if you find yourself in the CMS split-brain scenario described in that article. After noodling through that process yesterday, and knowing that I have customers who need this to work so they can ETHICALLY meet their SLA/RTO/RPO obligations, I got to thinking. And then Josh Walters, a co-worker of mine, made the fateful comment: "the server that gets failed over to is happy and functioning, why can't we just leave it alone?"
In relative age terms, out of the mouths of babes…I got to thinking – can I create a process that is repeatable, that comes before the Vale 19-step method, and that allows me to confidently tell customers "this works"?
Scenario
We are ignoring the RGS and Edge changes necessary for a full site failover in this article. We are totally focused on just the CMS: why the split-brain happens (theory on my part), and what to do to recover gracefully in a predictable manner (empirical on my part).
There is actually the option to not perform a CMS failover…I have had environments where the CMS was offline for extended periods of time with no ill effects. Just don’t change anything.
Our environment is two SfB SE servers, pool paired. Sfbse.tsoorad.net is the “old” master, sfbse2.tsoorad.net is the “new” master.
After setting up the pool pair, we simulated the datacenter outage by turning off sfbse.tsoorad.net, thereby making the surviving system components think that the CMS master is gone. Power off is a state that pretty much ensures that no one is talking to that server anytime soon.
The initial CMS failover goes just fine. The problem comes up when the "old" master comes back online and thinks that IT is the master. But the sfbse2 server, the "new" master, is in charge, and suddenly you cannot make changes. Classic split-brain. Replication is borked. Attribute pointers don't point. See the blank in this example where the ActiveMasterFQDN might just be something we need to know about.
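If you want to see that blank for yourself, this is the check I lean on; the property name (ActiveMasterFqdn in my lab) may read slightly differently on your build, so verify it:

# Run from the SfB Management Shell on any surviving server
Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus
# Healthy: ActiveMasterFqdn shows the current CMS master
# Split-brain: that field comes back blank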
What is causing this
If my surmise/theory is correct, the split-brain starts when the second node assumes control of the CMS. No problem. Because that node is a domain member running with the proper authority/credentials, AD gets changed, the topology gets changed, and the surviving servers in the environment start replicating from what they are told is the CMS. At this point everything is fine; the split-brain has started, just not affecting us quite yet.
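For reference, that takeover is the documented failover cmdlet. A rough sketch against my lab names (SE servers, so the CMS lands in the local RTC instance); treat the exact parameters as an assumption and confirm with Get-Help on your build:

# Point the CMS at the surviving SE server's local RTC instance
Invoke-CsManagementServerFailover -BackupSqlServerFqdn sfbse2.tsoorad.net -BackupSqlInstanceName rtc -Force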
The split-brain posture really gets wound up when the failed server comes back online and thinks that it is the CMS master. Understandable. Before whatever happened happened, that server was indeed the CMS master. But another server is now designated, the newly revived server never got re-written, and things are now just a tad stuck. Again, see this article here, as well as the Mark Vale article here.
What to do about it
The obvious answer, of course, is the easiest. We will wait right here while you locate your copy of last night's backup script and the resulting Export-CsConfiguration and Export-CsLisConfiguration files, and carefully resolve NOT to use them (they point to the OLD master, and the NEW master is up and running – and in the immortal words of Josh Walters, "can't we just leave it alone?"). Keep in mind that you don't HAVE to move the CMS back. To dovetail with the Savant Walters, we can further notice that the CMS has a failover cmdlet, but no failback cmdlet.
You will make new ones in the next section, and they will be better because they will not reference the original (pre-failure) CMS master as being the master, or as being "active".
Fix me!
From the "new" master, run:
- Export-CsConfiguration (we are just being thorough; you should not need this file for this exercise)
- Export-CsLisConfiguration (ditto)
- Place your new exports where you can use them in case you don’t already have them, and then throw them away after the next time your backup captures that data. If you get to the end of all this and invoke a failback to the “old” master, you can throw the exports away in that case also. You do have a plan, right?
- stop services on “new” master: FTA, LyncBackup, Master, Replica
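For the shell-inclined, here is a sketch of those four bullets from the SfB Management Shell on the "new" master. The export path is just an example, and the service short names are what Get-CsWindowsService reports in my lab, so verify yours first:

# Belt-and-suspenders exports (file names are arbitrary examples)
Export-CsConfiguration -FileName "C:\CMSExport\csconfig.zip"
Export-CsLisConfiguration -FileName "C:\CMSExport\lisconfig.zip"
# Stop the replication-related services on the "new" master
Stop-CsWindowsService -Name FTA
Stop-CsWindowsService -Name LyncBackup
Stop-CsWindowsService -Name MASTER
Stop-CsWindowsService -Name REPLICA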
Bring the “old” master back online.
- From the “old” master, stop services: FTA, LyncBackup, Master, Replica
From the “new” master:
- Install-CsDatabase -CentralManagementDatabase -SqlServerFqdn sfbse.tsoorad.net -SqlInstanceName RTC -Clean -Verbose
From the "old" master:
- start the SfB deployment wizard
- Run Step 1 (Install Local Configuration Store)
- Run Step 2 (Setup or Remove Skype for Business Server Components)
- Start the services stopped earlier BUT DO NOT START MASTER
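That last bullet, sketched out from the Management Shell on the "old" master (same caveat about the service short names as above):

# Restart what we stopped earlier on the "old" master
Start-CsWindowsService -Name FTA
Start-CsWindowsService -Name LyncBackup
Start-CsWindowsService -Name REPLICA
# Deliberately NOT starting MASTER; the "new" master still owns the CMS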
From the “new” master:
- Invoke-CsBackupServiceSync -PoolFqdn sfbse2.tsoorad.net
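If you want to watch the backup service catch up before moving on, this is the status check I use (verify the parameter set against your build):

Get-CsBackupServiceStatus -PoolFqdn sfbse2.tsoorad.net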
Wait a bit, then run through:
Get-CsManagementConnection (should show the "new" master)
Get-CsService -CentralManagement
Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus
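What I am looking for when those come back; the healthy values here are from my lab, so treat them as an illustration rather than gospel:

Get-CsManagementConnection
# Should point at sfbse2.tsoorad.net, the "new" master
Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus
# ActiveMasterFqdn should now be populated instead of blank
Get-CsManagementStoreReplicationStatus | Where-Object { -not $_.UpToDate }
# Once replication settles, this should return nothing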
Here we are, fixed. Note that the "new" master is still the master, but the "old" master now knows it is no longer the master and is subordinate to the "new" master. All this is just fine. We don't care WHERE or WHO holds the CMS master role, as long as we have a posture where we can read/write the topology.
At this point, you could do another Invoke-CsManagementServerFailover and get the CMS back over to the "old" master…if you are into consistency like me, then that is what you will do. If you are like others, you can leave the CMS on the "new" master, and everything will be fine.
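If you do want that consistency, the failback is just the same cmdlet aimed back at the "old" master's local RTC instance. Again, a sketch against my lab names; confirm the parameters with Get-Help before running it:

# Fail the CMS back to the "old" master's local RTC instance
Invoke-CsManagementServerFailover -BackupSqlServerFqdn sfbse.tsoorad.net -BackupSqlInstanceName rtc -Force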
Summary
Seeing as how there is no failback cmdlet, could it possibly be that this is all by design, and was never properly documented on the way out of Microsoft-land?
Empirically, as long as both SE pool pair members are up, the CMS failover process is just fine. If the "old" master is down, things go bad quickly, and the prudent admin will be prepared to handle that scenario – however remote the possibility may be.
If your CMS fails, then you could be failing also. Invoke-CsManagementServerFailover is wonderful, provided all the players are still running. Not so hot when the existing master is no longer available. This process will get you into a posture of success; it is repeatable, and it is not too onerous. Ergo, we have something I can feel somewhat good about taking to the customer.
YMMV