ms-diagnostics: 1008;reason="Unable to resolve DNS SRV record";domain="domain.com";dns-srv-result="NegativeResult";dns-source="InternalCache";source="SfBSIP.domain.com"
SfB on-premises, patched to November 2015. Split-DNS. Firewalls, networks, and even VLANs are all highly segregated. Classic DMZ in operation with outside firewall, inside firewall, and no internet browser access from DMZ servers. Port 53 outbound from DMZ servers is not allowed.
The edge servers are using internal DNS resolution (hello InfoSec!). Everything is testing perfectly. IM/P, WebCon, media flow; the mobile clients are working, and PPT publishing internally and externally is perfect. After working through the expected HLB and firewall issues, we are looking right successful. First time through. Nailed it. But wait!
Organization moves from closed federation to open federation. About a week later we notice that federation is suddenly borked – and one-way presence rears its ugly head – it would appear that federated partner –> internal org can start things, but the opposite does not work so well. However, everything except presence works AFTER the inside person responds to the outside –> inside toast. Screen sharing also fails – unless the outside person starts the screen share, after which the inside person can share. This is a hint for you troubleshooting mavens – we’ll wait while you digest all this information.
We traced the client-side errors described above and saw the following:
…and the resulting 504.
We traced the same errors from the server side (thank you, centralized logging) and saw the same set of outcomes. Here is a simple subscribe request from the inside to a federated partner…
…and you can see the 504 – I cannot find out who I am because I cannot resolve my federation SRV record. This is not good.
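When sifting through many traced 504 responses, it can help to split the ms-diagnostics value into its fields so the error ID, the SRV lookup result, and the DNS source can be compared across entries. The following is a minimal sketch, assuming the header always has the shape shown at the top of this post (a numeric ID followed by semicolon-separated key="value" pairs); the function name parse_ms_diagnostics is hypothetical and not part of any SfB tooling.

```python
import re


def parse_ms_diagnostics(header: str) -> dict:
    """Split an ms-diagnostics header into its error ID and key="value" fields."""
    # Drop the optional "ms-diagnostics:" prefix if the whole header line was pasted.
    if header.lower().startswith("ms-diagnostics:"):
        header = header.split(":", 1)[1]
    # The first semicolon-separated token is the numeric error ID.
    error_id, _, rest = header.strip().partition(";")
    fields = {"error_id": int(error_id)}
    # Remaining tokens are key="value" pairs; keys may contain hyphens.
    for key, value in re.findall(r'([\w-]+)="([^"]*)"', rest):
        fields[key] = value
    return fields


# The sample is the header value from the trace quoted in this post.
sample = (
    'ms-diagnostics: 1008;reason="Unable to resolve DNS SRV record";'
    'domain="domain.com";dns-srv-result="NegativeResult";'
    'dns-source="InternalCache";source="SfBSIP.domain.com"'
)
diag = parse_ms_diagnostics(sample)
print(diag["error_id"], diag["dns-srv-result"], diag["dns-source"])
```

In this trace the interesting fields are dns-srv-result ("NegativeResult") and dns-source ("InternalCache"): the lookup came back negative, and the answer came from an internal cache.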
A side symptom was that we were seeing similar 504 errors on Test-CsFederatedPartner and Test-CsMcxPushNotification.
Hmmm. Does this look like the Edge server cannot find itself? Like there is no _sipfederationtls._tcp.domain.com record? Consider the locked-down environment, and the requirement that all DNS come from the inside…and the inside is going to be authoritative for the zone. Hmm. Lync 2013 documentation (essentially the same for SfB) indicates the SRV record for _sipfederationtls._tcp.domain.com needs to be on the external DNS server. So, go double-check that. Yes. We got that part right.
Simple. We put the _sipfederationtls._tcp.domain.com SRV record into place on the internal DNS, with the proper target. And then we modified the hosts file on each Edge server so that the SRV target resolves to the server's own public IP. We set a TTL of 5 minutes on the SRV record. Almost immediate relief. It was like watching Bones cure the planetwide plague with a simple shot of his hyper-injector, and you get to watch the horrible disease be cured before the next commercial break.
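For readers who want to see roughly what the two pieces of the fix look like, here is a small sketch that renders them as text and checks the hosts entry. Everything specific in it is an assumption for illustration: the target name sip.domain.com, the address 203.0.113.10 (a documentation range), the priority and weight of 0, and the helper names themselves. Port 5061 is the usual federation port and the 300-second TTL is the 5 minutes mentioned above. The record is shown in generic zone-file notation; your internal DNS console will present it differently.

```python
import socket


def srv_record_line(domain, target, port=5061, ttl=300, priority=0, weight=0):
    """Render the federation SRV record in zone-file style notation."""
    name = f"_sipfederationtls._tcp.{domain}."
    return f"{name} {ttl} IN SRV {priority} {weight} {port} {target}."


def hosts_line(ip, fqdn):
    """Render a hosts-file entry mapping the SRV target to an IP address."""
    return f"{ip}\t{fqdn}"


def target_resolves_to(fqdn, expected_ip, resolve=socket.gethostbyname):
    """Return True if the SRV target resolves to the expected IP on this machine.

    The resolver is injectable so the check can be exercised without a network.
    """
    try:
        return resolve(fqdn) == expected_ip
    except OSError:
        # Covers socket.gaierror (name not found) and similar lookup failures.
        return False


# Hypothetical values; substitute your own SIP domain, Edge FQDN, and public IP.
print(srv_record_line("domain.com", "sip.domain.com"))
print(hosts_line("203.0.113.10", "sip.domain.com"))
```

Run on an Edge server after editing the hosts file, target_resolves_to("sip.domain.com", "203.0.113.10") with your real values gives a quick yes or no on whether the server can now find the SRV target by IP.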
Why did the transition from closed federation to open federation cause this? And why did “this” take 7 days to manifest itself in failures? Why didn’t the issue show up immediately?
I can guess at the first; as to the second and third, I am clueless. I am not willing to guess in a public forum, so you will have to draw your own conclusions. But I do know what fixed this issue – adding the federation SRV record to the internal DNS zone and modifying the Edge Server hosts files so that they can find the SRV target by IP.