About Me

TsooRad is a blog for John Weber. John is a Skype for Business MVP (2015-2016) - before that, a Lync Server MVP (2010-2014). My day job is titled "Technical Lead, MS UC" - I work with an awesome group of people at CDW, LLC. I’ve been at this gig in one fashion or another since 1988 - starting with desktops (remember Z-248’s?) and now I am in Portland, Oregon. I focus on collaboration and infrastructure. This means Exchange of all flavors, Skype, LCS/OCS/Lync, Windows, business process, and learning new stuff. I have a variety of interests - some of which may rear their ugly head in this forum. I have a variety of certifications dating back to Novell CNE and working up through the Microsoft MCP stack to MCITP multiple times. FWIW, I am on my third career - ex-USMC, retired US Army. I have a fancy MBA. One of these days, I intend to start teaching. The opinions expressed on this blog are mine and mine alone.

2016/02/17

1008;reason=Unable to resolve DNS SRV record

ms-diagnostics: 1008;reason="Unable to resolve DNS SRV record";domain="domain.com";dns-srv-result="NegativeResult";dns-source="InternalCache";source="SfBSIP.domain.com"
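For anyone scraping these out of logs, here is a minimal Python sketch that splits an ms-diagnostics header like the one above into its error code and fields. The function name `parse_ms_diagnostics` is made up for illustration, and the parsing is deliberately naive: it assumes no semicolons appear inside the quoted values, which holds for this header but may not hold for every one you run into.

```python
def parse_ms_diagnostics(header):
    """Split an ms-diagnostics header into its error code and key/value fields.

    Naive sketch: assumes the quoted values contain no ';' characters.
    """
    # Drop the "ms-diagnostics:" prefix when it is present.
    name, sep, value = header.partition(":")
    if not sep or name.strip().lower() != "ms-diagnostics":
        value = header

    parts = [p.strip() for p in value.split(";") if p.strip()]

    # The first element is the numeric error code (1008 in this post).
    result = {"code": int(parts[0])}

    # Everything after it is key=value, with the value usually quoted.
    for part in parts[1:]:
        key, _, val = part.partition("=")
        result[key.strip()] = val.strip().strip('"')
    return result


if __name__ == "__main__":
    line = ('ms-diagnostics: 1008;reason="Unable to resolve DNS SRV record";'
            'domain="domain.com";dns-srv-result="NegativeResult";'
            'dns-source="InternalCache";source="SfBSIP.domain.com"')
    for key, val in parse_ms_diagnostics(line).items():
        print(f"{key}: {val}")
```

Pulling the fields apart makes it easier to compare `domain`, `dns-srv-result`, and `source` across a pile of 504s.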

Scenario Outline

SfB on-premises, patched to November 2015. Split-DNS. Firewalls, networks, and even VLANs are all highly segregated.  Classic DMZ in operation with outside firewall, inside firewall, no internet browser access from DMZ servers.  Port 53 outbound from DMZ servers is not allowed.

The edge servers are using internal DNS resolution (hello InfoSec!).  Everything is testing perfectly.  IM/P, WebCon, media flow; the mobile clients are working, and PPT publishing internally and externally is perfect.  After working through the expected HLB and firewall issues, we are looking right successful.  First time through.  Nailed it.  But wait!

The organization moves from closed federation to open federation.  About a week later we notice that federation is suddenly borked – and one-way presence rears its ugly head – it would appear that federated partner –> internal org can start things, but the opposite does not work so well.  However, everything except presence works AFTER the inside person responds to the outside –> inside toast.  Screen sharing fails also – unless the outside person starts the screen share, then the inside person can share.  This is a hint for you troubleshooting mavens – we’ll wait while you digest all this information.

TShoot

We traced the client-side errors above and saw the following:

Subscribe attempt…

image

…and the resulting 504.

image

We traced the same errors from the server side (thank you, centralized logging) and saw the same set of outcomes.  Here is a simple subscribe request from the inside to a federated partner…

image

…and you can see the 504 – I cannot find out who I am because I cannot resolve my federation SRV record.  This is not good.

image

A side symptom was that we were seeing similar 504 errors on Test-CsFederatedPartner and Test-CsMcxPushNotification.

Hmmm.  Does this look like the Edge server cannot find itself?  Like there is no _sipfederationtls._tcp.domain.com record?  Consider the lock-down environment, and the requirement that all DNS come from the inside…and the inside is going to be authoritative for the zone.  Hmm.  Lync 2013 documentation (essentially the same for SfB) indicates the SRV record for _sipfederationtls._tcp.domain.com needs to be on the external DNS server. So, go double check that.  Yes.  We got that part right. 

The Fix

Simple.  We put the _sipfederationtls._tcp.domain.com SRV record into place on the internal DNS, with the proper target.  And then modified the host file on each Edge server to have the public IP for themselves.  We did a TTL of 5 minutes on the SRV record.  Almost immediate relief.  It was like watching Bones cure the planetwide plague with a simple shot of his hyper-injector, and you get to watch the horrible disease be cured before the next commercial break.
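To make the two pieces of the fix concrete, here is a rough Python sketch of what each one looks like as text, plus a tiny lookup that mimics an Edge server finding the SRV target in its host file. Everything specific here is an assumption for illustration: the target `sip.domain.com`, the documentation-range address `203.0.113.10`, priority and weight of 0, and port 5061 (the usual federation port, not stated in this post). The SRV line is written in generic zone-file style; your internal DNS may well be managed through a GUI instead. Only the 5-minute (300 second) TTL comes from the fix above.

```python
def srv_record_line(domain, target, ttl=300, priority=0, weight=0, port=5061):
    """Format a federation SRV record in generic zone-file style.

    ttl=300 matches the 5 minute TTL from the post; priority, weight,
    and port are illustrative assumptions.
    """
    return (f"_sipfederationtls._tcp.{domain}. {ttl} IN SRV "
            f"{priority} {weight} {port} {target}.")


def hosts_line(public_ip, fqdn):
    """Format a host file entry mapping the SRV target to a public IP."""
    return f"{public_ip}\t{fqdn}"


def resolve_with_hosts(fqdn, hosts_text):
    """Return the IP mapped to fqdn in host-file text, or None if absent."""
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        if fqdn.lower() in (n.lower() for n in names):
            return ip
    return None


if __name__ == "__main__":
    target = "sip.domain.com"   # hypothetical Access Edge FQDN
    public_ip = "203.0.113.10"  # hypothetical public Access Edge IP

    print(srv_record_line("domain.com", target))
    entry = hosts_line(public_ip, target)
    print(entry)
    print(resolve_with_hosts(target, entry))
```

The point of the host file entry is that the SRV target can keep resolving to whatever the internal zone needs for everyone else, while the Edge servers alone see the public IP.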

But WHY?

Why did the transition from closed federation to open federation cause this?  And why did “this” take 7 days to manifest itself in failures?  Why didn’t the issue show up immediately?

Summary

I can guess at the first; as to the second and third, I am clueless. I am not willing to guess in a public forum, so you will have to draw your own conclusions. But I do know what fixed this issue – the federation SRV record being added to the internal DNS zone and modification of the Edge Server host files so that they can find the SRV target by IP.

YMMV

2 comments:

Zyphen said...

Hey what do you mean by "And then modified the host file on each Edge server to have the public IP for themselves"

we are having the exact same issue

tsoorad said...

@zyphen
"But I do know what fixed this issue – the federation SRV record being added to the internal DNS zone and modification of the Edge Server host files so that they can find the SRV target by IP."

So, create a federation SRV record internally for the edge server to find. And if that SRV has a target of xyz.domain.com, and you need that for something else internally, you can go to the edge server, and put an entry into the host file for the target of the federation SRV record. When the Edge server looks up the federation SRV record, it will also try to resolve the target - and when it finds that target FQDN in its host file, it will use it.
So you use the edge server's access edge FQDN as the target of the federation SRV, and the host file gets the PUBLIC access edge IP = FQDN and problem solved.

Clear as mud?

