I'm experiencing an odd resolution issue on my DNS servers. I have a couple of decades of experience administering Windows DNS servers, but less than a year's experience administering Ubuntu/BIND 9 servers. A little background on my environment:
I work for a small service provider and administer three Ubuntu/BIND 9 servers. They are configured as a Master and two Slaves. All three servers have private IP addresses on a VLAN reserved for servers, with static NATs on our firewall to the Slaves. The Slaves are accessible from both our internal network and the public internet, but public recursion is limited to the IP subnets we host. The Master only allows access from the two Slaves and from our internal Management VLAN. The Master and Slave2 are Ubuntu 12.04 running BIND 9.8.1-P1. Slave1 is an older system, scheduled for replacement, running Ubuntu 9.04 and BIND 9.8.1-P1. I am seeing the same problem behavior on both Slaves. I built the Master and Slave2, and inherited Slave1 from a previous admin.
Here's the problem: If I run nslookup for office365.com. from a system on one of our hosted IP subnets, it resolves successfully. If I try to resolve outlook.office365.com. from the same system, I get the following error:
***UnKnown can't find outlook.office365.com.: Unspecified error
I can successfully resolve both of those names with nslookup from a system on the server VLAN and from the console of both Slave servers. This problem was reported to me by a client who stated that he's seen this issue with a handful of hostnames, but outlook.office365.com is the only one he could specifically remember. I've tried a number of other hostnames and they all resolve successfully. I can only replicate the issue with that one name. (Hopefully the client will remember some more.)
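Since nslookup only reports "Unspecified error", it may help to see the raw response code each server returns when queried from each vantage point. The following is a minimal Python sketch of that comparison, not a replacement for a proper tool such as dig. The function names (`build_query`, `parse_header`, `query`) are hypothetical, and the sketch sends a plain UDP query without EDNS, so a large answer may come back with the truncated flag set rather than with a full answer.

```python
import socket
import struct

# RCODE names from RFC 1035, section 4.1.1
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query: one question, recursion desired."""
    # Header: ID, flags (0x0100 = RD bit), QDCOUNT=1, other counts 0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = [label for label in name.rstrip(".").split(".") if label]
    # Each label is encoded as a length byte followed by the label bytes
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels)
    qname += b"\x00"  # root label terminates the name
    return header + qname + struct.pack("!HH", qtype, 1)  # class 1 = IN


def parse_header(response):
    """Return (rcode name, truncated flag, answer count) from a response."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    rcode = flags & 0x000F            # low four bits of the flags word
    truncated = bool(flags & 0x0200)  # TC bit
    return RCODES.get(rcode, str(rcode)), truncated, ancount


def query(server, name, timeout=3.0):
    """Send one UDP query to `server` and summarize the response header."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        data, _ = sock.recvfrom(4096)
    return parse_header(data)
```

Calling `query("192.0.2.53", "outlook.office365.com.")` (with the placeholder address replaced by a real Slave address) from a hosted subnet and again from the server VLAN would show whether the two vantage points receive different response codes, a truncated reply, or no reply at all (a `socket.timeout`).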
I set up a query.log, based on an article I found on this site, and I see the request come in regardless of where it originates.
Example:
client MYIPADDRESS#1067: query: outlook.office365.com IN A + (BINDSERVERIPADDRESS)
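To compare failing and working requests side by side, the query-log lines can be split into fields. This is a minimal sketch that assumes the line layout shown in the example above. The function name `parse_query_line` is hypothetical, and the flag decoding follows my reading of the BIND documentation (`+` for recursion desired, `E` for EDNS, `T` for TCP), so treat it as an assumption to verify against the version in use.

```python
import re

# Matches lines such as:
# client 192.0.2.10#1067: query: outlook.office365.com IN A + (192.0.2.53)
QUERY_RE = re.compile(
    r"client (?P<client>\S+)#(?P<port>\d+).*?: query: "
    r"(?P<qname>\S+) (?P<qclass>\S+) (?P<qtype>\S+) "
    r"(?P<flags>\S+)(?: \((?P<server>[^)]+)\))?"
)


def parse_query_line(line):
    """Return a dict of query-log fields, or None if the line does not match."""
    match = QUERY_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    flags = record["flags"]
    # Assumed flag meanings: '+' recursion desired, 'E' EDNS, 'T' TCP
    record["recursion_desired"] = flags.startswith("+")
    record["edns"] = "E" in flags
    record["tcp"] = "T" in flags
    return record
```

Under that reading, the example line would decode as a recursive query sent over UDP without EDNS.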
If I change my DNS server to 8.8.8.8 or 4.2.2.2, it resolves correctly; adding both of those as forwarders on my BIND servers doesn't fix the problem. I've checked my syslog, but see no entries regarding that query that could offer clues. I also tried allowing recursion from "any", but got the same result. I've also reviewed our firewall rule set and don't see anything that could account for this. It seems that if an access list were the problem, no DNS queries would work. Anyone have any ideas? Is there a way to log a reason for a query failure?