Zerigo Blog

June 9, 2010 at 15:03 by thomas

This morning's DNS issues

As many of you already know, we had a partial DNS outage this morning.

Two of our nameservers were affected: a.ns.zerigo.net and b.ns.zerigo.net. The others continued to function fully.

Initially, the affected nameservers failed to respond to only some DNS queries. Secondary effects then cascaded into complete DNS resolution failure on those nameservers. In total, DNS query resolution was below 100% for just under two hours.

Each affected nameserver handles two separate DNS functions for our network: standard DNS query resolution and slave zone transfers from external masters.

This morning’s problems stemmed from the latter function. Malformed data from a zone transfer caused the nameserver processes to die repeatedly. We have identified the malformed data, which is a narrow, specific edge case. The nameserver processes clearly did not handle this particular data properly, and we will be working to get this fixed.

Our DNS servers also each run an additional guardian process designed to restart the nameserver process in the event that it dies. While the guardians did restart the nameserver processes for a while, in the face of nameservers repeatedly dying, the guardians eventually failed. We will also be working to get this resolved.
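For readers curious what a guardian of this kind can look like, here is a minimal sketch in Python. It is not our production guardian: the names `guard` and `backoff_delay`, the timing values, and the exponential backoff strategy are all assumptions chosen for illustration. The idea it shows is that a guardian facing a process that dies over and over should slow its restarts down rather than give up or spin.

```python
import subprocess
import time


def backoff_delay(consecutive_failures, base=0.5, cap=30.0):
    """Seconds to wait before the next restart: exponential, capped at `cap`.

    Hypothetical policy for illustration: 0.5s, 1s, 2s, ... up to 30s.
    """
    if consecutive_failures <= 0:
        return 0.0
    return min(cap, base * (2 ** (consecutive_failures - 1)))


def guard(command, max_restarts=None, healthy_after=60.0, sleep=time.sleep):
    """Run `command` and restart it every time it exits.

    `max_restarts=None` means keep restarting forever. A run that lasts at
    least `healthy_after` seconds resets the failure streak, so one crash
    after a long healthy period is restarted immediately.
    Returns the number of restarts performed.
    """
    failures = 0
    restarts = 0
    while True:
        started = time.monotonic()
        subprocess.call(command)  # blocks until the guarded process exits
        ran_for = time.monotonic() - started

        # Quick deaths extend the failure streak; a long run clears it.
        failures = 0 if ran_for >= healthy_after else failures + 1

        if max_restarts is not None and restarts >= max_restarts:
            return restarts

        # Wait longer after each consecutive quick death.
        sleep(backoff_delay(failures))
        restarts += 1
```

A call such as `guard(["/path/to/nameserver"])` (a placeholder path) would supervise the process indefinitely. Capping the delay keeps the worst-case recovery time bounded once the underlying cause is fixed.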

We already have a project underway that includes separating all zone transfer functions from all DNS resolution functions. Had this project been finished and deployed, today’s issue would have been much more contained: only zone transfers would have been affected and DNS query resolution would have remained at 100%.

The good news is this project (which is broader in scope than just relocating zone transfer functions) is nearly complete. We are going to accelerate the final deployment of the entire project. More news about that transition will be published this week.

We sincerely apologize for today’s DNS outage. We know we have let you down. As outlined above, we’ll be making specific changes going forward to not only address the particular causes of today’s issues, but also to improve our network’s resiliency to the unexpected.

Please do not hesitate to contact us if you have any remaining questions or concerns.

Comments

On June 10, 2010 at 21:28, Travis Warlick said:

It is refreshing to see a company take responsibility. I will definitely be moving my DNS & monitoring to you in the very near future.