Zerigo Blog

July 27, 2012 at 23:11

Life with Zerigo DNS after July 22nd

What happened?

On July 22nd we found ourselves on the receiving end of a massive stream of DNS replies generated by a DNS amplification attack significantly larger than any we had been subject to in the past. The volume of packets overwhelmed both our DNS server infrastructure and our ISP transit networks at SoftLayer and Linode. We had successfully blocked smaller attacks before, but this attack was so large that alarms at our ISPs went off in concert: null routes were placed, the attack was blocked, and most of our nameservers, spread across different continents, countries, data centers and ISPs, went down at once!


We spent a few hours characterizing the attack. We let people know what was happening and set out to block the influx in a way that would allow us to continue to serve DNS requests. Eventually we found a way to filter the attack very specifically and resumed service.
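To illustrate what "filtering very specifically" can mean for a flood of DNS replies, here is a minimal sketch in Python. It is not the filter we deployed; it only shows one property such a filter could key on. It assumes an authoritative-only nameserver address that never sends outbound queries, so any inbound packet with the QR (response) bit set in the DNS header is unsolicited. The function name `looks_like_dns_response` is hypothetical.

```python
import struct

DNS_HEADER_LEN = 12   # fixed size of the DNS header in bytes
QR_RESPONSE_BIT = 0x8000  # top bit of the 16-bit flags field: 0 = query, 1 = response


def looks_like_dns_response(payload: bytes) -> bool:
    """Return True if a UDP payload has a DNS header with the QR bit set.

    Assumption: the receiving address only answers queries, so any
    inbound "response" was never asked for and can be dropped.
    """
    if len(payload) < DNS_HEADER_LEN:
        # Too short to carry a DNS header; not classified as a response here.
        return False
    # The first four bytes are the transaction ID and the flags field,
    # both 16-bit big-endian integers.
    _txid, flags = struct.unpack("!HH", payload[:4])
    return bool(flags & QR_RESPONSE_BIT)
```

In practice a check like this would be applied in a firewall or at the carrier's edge rather than in application code, because the links themselves were saturated before packets reached our servers.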

What are you doing to make sure this won’t affect service in the future?

  1. Increase capacity


    Our immediate plans are to increase capacity to the point where we could ride out a similar attack with no impact on our customer DNS services. The big problem during the characterization stage was that our transit links were saturated, making direct sampling of the traffic very difficult. We are currently in discussions with providers in Asia, Europe and North America to substantially increase our IP transit capacity in each location.

  2. Negotiate DDoS mitigation policies


    We will negotiate new, more favorable policies with all of our existing carriers covering events triggered by their dynamic DDoS mitigation systems. Our preference is that these systems be used to block the source of the DDoS stream, not the destination! Obviously this is easier said than done considering this attack appeared to come from over 38,000 unique source IP addresses (mostly in Europe), but we feel basic policies could have prevented these systems from disabling multiple nameservers around the world at exactly the same time. We will absolutely insist that any null route placed by a carrier in relation to our service be removed at our request, provided we have established filters that reasonably block the attack which caused the null route to be entered. We cannot be put in a position where we must wait hours for a null route to be lifted while our infrastructure remains inaccessible.


    Because the DDoS policies with our carriers had not been specifically negotiated, nameservers at multiple sites in the United States and the UK could be disabled simultaneously. By contrast, our nameservers in Asia and Europe remained viable for most of the outage specifically because our carrier in those locations (Voxel) did not null route our traffic.
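To give a feel for why blocking at the source is "easier said than done", the sketch below groups attacking source addresses into /24 networks and counts them. It is an illustration only, not a tool we ran during the incident: the function name `aggregate_sources` and the choice of a /24 prefix are assumptions, and the addresses in the usage example are documentation addresses, not attack data.

```python
import ipaddress
from collections import Counter


def aggregate_sources(addresses, prefix_len=24):
    """Count how many source addresses fall into each enclosing network.

    Each address is widened to the network of the given prefix length,
    so the result shows how many distinct block rules would be needed
    if sources were filtered per network instead of per address.
    """
    counts = Counter()
    for addr in addresses:
        # strict=False lets us pass a host address and get its network.
        network = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        counts[network] += 1
    return counts


# Usage with placeholder (documentation-range) addresses:
sample = ["192.0.2.1", "192.0.2.77", "198.51.100.5"]
by_network = aggregate_sources(sample)
for network, hits in by_network.most_common():
    print(network, hits)
```

If tens of thousands of sources collapse into only slightly fewer networks, per-source blocking needs a very large rule set, which is part of why carriers tend to fall back on null routing the destination.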

  3. What about AnyCast?


    AnyCast would have allowed us to pin the influx of traffic on a single site, and perhaps bring up a new nameserver IP address elsewhere and swing our NS records around to recover. Unfortunately, the propagation delay involved in the NS record swap would likely have lasted longer than the outage itself. Regardless of how useful AnyCast actually is, we acknowledge that the market perceives AnyCast as a benefit in such conditions.


    AnyCast also has performance benefits. With AnyCast, the nameservers used to resolve queries can be limited to those topologically nearest to the client. By virtue of the BGP route selection algorithm, a client in Europe would be directed to nameservers on 4 unique AnyCast strings at different ISPs in the EU, while a client in New Zealand would be directed to nameservers on the same AnyCast strings in Asia.


    In conclusion, we are considering plans to launch new nameservers on at least two AnyCast strings by the end of the year.

  4. What can I do to improve my DNS reliability now?

    Many of our customers are astutely using multiple DNS providers to cover their DNS infrastructure. We use multiple ISPs to provide redundant connectivity at all of our data centers, so why not treat your DNS infrastructure the same way? See our new FAQ article, which covers how to set up Zerigo as a master DNS service to be used in concert with slave services from DME or other providers.
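As a rough way to check whether a zone's delegation actually spans more than one provider, the sketch below groups a list of NS hostnames by their parent domain. It is a simplified illustration: the function names are hypothetical, the hostnames in the usage example are placeholders rather than real nameservers, and treating the last two labels as "the provider" is a naive assumption that breaks for suffixes such as `co.uk`.

```python
def provider_of(ns_host: str) -> str:
    """Return a naive provider key: the last two labels of the hostname."""
    labels = ns_host.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def distinct_providers(ns_hosts):
    """Return the set of provider keys seen across all NS hostnames."""
    return {provider_of(host) for host in ns_hosts}


def has_provider_redundancy(ns_hosts) -> bool:
    """True if the nameservers belong to at least two different providers."""
    return len(distinct_providers(ns_hosts)) >= 2


# Usage with placeholder hostnames (not real nameservers):
single = ["a.ns.example.net.", "b.ns.example.net."]
mixed = ["a.ns.example.net.", "ns1.example.org."]
print(has_provider_redundancy(single))
print(has_provider_redundancy(mixed))
```

The NS list itself would come from your registrar or from a DNS lookup of the zone; that step is left out here so the example stays self-contained.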

Comments

On July 31, 2012 at 03:30, Zane Lucas said:

No, you did not do half of what you are supposed to do. As a perceived global failsafe DNS provider you were sloppy at your work and left your customers offline for A WHOLE DAY. This incident was not resolved in a timely manner and there was not adequate notification to your customers. Additionally, Trustico received a $9 credit from Zerigo for the downtime incurred – are you joking? Zerigo have some serious learning to do.

On July 31, 2012 at 19:31, Naeem Taj said:

Yes, I agree with you Zane, poor job on Zerigo’s side. We got $5.00, what a joke. We will be moving away from them…let’s take them out of business, maybe then they will value their customers!

On August 2, 2012 at 17:04, Brad Folkens said:

I agree too. Communication was horrible, the refund was a joke, and on top of all of that, I was just charged for another year’s worth of service even though I cancelled the day it automatically renewed.

Horrible customer service and communication, terrible experience overall.

On August 3, 2012 at 07:09, Bob said:

Zerigo have failed pretty embarrassingly, but let’s hope that this teaches them a lesson and they deliver in the future.

On August 31, 2012 at 03:20, Jason said:

I think the biggest thing that made me leave Zerigo is that there was never any kind of apology or mea culpa. Many large companies depend on Zerigo, and when people rely on you, you are responsible regardless of what happens. I have yet to see anything that resembles an apology. Most of it sounds like excuses about how they were “attacked” and somehow that makes them not responsible to the people that depend on them. Before this incident I had a great opinion of Zerigo, but now I have lost the respect I had for their service.

And yeah, the refund for downtime was a joke as well. All of Zerigo’s reactions to the incident have been infuriating.

On September 10, 2012 at 20:59, Colorado Colocation said:

DDoS attacks have been around for a while, and there isn’t really anything one can do to prevent them. Suggesting a backup DNS provider isn’t going to appeal to very many people, however, so let’s just hope policies improve for Zerigo.

On September 23, 2012 at 07:31, ducsu said:

DDoS attacks can happen to any business, big or small. That is just how it is. I am sure we all know a similar attack happened a few months ago with a well-known DNS provider. Their attack brought down hundreds of sites for almost a week, if not more. Even they did not formally give an apology. I am not saying it is OK. It is just bad business practice. As customers, we have the option to go with whoever we choose. Who is to say that that company is better than this one? It is a risk we take. Things happen.

On November 19, 2012 at 10:53, Cheap SSL Certificates said:

In my view, if you want to make sure your DNS is reliable, move to a dedicated server, which will give your website 100% uptime.

On August 2, 2013 at 10:19, SEO Analytics said:

We’ve managed our own geoDNS and traffic shaping for about 7 years now, and we handle over 15 million queries a day. It doesn’t make us experts, but in all that time we’ve never had a problem like this. We’ve implemented DDoS measures, which isn’t too hard to do with BIND and load balancing.