Life with Zerigo DNS after July 22nd
On July 22nd we found ourselves on the receiving end of a massive stream of DNS replies generated by a DNS amplification attack significantly larger than those we had been subjected to in the past. The flood of packets overwhelmed both our DNS server infrastructure and our ISP transit networks at SoftLayer and Linode. We had successfully blocked smaller attacks before, but this attack was so large that alarms at our ISPs went off in concert: null routes were placed, the attack was blocked, and most of our nameservers, on different continents, in different countries, in different data centers and at different ISPs, went down at once!
We spent a few hours characterizing the attack. We let people know what was happening and set out to block the influx in a way that would allow us to continue to serve DNS requests. Eventually we found a way to filter the attack very specifically and resumed service.
What are you doing to make sure this won’t affect service in the future?
- Increase capacity
Our immediate plans are to increase capacity to the point where we could ride out a similar attack with no impact on our customer DNS services. The big problem during the characterization stage was that our transit links were saturated, making direct sampling of the traffic very difficult. We are currently in discussions with providers in Asia, Europe and North America to substantially increase our IP transit capacity in each location.
- Negotiate DDoS mitigation policies
We will negotiate new, more favorable policies with all of our existing carriers covering how their dynamic DDoS mitigation systems respond to attacks. Our preference is that these systems be used to block the source of the DDoS stream and not the destination! Obviously this is easier said than done, considering this attack appeared to come from over 38,000 unique source IP addresses (mostly in Europe), but we feel basic policies could have prevented these systems from disabling multiple nameservers around the world at exactly the same time. We will absolutely insist that any null route placed by a carrier in relation to our service be removed at our request, provided we have established filters that reasonably block the attack which caused the null route to be entered. We cannot be put in a position where we must wait hours for a null route to be lifted while our infrastructure remains inaccessible.
The fact that the DDoS policies with the carriers had not been specifically negotiated opened the door for nameservers at multiple sites in the United States and the UK to become simultaneously disabled. By contrast, our nameservers in Asia & Europe remained viable for most of the outage specifically because our carrier in those locations (Voxel) did not null route our traffic.
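To illustrate why blocking by source is hard with tens of thousands of senders, here is a minimal Python sketch of how sampled source addresses might be summarized. It is not the procedure we used during the attack. The function name `summarize_sources`, the /24 grouping and the sample addresses (taken from the reserved documentation ranges) are all assumptions made for illustration.

```python
import ipaddress
from collections import Counter


def summarize_sources(sampled_ips, prefix_len=24):
    """Return (unique source count, [(covering prefix, unique sources), ...]).

    Hypothetical helper: groups unique IPv4 sources by their covering
    prefix so that heavily represented networks stand out.
    """
    unique = {ipaddress.ip_address(ip) for ip in sampled_ips}
    by_prefix = Counter(
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        for ip in unique
    )
    # most_common() sorts prefixes by how many unique sources they contain
    return len(unique), by_prefix.most_common()


# Illustrative sample only; these are documentation addresses, not attack data.
sample = ["192.0.2.10", "192.0.2.11", "192.0.2.10", "198.51.100.7"]
count, prefixes = summarize_sources(sample)
for network, sources in prefixes:
    print(network, sources)
```

If the unique sources collapse into a handful of prefixes, a source-based filter may be practical; if they are spread thinly across thousands of networks, filtering on some other property of the traffic is likely the only workable option.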
- What about AnyCast?
AnyCast would have allowed us to pin the influx of traffic on a single site, and perhaps bring up a new nameserver IP address elsewhere and swing our NS records around to recover. Unfortunately, the propagation delay involved in the NS record swap would likely have lasted longer than the outage itself. Regardless of how useful AnyCast actually is, we acknowledge that the market perceives AnyCast as a benefit in such conditions.
AnyCast also has performance benefits. With AnyCast, the nameservers used to resolve queries can be limited to the set that is topologically nearest to the client. By virtue of the BGP route selection algorithm, a client in Europe would be directed to nameservers on four unique AnyCast strings at different ISPs in the EU, while a client in New Zealand would be directed to nameservers on the same AnyCast strings in Asia.
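The "topologically nearest" behavior can be pictured with a deliberately simplified Python sketch. It models only one step of BGP route selection, preferring the shortest AS path; real routers weigh other attributes (such as local preference) before path length. The site names, the AS numbers (from the range reserved for documentation) and the function `nearest_site` are hypothetical.

```python
def nearest_site(paths):
    """Pick the site whose advertised AS path is shortest.

    `paths` maps a site name to the AS path a client's network sees for
    the shared AnyCast prefix. Ties are broken by site name so that the
    result is deterministic.
    """
    return min(paths, key=lambda site: (len(paths[site]), site))


# Hypothetical view from a European client's network
europe_client = {
    "eu": [64496, 64497],
    "asia": [64496, 64500, 64501, 64502],
}

# Hypothetical view from a New Zealand client's network
nz_client = {
    "eu": [64510, 64505, 64504, 64497],
    "asia": [64510, 64502],
}

print(nearest_site(europe_client))
print(nearest_site(nz_client))
```

Every site announces the same address, so the client never chooses a nameserver explicitly; the routing system delivers each query to whichever announcement looks closest from where the client sits.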
In conclusion, we are considering plans to launch new nameservers on at least two AnyCast strings by the end of the year.
What can I do to improve my DNS reliability now?
Many of our customers are astutely using multiple DNS providers to cover their DNS infrastructure. We use multiple ISPs to provide redundant connectivity at all of our data centers, so why not treat your DNS infrastructure the same way? See our new FAQ article, which covers how to set up Zerigo as a master DNS service to be used in concert with slave services from DME or other providers.
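As a quick sanity check on your own setup, the following Python sketch tests whether a zone's NS set spans more than one provider. It is only a rough illustration: the hostnames are made up, the function names are hypothetical, and the "provider" is naively taken to be the last two labels of each nameserver hostname, which will misjudge names under suffixes such as co.uk.

```python
def providers_for(nameservers):
    """Group nameserver hostnames by a naive provider key (last two labels)."""
    groups = {}
    for host in nameservers:
        labels = host.rstrip(".").lower().split(".")
        provider = ".".join(labels[-2:])
        groups.setdefault(provider, []).append(host)
    return groups


def has_provider_redundancy(nameservers, minimum=2):
    """True when the NS set is spread over at least `minimum` providers."""
    return len(providers_for(nameservers)) >= minimum


# Hypothetical NS set: a master at one provider, slaves at another
ns_records = [
    "ns1.provider-a.example.",
    "ns2.provider-a.example.",
    "ns1.provider-b.example.",
]
print(providers_for(ns_records))
print(has_provider_redundancy(ns_records))
```

A zone whose nameservers all sit with a single provider would fail this check, which is the situation the multi-provider approach is meant to avoid.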