Self-inflicted DNS Outage

Now that I’m back online, I can confess to the self-inflicted DNS outage that took down my web sites and email lists from around noon on Saturday until now (around 1900 on Tuesday).

First off: I am hosting my domains at 1&1. I cannot recommend them and will be moving all my domains away from them shortly.

The sequence of events is kind of messy, but I’ll try to lay it out here as it happened.

I host my own DNS on a server in my basement. This server is the master for the domain zone file. There are three secondaries configured for the domain: one at the company where I used to work and two hosted by a friend.

Since I changed ISPs (from CenturyLink, née Qwest, to US Internet Fiber), I had to change my IP addresses. This meant that I needed to contact the admins for the secondary servers and have them update their configs to perform transfers from the new IP address. It also meant that the IP address for my name server would be changing.

This should not be a problem, except for two issues:

1. When you run a name server that lives in the same domain that it is serving (i.e. my domain is anansi-web.com and my name server is ns1.anansi-web.com), you have to set up something called a “glue record” at the registrar. This solves the chicken-and-egg problem of trying to look up the address of the name server for the domain when that name is in the domain itself. I was under the impression that I had done this in the past at 1&1, but when I tried to figure out how to change it, the web site told me they don’t support glue records. WTF. This was the primary trigger for the steps I took that caused the extended outage.

2. The company that I used to work for told me that they would like to stop hosting DNS as a secondary for me. I have no issue with this; it’s understandable. But that means I need to remove their DNS server from all my domains. Again, not an issue, but still a few more changes to make.
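
To make the glue-record problem from the first item concrete, here is a minimal Python sketch of the check that decides whether a name server needs glue: it does when its host name sits inside the zone it serves. This is only an illustration, not something 1&1 or any registrar provides. The function name needs_glue and the Route 53-style host name in the usage lines are made up for the example.

```python
def needs_glue(zone: str, nameserver: str) -> bool:
    """Return True if `nameserver` is at or below `zone`.

    A name server inside the zone it serves cannot be found by asking
    that zone, so the parent zone has to publish its address (glue).
    """
    # Compare whole labels, case-insensitively, ignoring a trailing dot.
    zone_labels = zone.lower().rstrip(".").split(".")
    ns_labels = nameserver.lower().rstrip(".").split(".")
    # The name server is in the zone if its rightmost labels are the zone name.
    return ns_labels[-len(zone_labels):] == zone_labels


# My own name server lives inside the zone, so it needs glue.
print(needs_glue("anansi-web.com", "ns1.anansi-web.com"))
# A hosted name server outside the zone (hypothetical name) does not.
print(needs_glue("anansi-web.com", "ns-0000.awsdns-00.com"))
```

Comparing label by label rather than with a plain string suffix test keeps a look-alike name such as ns1.evil-anansi-web.com from being treated as part of the zone.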

I decided that since I could not set up a new glue record, I would just move the DNS hosting to the Amazon Web Services Route 53 service. It’s $0.50/month per domain hosted there, and I figure that is worth the price so I don’t have to mess around with glue records and the like. So I set up anansi-web.com at Route 53 and it was working great.

In addition, I told the friend hosting the other secondaries for me that they could stop doing so, since I was moving to Route 53.

I also decided (at the same time) that since I was annoyed at 1&1, I would start the transfer of my domain from 1&1 back to GoDaddy. I know, I know. I moved off GoDaddy to 1&1 in protest of some dumb stuff GoDaddy was doing, but I know their registrar stuff works okay, and they are cheap.

This last change was the root cause of the extended outage.

When you initiate a domain transfer from 1&1, they cease all updates to the domain records. Which means that when I changed the name servers for the domain to point at Route 53, the change never went through.

When I first contacted 1&1 about the issue, they told me that it can take 24 to 48 hours for the changes to go through, so I should just wait. Which used to be true, back in 2001. But these days it is usually about 15-30 minutes before the updates hit the .com TLD servers.

When I contacted them again today, after 48 hours had passed, that’s when they decided to tell me that the changes would not go through since I had initiated a transfer for the domain.

The transfer email from 1&1 stated that the transfer would be completed on 2013-11-28 19:35:40. That’s still two days from now!

So the situation stood thus:

  1. I can’t change the name servers at 1&1 to point to Route 53.
  2. The secondary at the company I used to work for is still listed as a name server on my domain, but the IP address for my master server has changed and the secondary will not pull the zone from it. And I can’t remove it at 1&1, which means it’s still handing out old information.
  3. The other two secondaries are turned off.

I’m pretty much dead in the water. What to do?

Today I figured it out.

  1. Since I have not yet turned off the old ISP, I can send an update to the old secondary from the original IP! So I set up BIND on my Ubuntu laptop, loaded the updated zone file with the new IP addresses, plugged the laptop into the router for the old ISP, configured the laptop IP address to be the same as the old DNS server address, and sent out a notify. Then I watched the logs while the old secondary pulled the zone file! Eureka!
  2. I asked my friend to set up the other two old secondaries for me again. Luckily he had just commented them out in his config files (smart man). So I updated the zone file on my server and they updated too.
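
A note on why the notify in step 1 was enough: as far as I understand the protocol, a secondary that receives a NOTIFY checks the master’s SOA serial and only transfers the zone if that serial is newer than its own copy, using the wraparound comparison from RFC 1982. Here is a rough Python sketch of that comparison. The function name and the serial numbers are made up for illustration; they are not the real serials from my zone.

```python
HALF_RANGE = 2 ** 31  # half of the 32-bit SOA serial space


def serial_is_newer(master: int, secondary: int) -> bool:
    """RFC 1982 style comparison of two 32-bit SOA serials.

    Returns True when the master's serial counts as newer than the
    secondary's, including the case where the serial has wrapped around.
    """
    if master == secondary:
        return False
    if master > secondary:
        # Plainly larger, and not so far ahead that it reads as a wrap.
        return master - secondary < HALF_RANGE
    # Numerically smaller only counts as newer if the serial wrapped.
    return secondary - master > HALF_RANGE


# Hypothetical date-style serials: the updated zone is newer, so transfer.
print(serial_is_newer(2013112601, 2013112301))
# Same serial on both sides: nothing to transfer.
print(serial_is_newer(2013112601, 2013112601))
```

The practical upshot is that the zone file loaded on the laptop has to carry a higher serial than the copy the secondary already holds, or the notify gets acknowledged and nothing is transferred.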

Now we are back in business!

On Thursday, when the domain transfers to GoDaddy, we should see no blip. I believe that I have it set up to just start pointing to the Route 53 DNS servers, so it should just work. I will be watching though, so if there is an issue it should be an easy fix.

What did I learn from this fun excursion?

Well, it’s a lesson that I seem to keep forgetting: Only make one change at a time!

If I had left well enough alone and not started the transfer to GoDaddy, then I could have changed the DNS servers to point to Route 53 and there would have been minimal downtime. But no, I had to make multiple changes at the same time, and that always causes trouble.

So to the people who use my mailing lists, I’m sorry. This was entirely self-inflicted, and I’m annoyed with myself for causing such a long outage.