image of damage from a wind storm

Yesterday’s website outage

We had a windstorm here in Castlegar yesterday (August 9th), which led to a three-hour power outage between 5:00 PM and 8:00 PM Pacific time. Three hours is far longer than my aging UPS can keep the blog servers and network paraphernalia here running, so both of my blogs went offline.

But a bit of a comedy of errors resulted in the outage being extended by another two or three hours. This post is about those problems and how I fixed them.

Enjoying the unpowered silence

It was a bit of a strange evening as the power outage dragged on. Irene and I sat on our porch, listening to the birds and the occasional roaring of the wind through our trees.

Once the wind died down, what I really noticed was the silence. We live in an acreage subdivision so it is usually fairly quiet, but there are lots of smaller background noises from various powered devices that I hardly even notice normally. Without electricity the silence becomes very apparent.

Our talk turned to possibly going into town for supper, but as the sun began to set Irene bailed on that idea and went to have a nap. I poured myself a wee dram of scotch and let the darkness settle. Just as I finished my drink, around 8:20 PM, the power was restored.

The power comes back on

Irene woke up with the restoration of the power, and immediately wanted to revert to our usual evening hours of TV watching. Unfortunately, the network connectivity we use for our streaming services was not quite operational.

It took me a few minutes to get network services restored: my UPS had gone completely offline and required a reboot before all the attached gear powered up. This was quick and easy, and streaming access was restored!

Unfortunately, my attempts to interact with my blogs met with failure. I had to delay resolving this issue until after Irene and I had watched our fill of programs, though: the wife’s wants and needs come before restoring my blogs. Priorities!

SSL breakage and the ACME certificate manager errors

Once Irene had her fill of TV time, she went off to bed. This freed me up to investigate why my blogs were still offline. I tried connecting to both my sites with Safari, and kept getting ‘cannot connect’ messages that my brain refused to parse properly.

Can’t open the page… reading comprehension failed me for several minutes while looking at messages like this

Finally I grasped the message: it wasn’t saying it couldn’t connect, it was saying it couldn’t connect securely. So I tried the same thing with Chrome, which helpfully explained that the SSL certificate was invalid and gave me the option to be dangerous and go to the site anyway.

<sarcasm>Thanks, Safari, for protecting me so thoroughly! </sarcasm>

Armed with the knowledge that my SSL configuration was ‘invalid’, I now had a better idea of where to look. My site SSL certificates are provided by LetsEncrypt and managed by my pfSense firewall. Something in that configuration was not working as intended.

The basic process is that LetsEncrypt certificates expire after 90 days, so the pfSense firewall uses a protocol called ACME (Automatic Certificate Management Environment) to renew these certificates before they expire. This has been working without issue for me for a couple of years.

LetsEncrypt has several challenge types it can use to confirm that the request for the certificate renewal is ‘legitimate’. I use the DNS mechanism, and this requires a bit of special configuration with my DNS provider (Cloudflare) so ACME can do the necessary things to the DNS record to tell LetsEncrypt that my request is valid.
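For the curious, here is a rough Python sketch of what "the necessary things" amount to for the DNS challenge, as I understand the ACME specification (RFC 8555). The ACME client publishes a TXT record under `_acme-challenge.<domain>` whose value is derived from a token issued by the certificate authority and a thumbprint of the ACME account key. The function names and the sample token and thumbprint are made up for illustration; this is not what the pfSense ACME package actually runs.

```python
import base64
import hashlib


def b64url(data: bytes) -> str:
    """Base64url-encode without the trailing '=' padding, as ACME expects."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def dns01_record(domain: str, token: str, account_thumbprint: str) -> tuple[str, str]:
    """Return the (record name, TXT value) pair for a DNS challenge.

    token: challenge token issued by the certificate authority.
    account_thumbprint: base64url thumbprint of the ACME account key.
    Both arguments are placeholders here; a real client gets them from
    the ACME server and from its own account key.
    """
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return f"_acme-challenge.{domain}", b64url(digest)


# Hypothetical values, for illustration only.
name, value = dns01_record("example.com", "sample-token", "sample-thumbprint")
print(name, value)
```

The ACME client needs API access to the DNS provider so it can create this record, wait for the certificate authority to look it up, and then remove it again. That API access is the "special configuration" part.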

It took me some fiddling to figure out that my continuing site outage was due to a perfect storm (see what I did there?) of failures:

  1. ACME had been trying and failing to renew my SSL certificates for several days, apparently without notifying me
  2. My LetsEncrypt certificates had expired sometime on August 9th in the midst of the power outage
  3. I had changed the ACME configuration several months ago (while my LetsEncrypt certificates were still valid) to include a certificate for my second blog (Geek on a Harley). When I did this I neglected to test renewal, and actually Cloudflare was rejecting the configuration I was using now that it included two different DNS entries
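The first two failures in the list above are the kind of thing an independent check could catch. Below is a minimal Python sketch, using only the standard library, that reports how many days a site's certificate has left. It is not something I had in place: the 14-day threshold and the function names are assumptions for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone

RENEWAL_WARNING_DAYS = 14  # hypothetical threshold, not from my actual setup


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    for example "Aug  9 00:00:00 2030 GMT".
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def fetch_not_after(host: str, port: int = 443) -> str:
    """Return the notAfter field of the certificate served by `host`.

    An already-expired certificate fails verification during the TLS
    handshake, so this raises ssl.SSLCertVerificationError in that case.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(host: str) -> str:
    """Summarize the certificate status of `host` in one line."""
    try:
        not_after = fetch_not_after(host)
    except ssl.SSLCertVerificationError as exc:
        return f"{host}: certificate failed verification ({exc.verify_message})"
    remaining = days_until_expiry(not_after, datetime.now(timezone.utc))
    if remaining < RENEWAL_WARNING_DAYS:
        return f"{host}: certificate expires in {remaining} days, renewal may be failing"
    return f"{host}: OK, {remaining} days left"
```

Running something like `print(check("example.com"))` from a daily scheduled job, with the real hostnames substituted, would give a few days of warning that renewals have quietly stopped working.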

The fix was non-obvious but very simple: remove one property from the request ACME was using when connecting to Cloudflare to interact with my DNS entries. I determined which property to remove through trial and error: sometimes that’s the fastest way.

Conclusions

I now know a bit more about how ACME, LetsEncrypt, and my DNS provider (Cloudflare) interact. These are things I probably should have known earlier.

I am also thinking I need a bigger/healthier UPS. I doubt that I could keep things running for three hours, but it might be nice to get closer to a couple of hours of power outage ‘safety’ for my home network components.

And finally: keeping my wife happy is more important than keeping my blogs at 100% uptime. Sorry, but that’s just how the priorities stand in my life 🙂

5 thoughts on “Yesterday’s website outage”

  1. #1. Your website was down?
    #2. Your blog is excessively complex, but you do you 😉
    #3. Get whole home backup. You’ll need the big battery when you put solar panels on your roof anyway. Heck even more I’ve been considering a generator/big battery what with our grid keeps kicking out anytime it’s hot or cold.

  2. Good to see you here, Chris!

    #1. All three people who visit my blog were outraged! 😉

    #2. A lot of the complexity of my network and server design comes from a desire to secure our home network, including the blog, and to automate it to avoid me having to remember to do things to keep it secure. But yeah, of course my home network / blogging infrastructure is complex: otherwise, how could I possibly procrastinate about writing any blog posts?

    #3. Whole home backup is something I want, one day, if and when we get solar. But given that that would cost more than a new car I’m unlikely to go that far just to keep my blog running during our approximately twice yearly power interruptions.

  3. The question mark was supposed to be a grinning emoji. Having a site that doesn’t recognize android emoji speak and forcing the use of fully formatted grammatical sentences is clearly a gen x passive aggressive power move designed to exclude Gen z persons.

    I think I approve.

  4. “Android emojis”: sorry, only a few simple emoticons are spoken here 😉

    You might be able to use the emoji picker on Windows and macOS per this doc=> https://wordpress.com/support/emoji/

    P.S.: the ‘reply’ link for nesting comments is currently not working. I have no idea why: I will blame the power failure as an easy excuse.
