The little server in my basement that runs all my blogs had been down since February 8th. I finally got it running again today and, after correcting a few hiccups, the blogs came back online as of February 12th at 10:15 PM Pacific.

What went wrong? And why did it take so long to repair? Well… that’s a bit of an interesting story, but only if you are technically minded. The short, non-technical answer: there was a problem with the storage, and I fixed it by replacing the failing part.

Read on for the longer and more complicated answer…

What went wrong?

The server in question: a ‘Next Unit of Computing’ (NUC), about the size of my hand

I periodically perform a full Fedora Linux operating system upgrade (as opposed to an update) on the blog server to keep up with the current releases. So on Saturday I performed an upgrade from Fedora release 40 to release 41, and that’s where the problems started.
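For reference, the upgrade itself is the standard dnf system-upgrade sequence, roughly this (the release number is the only thing that changes each time):

    # Bring the current release fully up to date first
    sudo dnf upgrade --refresh

    # Make sure the system-upgrade plugin is installed
    sudo dnf install dnf-plugin-system-upgrade

    # Download all the packages for the new release, then reboot
    # into the offline upgrade step
    sudo dnf system-upgrade download --releasever=41
    sudo dnf system-upgrade reboot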

The last stage of the upgrade restarts the server and completes installation of the OS kernel. The server failed to come back up after this restart, which kicked off a whole bunch of challenges. I was unable to connect remotely and, since the little Linux box normally runs ‘headless’ (i.e., no display, keyboard, or mouse), I had no way to view the error messages. So step number one was scrounging a display and input devices.

Once I had a monitor attached I found a completely blank display with a flashing cursor in the upper left. Normally Linux lets you switch to a full error display with the ESC key, but that didn’t seem to be working. A bit of fiddling got me to the GRUB menu, where I interrupted the normal boot, edited the boot command, and removed the ‘quiet’ setting that I suspected was suppressing useful messages.
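For anyone who hasn’t fiddled with GRUB before: highlight the boot entry, press ‘e’ to edit it, find the line that begins with ‘linux’, and remove the options that hide boot output before continuing the boot with Ctrl-x. Something like this, where the kernel version and root device shown are illustrative:

    # Kernel line in the GRUB editor, as found:
    linux ($root)/vmlinuz-6.11.4-301.fc41.x86_64 root=/dev/mapper/fedora-root ro rhgb quiet

    # Delete 'rhgb' and 'quiet' so kernel messages reach the console:
    linux ($root)/vmlinuz-6.11.4-301.fc41.x86_64 root=/dev/mapper/fedora-root ro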

Booting with ‘quiet’ off produced some interesting errors:

NVMe is storage… that doesn’t look so good

The NVMe device is an NVMe M.2 high-performance SSD: it is basically the ‘hard disk’ for the computer. The error implied that it wasn’t working, which would be Bad News™. So I tried the easy thing first: switching back to an older Fedora kernel. I tried the Fedora 40 kernel (which had been working) and also the 39 kernel: the same error occurred. This didn’t guarantee that the problem was hardware: switching back to an older kernel doesn’t ‘roll back’ all of the non-kernel OS components, such as device drivers. But it was making me suspicious.

I needed more tools to debug the problem, so I created a Fedora release 41 ‘Live’ boot device, i.e., a ‘thumb’ drive. This kicked off a whole series of comic moments: finding one of my good 64 GB thumb drives, getting Fedora Media Writer to work on macOS (they don’t sign the thing, and the default image is for Intel processors). But I got it done.
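(An aside for fellow Mac users: when Fedora Media Writer is being uncooperative, plain dd from Terminal also works. A sketch, with the disk number and ISO name as placeholders; triple-check the disk number, since dd will happily overwrite the wrong device:)

    # Find the thumb drive's disk number
    diskutil list

    # Unmount it and write the image
    diskutil unmountDisk /dev/disk4
    sudo dd if=Fedora-Workstation-Live-x86_64-41-1.4.iso of=/dev/rdisk4 bs=4m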

The boot image worked perfectly, and I was able to mount the LVM volumes that reside on the NVMe device without a single issue. So the NVMe was ‘working’ as a storage device; it just wasn’t booting Linux successfully. Mounting the system volumes let me experiment with the Linux ‘fstab’ configuration on the device, selectively mounting one of the LVM volumes at a time during boot, so that’s what I did.
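Concretely, the live-environment work looked something like the following; the volume group and volume names are illustrative, and ‘noauto’ is one way to keep a given volume from being mounted at boot:

    # Activate the LVM volume groups found on the NVMe, list the volumes
    sudo vgchange -ay
    sudo lvs

    # Mount the root volume to get at its /etc/fstab
    sudo mount /dev/fedora/root /mnt
    sudo vi /mnt/etc/fstab

    # In fstab, adding 'noauto' to the options keeps that volume from
    # mounting at boot, so entries can be toggled one at a time:
    #   /dev/mapper/fedora-var  /var  xfs  defaults,noauto  0 0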

Boot was ‘successful’ (for a very limited definition of success) with every LVM except the largest one: /var, which is where all of the core services live, such as the database and the HTTP server. So although I could boot the server with the NVMe, I couldn’t really do anything.

However, the storage was all there when I booted from the USB and mounted the volumes. So I started a backup to get the latest files somewhere safe, and while that ran I did some digging.
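The backup was nothing exotic; an rsync along these lines does the job (the hostname and paths here are placeholders):

    # Copy everything from the mounted volumes to another machine,
    # preserving hard links, ACLs, and extended attributes
    rsync -aHAX /mnt/ backuphost:/srv/nuc-backup/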

The hypothesis

By this point in the process it was later in the day on Sunday and I’d had about 4 hours of sleep. But I found a few things to try and, when the backup was complete, I tried them.

The first things I found were general reports of boot problems after upgrading from Fedora 40 to 41. These were intriguing, but none was an exact match for my issue.

Then I found a number of posts relating to NVMe devices and Fedora 41: this one is an example. I also found posts regarding NVMe timing issues during boot, so I experimented with changing ‘fstab’ to add various delay mechanisms using options such as ‘_netdev,x-systemd.automount‘, which basically tell the OS to delay activating the device until network services are up and to mount the volume only when it is first accessed rather than immediately. I also changed the device references to use UUIDs instead of device names, on the off chance that might make a difference.
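Put together, one of the experimental entries looked roughly like this; the UUID and filesystem type below are placeholders:

    # Get the volume's UUID for the fstab entry
    sudo blkid /dev/mapper/fedora-var

    # /etc/fstab: reference the volume by UUID, wait for the network,
    # and mount on first access instead of at boot
    UUID=3f1d9c2a-...  /var  xfs  _netdev,x-systemd.automount  0 0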

None of these changes solved the problem. I did find a few vague references to failures of the older Western Digital NVMe controllers that made me start thinking more in terms of hardware failure instead of just simple timing issues.

This led me to the following:

Hypothesis: the Western Digital NVMe (WDS512G1X0C-00ENX0, circa 2017) has entered a partial failure state

State of the art… in 2017

Possible triggering events:

  • OS compatibility: the device is incompatible with newer OS versions. Upgrading the OS ‘caused’ the failure to manifest
  • Physical failure: elements of the device have physically failed. Restarting the computer ‘caused’ the failure to manifest (see the sketch below)
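One way to look for evidence of the physical-failure case is the drive’s own health counters. With the smartmontools and nvme-cli packages installed, and assuming the drive shows up as nvme0, something like:

    # NVMe health log: media errors, available spare, and percentage
    # used are the fields of interest on a suspect drive
    sudo nvme smart-log /dev/nvme0

    # smartctl gives a similar summary
    sudo smartctl -a /dev/nvme0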

Testing the hypothesis: buy new hardware

I ordered a replacement NVMe after doing a little checking to see if there was a ‘best’ option. I read good things about the Samsung 990 Pro, so that’s what I went with. I also ordered a little gadget to perform a low-level ‘clone’ of the old NVMe SSD to the new one.

The new NVMe M.2 SSD: Endless victory!
The NVMe copying device. I find the name to be rather comical

Delivery was pretty fast, but still took a couple of days: that’s where most of my server’s downtime came from. And the whole time I was wondering whether my diagnosis was correct. Would the new SSD fix the issue, or would I simply be adding another bit of data to my debugging?

Both pieces arrived today. I copied the old SSD’s content to the new one, plugged it back in to the computer, and it booted the very first time. My hypothesis was correct: the failure was definitely caused by a hardware problem with the old Western Digital NVMe.

There were, however, a couple of less vexing issues to resolve relating to the Fedora OS upgrade itself before my blogs fully returned to life.

Valkey replaces Redis

The most notable issue I encountered with the upgrade from Fedora 40 to 41: Valkey has replaced Redis. I have been using Redis to enable object caching with WordPress, so the change impacted me. Normally I wouldn’t mind one way or another: Redis is just a tool, and if Valkey works the same way then that’s fine by me. Unfortunately, Valkey didn’t work, at least not to start with.

There were two problems. Firstly, Valkey couldn’t access its logs, which caused it to fail at startup. This had a fairly obvious fix: change the ownership of the logs in /var/log/redis from redis:redis to valkey:valkey, or just delete any existing logs at that location and let Valkey create its own there.
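The fix, in practice:

    # Either hand the old Redis log directory over to the valkey user...
    sudo chown -R valkey:valkey /var/log/redis

    # ...or clear it out and let Valkey create fresh logs, then restart
    sudo rm /var/log/redis/*
    sudo systemctl restart valkey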

The second problem was that the Valkey service wasn’t configured to start automatically, although the service it replaced (Redis) most definitely was. This took me the better part of an hour and several reboots to figure out: sometimes my mind works in mysterious ways.
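Once diagnosed, the actual fix was a one-liner, plus a sanity check:

    # Enable the service at boot and start it immediately
    sudo systemctl enable --now valkey

    # Confirm it is enabled and running
    systemctl is-enabled valkey
    systemctl status valkey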

My three blogs all started up successfully once both of these problems were resolved and, so far as I can tell from quick testing, everything is working fine now. So the Fedora upgrade itself went pretty well: if the hardware hadn’t failed, everything would have gone nicely.

It’s fixed, so… hurray?

I’m fairly sure that the SSD failure was an outright hardware fault rather than an OS incompatibility with outdated kit… but I’m not 100% certain. If I could go back in time and reboot the computer before starting the OS upgrade, I could have known for sure, or at least with more certainty than I have now. I can say that the server had been ‘up’ for over 150 days when I restarted it. That kind of uptime is no big deal for my Linux servers, but the SSD problem could have been a ticking time bomb for months.

The total outage duration was four full days. If that were my only downtime for the year, my sites would have had 98.9% uptime (361 of 365 days). But I break things pretty frequently for a few hours at a stretch to do upgrades and such, so… not exactly high availability, I guess.

My main take-away: 7+ year old hardware is not really trustworthy, even if it has almost no moving parts. As for the rest: I’ve always known that performing upgrades is a great way to enjoy a surprisingly long stay camped out next to the server.
