DevOps FMEA Shortlist

Update June 2019: Added cache server failures.

Failure Modes and Effects Analysis (FMEA) is a formal practice of breaking individual pieces of your system and watching how each failure affects the system as a whole (this is the definition I’ve seen used most in tech). Stop one server in a pool and capacity drops. No problem. Stop one server in a pool and users can’t log in anymore? Problem. FMEA helps ensure that the downstream effects of failures aren’t bigger than you expect.

Real FMEA is a Big Deal and is often out of reach. But even doing small FMEA during routine development can save your life. I love a good list, and over the years I’ve assembled one from real-world failures that had bigger consequences than expected. Testing those failures has saved me many times, so I thought I’d share the list.

Some failures might look the same, like rebooting and terminating, but the differences matter. I’ve seen a lot of infrastructures go down because deployment automation started app processes with nohup and didn’t create an init config to ensure they restarted on reboot.
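
For illustration, here’s a minimal sketch of the init-config approach in Python, assuming a systemd host and root privileges during deploy; the service name, user, and start command are hypothetical placeholders, not a real app.

```python
# Minimal sketch: install a systemd unit during deploy so the app restarts
# on reboot, instead of launching it with `nohup ... &`.
# Assumes a systemd host and root privileges; "myapp", the user, and the
# ExecStart command are hypothetical placeholders.
import subprocess
from pathlib import Path

UNIT = """\
[Unit]
Description=myapp web service
After=network.target

[Service]
User=myapp
ExecStart=/opt/myapp/bin/run
Restart=on-failure

[Install]
WantedBy=multi-user.target
"""

def install_service() -> None:
    # Write the unit file where systemd looks for locally managed units.
    Path("/etc/systemd/system/myapp.service").write_text(UNIT)
    # Pick up the new unit, then start it now and on every boot.
    subprocess.run(["systemctl", "daemon-reload"], check=True)
    subprocess.run(["systemctl", "enable", "--now", "myapp.service"], check=True)

if __name__ == "__main__":
    install_service()
```

The “App host reboots” row in the table below is the test that catches the nohup version.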

| Failure | Expected Outcome |
| --- | --- |
| Clocks are slow (e.g. ntpd died yesterday) | Alerts to operators who can respond |
| Config that prevents the app from starting is deployed | Clear failure message from deploys (but no outage in the running system) |
| External service goes down (e.g. PyPI) | Clear failure message from deploys (but no outage in the running system) |
| Database is out of space | Alerts to operators who can respond |
| App process is killed by the Linux OOM killer | Alerts to operators and a short drop in capacity that recovers automatically |
| That singleton host you haven’t gotten rid of terminates | Brief outage that recovers automatically |
| App host reboots | Short drop in capacity that recovers automatically |
| App host terminates | Short drop in capacity that recovers automatically |
| App service stops on a host | Short drop in capacity that recovers automatically |
| All app hosts terminate | Brief outage that recovers automatically |
| Worker host reboots | Short spike in queue length |
| Worker host terminates | Short spike in queue length |
| Worker service stops on a host | Short spike in queue length |
| All worker hosts terminate | Slightly longer spike in queue length |
| Cache host reboots | Temporarily increased response times, some users are logged out |
| Cache host terminates | Temporarily increased response times, some users are logged out |
| Cache service stops on a host | Temporarily increased response times, some users are logged out |

There’s a lot more you’d have to test to fully exercise your system; this is just a lightweight list of the failures I’ve found most valuable to test. I usually don’t run every simulation when developing every feature, just the ones closely connected to what I’m building. It’s a good practice to simulate several failures at once. What happens if an external service like PyPI is down and a host is terminated?
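
As a rough illustration, here’s one way a combined simulation could be scripted on AWS with boto3, assuming autoscaled pools and a user-facing health endpoint; the instance IDs, region, cache service name, and URL are hypothetical, and simulating the external-service outage itself is left to your environment.

```python
# Rough sketch of a combined failure simulation on AWS with boto3:
# terminate one app host while stopping the cache service on another,
# then watch the user-facing health endpoint for the expected outcome.
# Instance IDs, region, service name, and the health URL are hypothetical.
import time

import boto3
import requests

ec2 = boto3.client("ec2", region_name="us-east-1")
ssm = boto3.client("ssm", region_name="us-east-1")

APP_HOST = "i-0123456789abcdef0"    # one host from the app pool
CACHE_HOST = "i-0fedcba9876543210"  # one host from the cache pool
HEALTH_URL = "https://example.com/healthcheck"

def simulate() -> None:
    # Failure 1: terminate an app host (autoscaling should replace it).
    ec2.terminate_instances(InstanceIds=[APP_HOST])

    # Failure 2: stop the cache service on another host via SSM Run Command.
    ssm.send_command(
        InstanceIds=[CACHE_HOST],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["sudo systemctl stop memcached"]},
    )

    # Observe: the table above expects slower responses, not an outage.
    for _ in range(30):
        resp = requests.get(HEALTH_URL, timeout=5)
        print(resp.status_code, round(resp.elapsed.total_seconds(), 2))
        time.sleep(10)

if __name__ == "__main__":
    simulate()
```

If things are working, the output should show elevated response times for a while and then recovery, matching the expected outcomes in the table.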

The full details on why each of these is important are out of scope today, but if you’re interested, let me know and I’ll look at covering them in future articles.

If you’re already thinking of modifications you’d need to make for this list to work for you, that’s good! Maybe you don’t depend on time sync (although you likely do; so many base components depend on it that it’s hard to avoid). Maybe you have a legacy queue system with unique failure cases (like that time the instance type changed, which changed the processor count, which made it run out of licenses). Maybe you don’t depend on singletons (huzzah!). Adjust as needed.
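
If you do keep the time-sync row, here’s a minimal sketch of the kind of check an alert could hang off of, using the third-party ntplib package; the reference server and the one-second threshold are assumptions you’d tune for your environment.

```python
# Minimal clock-skew check: compare the local clock against an NTP server
# and exit non-zero if the offset is large, so a monitoring system can alert.
# Requires the third-party ntplib package (pip install ntplib); the server
# and the 1-second threshold are assumptions to tune.
import sys

import ntplib

MAX_OFFSET_SECONDS = 1.0

def main() -> int:
    response = ntplib.NTPClient().request("pool.ntp.org", version=3, timeout=5)
    offset = abs(response.offset)  # seconds the local clock is off by
    print(f"clock offset: {offset:.3f}s")
    return 1 if offset > MAX_OFFSET_SECONDS else 0

if __name__ == "__main__":
    sys.exit(main())
```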

Happy automating!

Adam

If this was helpful and you want to save time by getting “copy and paste” patterns for Cloud DevOps in your inbox, subscribe here.