Update June 2019: Added cache server failures.
Failure Modes and Effects Analysis (FMEA) is a formal practice of breaking individual pieces of your system and watching how each failure affects the system as a whole (this is the definition I’ve seen used most in tech). Stop one server in a pool and capacity drops. No problem. Stop one server in a pool and users can’t log in anymore? Problem. FMEA helps ensure that the downstream effects of failures aren’t bigger than you expect.
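As a toy illustration of that distinction, here is a minimal Python sketch. The `Pool` class, its `login_host` field, and the host names are all hypothetical, invented for this example. A real FMEA breaks real infrastructure rather than a model, but the question being asked has the same shape.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Hypothetical model of a server pool (illustration only)."""
    hosts: set = field(default_factory=set)
    # Assumption: login quietly depends on one specific host (a hidden singleton).
    login_host: str = ""

    def stop(self, host: str) -> None:
        # Simulate the failure: take one host out of the pool.
        self.hosts.discard(host)

    def capacity(self) -> int:
        return len(self.hosts)

    def login_works(self) -> bool:
        return self.login_host in self.hosts


pool = Pool(hosts={"app1", "app2", "app3"}, login_host="app1")

pool.stop("app3")
# Expected effect: capacity drops, logins still work. No problem.
print(pool.capacity(), pool.login_works())

pool.stop("app1")
# Bigger effect than expected: losing one host breaks login. Problem.
print(pool.capacity(), pool.login_works())
```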
Real FMEA is a Big Deal and is often out of reach. But even doing a little FMEA during routine development can save your life. I love a good list, and over the years I’ve assembled one from real-world failures that had bigger consequences than expected. Testing those failures has saved me many times, so I thought I’d share the list.
Some failures might look the same, like rebooting and terminating, but the differences matter. I’ve seen a lot of infrastructures go down because deployment automation started app processes with nohup and didn’t create an init config to ensure they restarted on reboot.
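To make the init config point concrete, here is a minimal sketch of how deployment automation could render one. It assumes systemd is the init system, which the list itself does not require, and the `render_unit` helper, the service name, and the paths are hypothetical. `Restart=always` and `WantedBy=multi-user.target` are standard systemd directives.

```python
def render_unit(name: str, exec_start: str) -> str:
    """Render a minimal systemd unit for a hypothetical app service."""
    return "\n".join([
        "[Unit]",
        f"Description={name}",
        "After=network.target",
        "",
        "[Service]",
        f"ExecStart={exec_start}",
        # Restart the process whenever it exits or is killed.
        "Restart=always",
        "",
        "[Install]",
        # Once the unit is enabled, this starts the service at boot.
        "WantedBy=multi-user.target",
        "",
    ])


# Hypothetical service name and command, for illustration only.
print(render_unit("myapp", "/usr/bin/python3 /opt/myapp/app.py"))
```

The deploy would still need to install the file and enable the unit (for example with `systemctl enable`). A process started with nohup gets neither the restart on failure nor the start on boot.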
| Failure | Expected effect |
|---|---|
| Clocks are slow (e.g. …) | Alerts to operators who can respond |
| Config that prevents the app from starting is deployed | Clear failure message from deploys (but no outage in the running system) |
| External service goes down (e.g. PyPI) | Clear failure message from deploys (but no outage in the running system) |
| Database is out of space | Alerts to operators who can respond |
| App process is killed by the Linux OOM | Alerts to operators and a short drop in capacity that recovers automatically |
| That singleton host you haven’t gotten rid of terminates | Brief outage that recovers automatically |
| App host reboots | Short drop in capacity that recovers automatically |
| App host terminates | Short drop in capacity that recovers automatically |
| App service stops on a host | Short drop in capacity that recovers automatically |
| All app hosts terminate | Brief outage that recovers automatically |
| Worker host reboots | Short spike in queue length |
| Worker host terminates | Short spike in queue length |
| Worker service stops on a host | Short spike in queue length |
| All worker hosts terminate | Slightly longer spike in queue length |
| Cache host reboots | Temporarily increased response times, some users are logged out |
| Cache host terminates | Temporarily increased response times, some users are logged out |
| Cache service stops on a host | Temporarily increased response times, some users are logged out |
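One way to make the list easier to reuse is to keep it as data and compare what you observed against what you expected. This is only a sketch: the `Scenario` type, the `review` helper, and the recorded observations are hypothetical, and in practice the observation comes from actually breaking something and watching your dashboards and alerts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    failure: str
    expected: str


# A few rows from the list above.
SCENARIOS = [
    Scenario("App host reboots", "Short drop in capacity that recovers automatically"),
    Scenario("Worker host terminates", "Short spike in queue length"),
    Scenario("Database is out of space", "Alerts to operators who can respond"),
]


def review(scenarios, observe: Callable[[str], str]) -> list:
    """Return (failure, expected, observed) wherever the two effects differ."""
    surprises = []
    for scenario in scenarios:
        observed = observe(scenario.failure)
        if observed != scenario.expected:
            surprises.append((scenario.failure, scenario.expected, observed))
    return surprises


def recorded_observation(failure: str) -> str:
    # Hypothetical notes from a test session; untested failures are surprises too.
    notes = {
        "App host reboots": "Short drop in capacity that recovers automatically",
        "Worker host terminates": "Queue never drained",
    }
    return notes.get(failure, "not tested")


surprises = review(SCENARIOS, recorded_observation)
for failure, expected, observed in surprises:
    print(f"{failure}: expected {expected!r}, observed {observed!r}")
```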
There’s a lot more you’d have to test to fully exercise your system; this is just a lightweight list of the failures I’ve found most valuable to test. I usually don’t run every simulation when developing every feature, just the ones closely connected to what I’m building. It’s also good practice to simulate several failures at once. What happens if an external service like PyPI is down and a host is terminated?
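If you want a quick checklist of combined failures, the standard library can enumerate the pairs for you. A small sketch follows; the `failure_pairs` helper is hypothetical, and the failure names are just a subset of the list above.

```python
from itertools import combinations

FAILURES = [
    "External service goes down (e.g. PyPI)",
    "App host terminates",
    "Worker host terminates",
    "Cache host terminates",
]


def failure_pairs(failures):
    """Return every unordered pair of distinct failures."""
    return list(combinations(failures, 2))


for first, second in failure_pairs(FAILURES):
    print(f"{first} + {second}")
```

The number of pairs grows quickly as the list gets longer, so it is still worth picking only the combinations closest to what you are building.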
The full details on why each of these is important are out of scope today, but if you’re interested in them, let me know and I’ll look at covering them in future articles.
If you’re already thinking of modifications you’d need to make for this list to work for you, that’s good! Maybe you don’t depend on time sync (although you likely do; so many base components depend on it that it’s hard to avoid). Maybe you have a legacy queue system that has unique failure cases (like that time the instance type changed, which changed the processor count, which made it run out of licenses). Maybe you don’t depend on singletons (huzzah!). Adjust as needed.