DevOps FMEA Shortlist

Failure Modes and Effects Analysis (FMEA) is a formal practice of breaking individual pieces of your system and watching how each failure affects the system as a whole (this is the definition I’ve seen used most in tech). Stop one server in a pool and capacity drops? No problem. Stop one server in a pool and users can’t log in anymore? Problem. FMEA helps ensure that the downstream effects of failures aren’t bigger than you expect.

Real FMEA is a Big Deal and is often out of reach. But even a small amount of FMEA during routine development can save your life. I love a good list, and over the years I’ve assembled one from real-world failures that had bigger consequences than expected. Testing those failures has saved me many times, so I thought I’d share the list.

Some failures might look the same, like rebooting and terminating, but the differences matter. I’ve seen a lot of infrastructures go down because deployment automation started app processes with nohup and didn’t create an init config to ensure they restarted on reboot.

| Failure | Expected Outcome |
| --- | --- |
| Clocks are slow (e.g. ntpd died yesterday) | Alerts to operators who can respond |
| Config that prevents the app from starting is deployed | Clear failure message from deploys (but no outage in the running system) |
| External service goes down (e.g. PyPI) | Clear failure message from deploys (but no outage in the running system) |
| Database is out of space | Alerts to operators who can respond |
| App process is killed by the Linux OOM killer | Alerts to operators and a short drop in capacity that recovers automatically |
| That singleton host you haven’t gotten rid of terminates | Brief outage that recovers automatically |
| App host reboots | Short drop in capacity that recovers automatically |
| App host terminates | Short drop in capacity that recovers automatically |
| App service stops on a host | Short drop in capacity that recovers automatically |
| All app hosts terminate | Brief outage that recovers automatically |
| Worker host reboots | Short spike in queue length |
| Worker host terminates | Short spike in queue length |
| Worker service stops on a host | Short spike in queue length |
| All worker hosts terminate | Slightly longer spike in queue length |

There’s a lot more you’d have to test to fully exercise your system; this is just a lightweight list of the failures I’ve found most valuable to test. I usually don’t run every simulation when developing every feature, just the ones closely connected to what I’m building. It’s also good practice to simulate several failures at once. What happens if an external service like PyPI is down and a host is terminated?
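If you want to be systematic about combined failures, you can generate the pairs instead of picking them by hand. Here’s a minimal Python sketch. The shortened failure names and the failure_pairs helper are made up for illustration; they aren’t part of any tool.

```python
from itertools import combinations

# A few failures from the shortlist, with names shortened for readability.
# This subset is an assumption; swap in whichever failures touch your feature.
FAILURES = [
    "external service down",
    "app host terminates",
    "worker host terminates",
    "database out of space",
]


def failure_pairs(failures):
    """Return every unordered pair of distinct failures to simulate together."""
    return list(combinations(failures, 2))


if __name__ == "__main__":
    for first, second in failure_pairs(FAILURES):
        print(f"simulate: {first} + {second}")
```

Four failures give six pairs, so the list grows quickly. That’s one more reason to stick to the failures closest to what you’re building.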

The full details on why each of these is important are out of scope today, but if you’re interested in them, let me know and I’ll look at covering them in future articles.

If you’re already thinking of modifications you’d need to make for this list to work for you, that’s good! Maybe you don’t depend on time sync (although you likely do; so many base components depend on it that it’s hard to avoid). Maybe you have a legacy queue system that has unique failure cases (like that time the instance type changed, which changed the processor count, which made it run out of licenses). Maybe you don’t depend on singletons (huzzah!). Adjust as needed.

Happy automating!

Adam

3 Tools to Validate CloudFormation

Hello!

Note: If you just want the script and don’t need the background, go to the gist.

If you found this page, SEO means you probably already found the AWS page on validating CloudFormation templates. If you haven’t, read that first. It’s a better starting place.

I run three tools before applying CF templates.

#1 AWS CLI’s validator

This is the native tool. It’s OK. It’s really only a syntax checker; there are plenty of errors you won’t see until you apply a template to a stack. Still, it’s fast and catches some things.

aws cloudformation validate-template --template-body file://./my_template.yaml

Notes:

  • The CLI has to be configured with access keys or it won’t run the validator.
  • If the template is JSON, this will ignore some requirements (e.g. it’ll allow trailing commas). However, the CF service ignores the same things.

#2 Python’s JSON library

Because the AWS CLI validator ignores some JSON requirements, I like to pass JSON templates through Python’s parser to make sure they’re valid. In the past, I’ve had to do things like load and search templates for unused parameters, etc. That’s not ideal but it’s happened a couple times while doing cleanup and refactoring of legacy code. It’s easier if the JSON is valid JSON.

It’s fiddly to run this in a shell script. I do it with a heredoc so I don’t have to write multiple scripts to the filesystem:

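python - <<'END'
import json
# json.load raises an error on invalid JSON (e.g. trailing commas),
# which makes Python exit non-zero so the calling script can stop.
with open('my_template.json') as f:
    json.load(f)
END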
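To see what the strict parse catches, here’s a small sketch. The is_strict_json helper and the two sample template snippets are made up for illustration; the only real API involved is json.loads from the standard library.

```python
import json


def is_strict_json(text):
    """Return True if text parses as strict JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Hypothetical snippets: the first has a trailing comma after the last member.
with_trailing_comma = '{"Resources": {"Bucket": {"Type": "AWS::S3::Bucket"},}}'
without_trailing_comma = '{"Resources": {"Bucket": {"Type": "AWS::S3::Bucket"}}}'

print(is_strict_json(with_trailing_comma))     # False
print(is_strict_json(without_trailing_comma))  # True
```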

Notes:

  • I use Python for this because it’s a dependency of the AWS CLI (at least for version 1, which installs with pip), so I know it’s already installed. You could use jq or another tool, though.
  • I don’t do the YAML equivalent of this because a generic YAML parser errors on CF-specific syntax like !Ref.
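Once a template is valid JSON, searches like the unused-parameter check I mentioned get much easier. Here’s a rough sketch of how that could look, not the code I actually used. The function names and the sample template are hypothetical, and it only understands Ref and simple ${Name} variables inside Fn::Sub, so treat its output as a list of candidates to review rather than a verdict.

```python
import re

# Matches ${Name} but not ${AWS::Region} or ${Resource.Attribute}.
SUB_VARIABLE = re.compile(r"\$\{([A-Za-z0-9]+)\}")


def referenced_names(node, found=None):
    """Collect names used via Ref or Fn::Sub anywhere in the template."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Ref" and isinstance(value, str):
                found.add(value)
                continue
            if key == "Fn::Sub":
                # Fn::Sub takes a string, or a list whose first item is the string.
                text = value[0] if isinstance(value, list) and value else value
                if isinstance(text, str):
                    found.update(SUB_VARIABLE.findall(text))
            referenced_names(value, found)
    elif isinstance(node, list):
        for item in node:
            referenced_names(item, found)
    return found


def unused_parameters(template):
    """Return declared parameter names that are never referenced, sorted."""
    declared = set(template.get("Parameters", {}))
    return sorted(declared - referenced_names(template))


# Hypothetical template, shaped like what json.load returns for a JSON file.
template = {
    "Parameters": {
        "Env": {"Type": "String"},
        "Name": {"Type": "String"},
        "Legacy": {"Type": "String"},
    },
    "Resources": {
        "Bucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "BucketName": {"Fn::Sub": "${Name}-${AWS::Region}"},
                "Tags": [{"Key": "env", "Value": {"Ref": "Env"}}],
            },
        }
    },
}

print(unused_parameters(template))  # ['Legacy']
```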

#3 cfn-nag

This is a linter for CloudFormation. It’s not perfect. I’ve seen it generate false positives like “don’t use * in IAM policy resources” even when * is the only option because it’s all that’s supported by the service I’m writing a policy for. Still, it’s one more way to catch things before you deploy, and it catches some good stuff.

cfn_nag_scan --input-path my_template.yaml

Notes:

  • Annoyingly, this is a Ruby gem so you need a new dependency chain to install it. I highly recommend setting up RVM and creating a gemset to isolate this from your system and other projects (just like you’d do with a Python venv).
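If you’d rather drive all three checks from Python than from a shell script, something like the sketch below could work. This is not the script from the gist. The function names are hypothetical, the runner parameter exists only so the logic can be tried without the real tools installed, and the commands are the same ones shown above.

```python
import json
import subprocess


def cli_commands(template_path):
    """Return the AWS CLI and cfn-nag commands as argument lists."""
    return [
        ["aws", "cloudformation", "validate-template",
         "--template-body", f"file://{template_path}"],
        ["cfn_nag_scan", "--input-path", template_path],
    ]


def json_is_valid(template_path):
    """Return True if the file parses as strict JSON."""
    try:
        with open(template_path) as f:
            json.load(f)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def run_checks(template_path, runner=subprocess.run):
    """Run the checks in order and stop at the first failure."""
    aws_validate, cfn_nag = cli_commands(template_path)
    if runner(aws_validate).returncode != 0:
        return False
    # Only JSON templates get the strict parse; YAML templates skip it.
    if template_path.endswith(".json") and not json_is_valid(template_path):
        return False
    return runner(cfn_nag).returncode == 0
```

Calling run_checks("my_template.yaml") returns True only if every check passes, so a wrapper script can turn that into its exit code.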

Happy automating!

Adam