DevOps FMEA Shortlist

Update June 2019: Added cache server failures.

Failure Modes and Effects Analysis (FMEA) is a formal practice of breaking individual pieces of your system and watching how each failure effects the system as a whole (this is the definition I’ve seen used most in tech). Stop one server in a pool and capacity drops. No problem. Stop one server in a pool and users can’t login anymore? Problem. FMEA helps ensure that downstream effects of failures aren’t bigger than you expect.

Real FMEA is a Big Deal and is often out of reach. But, even doing small FMEA during routine development can save your life. I love a good list, and over the years I’ve assembled one from real world failures that had bigger consequences than expected. Testing those failures has saved me many times, so I thought I’d share the list.

Some failures might look the same, like rebooting and terminating, but the differences matter. I’ve seen a lot of infrastructures go down because deployment automation started app processes with nohup and didn’t create an init config to ensure they restarted on reboot.

Failure Expected Outcome
Clocks are slow (e.g. ntpd died yesterday) Alerts to operators who can respond
Config that prevents the app from starting is deployed Clear failure message from deploys (but no outage in the running system)
External service goes down (e.g. PyPI) Clear failure message from deploys (but no outage in the running system)
Database is out of space Alerts to operators who can respond
App process is killed by the Linux OOM Alerts to operators and a short drop in capacity that recovers automatically
That singleton host you haven’t gotten rid of terminates Brief outage that recovers automatically
App host reboots Short drop in capacity that recovers automatically
App host terminates Short drop in capacity that recovers automatically
App service stops on a host Short drop in capacity that recovers automatically
All app hosts terminate Brief outage that recovers automatically
Worker host reboots Short spike in queue length
Worker host terminates Short spike in queue length
Worker service stops on a host Short spike in queue length
All worker hosts terminate Slightly longer spike in queue length
Cache host reboots Temporarily increased response times, some users are logged out
Cache host terminates Temporarily increased response times, some users are logged out
Cache service stops on a host Temporarily increased response times, some users are logged out

There’s a lot more you’d have to test to fully exercise your system, this is just a lightweight list of the failures I’ve found most valuable to test. I usually don’t run every simulation when developing every feature, just the ones closely connected to what I’m building. It’s a good practice to simulate several failures at once. What happens if an external service like PyPI is down and a host is terminated?

The full details on why each of these is important is out of scope today, but if you’re interested in them let me know and I’ll look at covering them in future articles.

If you’re already thinking of modifications you’d need to make for this list to work for you, that’s good! Maybe you don’t depend on time sync (although you likely do, so many base components do that it’s hard to avoid). Maybe you have a legacy queue system that has unique failure cases (like that time the instance type changed which changed the processor count which made it run out of licenses). Maybe you don’t depend on singletons (huzzah!). Adjust as needed.

Happy automating!

Adam

Securing IAM Policies

SecurityIdentityCompliance_GRAYSCALE_IAM

Since the beginning, writing IAM policies with the minimum necessary permissions has been hard. Some services don’t have resource-level permissions (you have to grant to *), but then later they do. When a service has resource-level permissions, it may only be for some of its permissions (the rest still need *). Some services have their own Condition Operators (separate from the global ones) that may or may not help you tighten control. Et cetera. The details are documented differently for each service and it’s a lot of hunting and testing to try to put together a tight policy.

Amazon made it easier! There’s new magic in the IAM UI to help you create policies. It has some limitations, but it’s a big improvement. Here are some of the things it can do that I used to have to do myself:

  • Knows which S3 permissions require the resource list to include a bucket name and which require the bucket name and an object path.StatementSplitting
  • Tries to group permissions and resources into statements when it results in equivalent access (but sometimes ends up granting extra access, see below).StatementGrouping
  • Knows when a service doesn’t support resource-level permissions.ResourceSpecificPermissionsDetection
  • Knows about the Condition Operators specific to each service (not just the global ones).ConditionOperators

There are some limitations:

  • Doesn’t deduplicate. If you add permissions it doesn’t go back and put them into existing statements, it just adds new statements that may duplicate parts of old ones.
  • Only generates JSON, so if you’re writing a YAML CloudFormation template you should translate.
  • Seems to have limited form validation on Condition Operators. You can put in strings that will never match because the API calls for that service can’t contain what you entered (making the statement a no-op).
  • Can end up grouping permissions in a way that makes some resource restrictions meaningless and grants more access than might be expected.TooMuchPermission
  • Sometimes it messes up the syntax. Seems to happen if you don’t put exactly what it expects into the forms.Bug

 

So there are a few problems, but this is still way better than it was before! My plan is to use the visual editor to write policies, then go through and touch it up afterward. Based on what I’ve seen so far, this cuts the time it takes me to develop policies by about 30%.

Happy securing,

Adam

The Fallacy of Rest

Hello!

A while back I made a bad scheduling mistake. I knew about the anti-pattern that caused it, but didn’t see myself using it. It forced me to push out dates that cost me some money.

Later I looked back to see what went wrong. It was exactly what I have advised others not to do. It’s easy to miss! I’m writing this article to re-expose the anti-pattern I used.

The project was Move to a New City. I would be taking my job with me. This is the schedule I wrote:

  • Week 1
    • Pack
    • Work
  • Week 2
    • Weekdays
      • Pack
      • Work
      • Clean
    • Weekend
      • Clean
      • Say goodbye to friends
  • Week 3
    • Monday (Vacation Day)
      • Exercise and rest
      • Say goodbye to friends
    • Tuesday (Vacation Day)
      • Return keys
      • Drive to new city (5 hours on the road)
      • Check in to AirBnB
      • Hang out with friend who lives in new city
    • Wednesday through Friday
      • Work
      • Look at new housing

Seems fine! I even budgeted time to exercise.

Tuesday of week 3. 100% on schedule. It’s bedtime and I’m watching an episode of The Dick Van Dyke Show on my laptop and laughing myself to sleep with Mary Tyler Moore’s performance. I feel awesome. I sleep like I’ve just run a marathon.

Wednesday. Mild headache (whatever – I’m an engineer, we get headaches). I catch up on work, message about a couple rentals, and attend the morning meetings. As the meetings are wrapping up I get a reply on a rental with a proposed time to view it. I can just barely make it, so I head out.

See the mistake yet? I still hadn’t. Wednesday was a busy day and I felt rushed, but I’ve had lots of busy days. I just kept going. I didn’t make the mistake on Wednesday.

That afternoon I got one more email about a rental. It was a wafer-thin mint (see Monty Python’s The Meaning of Life ⬅️ this is how I am making the post about Python). Suddenly getting through the rest of my inbox felt like climbing a mountain. I was burnt out.

The mistake happened when I first wrote the schedule. Here’s the fallacy I used:

People are like horses. Rest them two hours a day and one full day every week or so and they’re fine. Feed and water three times a day.

People are not like horses. They can’t sustain themselves on periodic rest intervals.

Here’s how people work:

Productive workers have a budget of hours per week. When those hours are spent they spend themselves to keep going. Once too much of themselves is gone, they stop producing.

I wrote a schedule in the mindset of making sure I had rest intervals, but I should have figured out the hours needed and divided that by my sustainable weekly hours (a number I’ve learned during two decades of working). That would be the total weeks really needed to complete the move.

Going back over the hours I spent I found I had scheduled 200% of my sustainable capacity and had expected to sustain that for most of a month. (╯°□°)╯︵ ┻━┻

Another way to look at my mistake is that I didn’t count saying goodbye to friends as work (just like I sometimes forget to count attending meetings as work). In the context of human capacity, leaving behind your friends is absolutely work (just like sitting in a frustrating meeting is). It drains your budget of hours. If you do too much of it, you exhaust.

To write a schedule that workers can reliably complete, budget based on what workers can do per week and make sure you get that amount from their real history of work. Don’t make it up, look back at the past and compute it.

I’m going to bed. Happy scheduling!

Adam

A Book from 2017: Stretch Goals and Prescriptions

Happy New Year!

Today’s post is a little outside my usual DevOps geekery, but it’s been an influencer on my work and my career choices this year so I wanted to share it.

For the record, I have zero connections to 3M.

In my teens, I noticed that whenever I bought something with the 3M logo it was noticeably better than the other brands. I didn’t know what 3M was, but this pattern kept repeating and I started to always choose them. Years later, deep inside a career in technology, I was still choosing 3M. I started to ask myself how they did it. Why were all their products better than everyone else’s?

I didn’t know anyone at 3M, so I found a book. The 3M Way to Innovation: Balancing People and Profit.

the3mwaytoinnovation.jpg

Balance? At work? And still better than everyone else? Bring it on.

The book approaches 3M through their innovations. They built hugely successful product lines in everything from sandpaper to projectors, and it turns out other companies have long looked to them as the top standard for the innovation that drives such diverse success. As I worked through the book, one thing really stuck with me: 3M’s definition of Stretch Goals.

I’ve seen a lot of managers ask their teams what can be accomplished in the next unit of time (sprint, quarter, etc.). Often, the team replies with a list that’s shorter than the manager would like. The manager then over-assigns the team by adding items as “stretch goals”. If the team works hard enough and accomplishes enough, they’ll have time to stretch themselves to meet these goals. The outcome I usually see is pressure for teams to work longer hours (with no extra pay) so they can deliver more product (at no extra cost to the company).

This book described 3M’s stretch goals very differently, which I’ll summarize in my own words because it’s characterized throughout the book and there’s no single quote that I think captures it. 3M sets these goals to stretch an aspect of the business that’s needed for it to remain a top competitor, and they’re deliberately ambitious. For example, one that 3M actually used: 30% of annual sales should come from products introduced in the last four years. Goals like these drive innovation because they’re too big to meet with the company’s current practices.

The key difference is that 3M isn’t trying to stretch the capacity of individuals. They’re not trying to increase Scrum points by pushing everyone to work late. They’re setting targets for the company that are impossible to meet unless the teams find new ways to work. They’re driving change by looking for things that can only be done with new approaches; things that can’t be done just by working longer hours. And after they set these goals, they send deeply committed managers out into the trenches to help their teams find and implement these changes. Most of the book is about what happens in those trenches. I highly recommend it.

There’s one other thing from the book I want to highlight: the process of innovation doesn’t simplify into management practices you can choose off a menu. There’s more magic to it than that. It takes skilled leaders and a delicate combination of freedom and pressure to build a company where the best engineers can do their best work, and trying to reduce that to a prescription doesn’t work. Here’s a quote from Dick Lidstad, one of the 3M leaders interviewed for the book, talking about staff from other companies who come to 3M looking to learn some of the innovation practices so they can implement them in their own teams:

They want to take away one or two things that will help them to innovate. … We say that maintaining a climate in which innovation flourishes may be the single biggest factor overall. As the conversation winds down, it becomes clear that what they want is something that is easily transferable. They want specific practices or policies, and get frustrated because they’d like to go away with a clear prescription.

I heard truth in that quote. Despite being a believer in the value of tools like Scrum, which are supposed to foster creativity and innovation, I’ve spent a lot of my career held back by the overhead of process that’s good in principle but applied with too little care to be effective. Ever spent an entire day in Scrum ceremonies? There’s more value in the experience of 3M’s teams overall than there is in any list of process.

This book was written in 2000, but not only has 3M stock continued to perform well, I found many parallels in the stories this author tells and my own experience in the modern tech world. It’s heavy with references and first-hand interviews, and I think it’s a valuable read for anyone in tech today.

If you read it, let me know what you think!

Adam

Production-ready Scripts and Python

Production is hard. Even a simple script that looks up queue state and sends it to an API gets complex in prod. Without tests, the divide by zero case you missed will mask queue overloads. Someone won’t see that required argument you didn’t enforce and break everything when they accidentally publish a null value. You’ll forget to timestamp one of your output lines and then when the queue goes down you won’t be able to correlate queue status to network events.

Python can help! Out of box it can give you essential but often-skipped features, like these:

  • Automated tests for multiple platforms.
  • A --simulate option.
  • Command line sanity like a --help option and enforcement of required arguments.
  • Informative log output.
  • An easy way to build and package.
  • An easy way to install a build without a git clone.
  • A command that you can just run like any other command. No weird shell setup or invocation required.

It can be a little tricky, though, if you haven’t already done it. So I wrote a project that demonstrates it for you. It includes an example of a script that isn’t ready for prod.

Hopefully this will save you from some of the many totally avoidable, horrible problems that bad scripts have caused in my prods.

Thanks for reading!

Adam

Pear: A Better Way to Deploy to AWS

Update 29 September 2018: Since the release of AWS Step Functions, this pattern is out of date. Orchestrating deployment with a slim lambda function is still a solid pattern, but Step Functions could make the implementation simpler and could provide more features (like graphical display of deployment status). You can even see some of the ways this might work in AWS’s DevOps blog. Those implementation changes would drive some re-architecture, too. Because those are major changes, for now Pear is an interesting exploration of a pattern but isn’t ready for use in new deployments.

A while back I wanted to put a voice interface in front of my deployment automation. I think passwords on voice interfaces are annoying and aren’t secure, so I wanted an unauthenticated system. I didn’t want to lay awake at night worried about all the things people could break with it, so I set out to engineer a deployment infrastructure that I could put behind voice control and still feel was secure and reliable.

After a lot of learning (the journey towards this is where the Life Coach and the Better Alexa Quick Start came from) and several revisions, I had built my new infrastructure. For reasons I can’t remember, I called it Pear.

Pear is designed to make it easy to add slick features like voice interfaces while giving you enough control to stay secure and enough stability to operate in production. It’s meant for infrastructures too complex to fit in tools like Heroku or Elastic Beanstalk, for situations where you need to be able to turn all the knobs. I think it achieves those goals, so I decided to publish it. Check out the repo for an example implementation and more complete documentation of its features, but to give you a taste here’s a diagram of the basic architecture:

pear architecture

Cheers!

Adam

The Service Checklists

One day I get a text from the illimitable Kai Davis. He’s had a Bad Moment.

Adam. I have terrible OpSec.

A former user had deleted a bunch of files. Luckily, he was able to recover.

Teach me how to OpSec.

No worries buddy. I got you.

Kai is a power user, and in today’s Internet that means he subscribes to two dozen hosted services. How do you manage two dozen services and keep any kind of sanity? I do it with checklists (⬅️ read this book).

Before I show them to you, we need to cover one of the Big Important Things from Mr. Gawande’s book. Kai already knows how to manage his services. He just needs to make sure he hasn’t forgotten something important like disabling access for former users.

I wrote Kai two checklists. One to use monthly to make sure nothing gets missed and one to use when setting up new services to reduce the monthly work. I assume he has a master spreadsheet listing all his services. Kai’s Bad Moment categorizes as OpSec, but I didn’t limit these lists to that category.

Hopefully, these help you as well.

The Monthly Checklist

  • Can I cancel this service?
  • Should I delete users?
  • Should I change shared passwords?
  • Should I un-share anything?
  • Should I force-disconnect any devices?
  • Is the domain name about to expire?
  • Is the credit card about to expire?
  • Am I paying for more than I use?
  • Should I cancel auto-renewal?
  • Are there any messages from the provider in my account? (new!)
  • Is the last backup bigger than the one before it?

The Setup Checklist

  • Add row to master spreadsheet.
  • Save URL, account ID, username, password, email address, and secret questions in 1password.
  • Sign up for paperless everything.
  • Enter phone number and mailing address into account profile.
  • Review privacy settings.
  • Enable MFA.
  • Send hardcopy of MFA backup codes offsite.
  • Setup recurring billing.
  • Set alarm to manually check the first auto-bill.
  • Set alarm to revisit billing choices.
  • Set schedule for backups.
  • Check that backups contain the really important data.
  • Create a user for my assistant.
  • Confirm my assistant has logged in.

Some Notes

Monthly

  • Can I cancel this service? I always ask “can I”, not “should I”. There’s always a reason to keep it, but I want a reason to nuke it.
  • Am I paying for more than I use? I look at current usage, not predicted usage. The number is often not actionable, but it’s a good lens.

Setup

  • Save URL, account ID, username, password, email address, and secret questions in 1password. The URL matters because 1password will use it to give you warnings about known vulnerabilities that you need to change your password to remediate. The email address and username may seem redundant, but having both has saved me a bunch of times. Same with secret questions.
  • Enter phone number and mailing address into account profile. These make recovery and support calls easier.
  • Review privacy settings. Remember, Kai already knows how to manage his services. He knows how to pick good privacy settings. But privacy settings are often hidden and it’s easy to forget them when signing up.
  • Enable MFA. I know it sucks, but the security landscape gets worse every day. Use this for anything expensive or private.
  • Send hardcopy of MFA backup codes offsite. I have watched people spend months on account recovery when their phones die and they lose their Google Auth.
  • Set alarm to manually check the first auto-bill. This saves me all the time. All. The. Time.
  • Set alarm to revisit billing choices. This has saved me thousands of dollars.
  • Set schedule for backups. Even if it’s an alarm to do a manual backup once a month.

Stay safe!

Adam

Credit Card Debt

Technical debt is like a new credit card, it often comes with a 0% introductory interest rate. In the short term tech debt can look like a win; you get the new feature on time, you automate a manual process, you patch the bug. Maybe the implementation wasn’t perfect, but dealing with a bit of funky code or living with a few bugs is better than missing the deadline.

That loan comes due right away, you have to live with what you wrote, but the interest comes later. In a month (or three or six) something will happen that magnifies the impact of that funkiness or those bugs. You’ll need an unrelated feature but because you monkey-patched in the config for the first feature you’ll be forced to rewrite the config system before you can start, adding days to your timeline. You’ll need to install a zero-day security patch but your runtime hack will force you to shut down before you can patch, causing an outage.

Like a credit card, tech debt is manageable. If you pay back the new card on the right schedule it can get you the new TV without making you miss a rent payment. If you clean up your runtime hack in the next couple weeks it’s unlikely that a zero-day patch will be released before you’re done. If you don’t pay it back or you take out too many new cards, you can end up like the guy who makes six figures but rents a basement because his credit cards cost him $3,000 every month. You’ll fall behind on new feature development because you can’t build anything without fixing three old hacks first.

Unlike a credit card, the introductory rates of tech debt are hard to predict. You don’t know how many months of freedom from interest you get, and they may expire at the worst times. That zero-day patch might come out the week after you push your funky code to prod and you’ll be stuck with an outage. You might gamble if you know you’ll still be within the month’s SLA, but if you’ve gambled on twenty things like that you’ve got great odds that the bill on several debts will blow up at a bad time.

Every win has to come with hard questions about the debt it sits on. How much of this implementation will we be forced to rewrite? Does this new feature really work or does it just mostly work but we haven’t looked deep enough to see the problems? Do the funky parts of this code overlap with upcoming work?

Loans can get you ahead, and are manageable if you’re careful, but if you win by taking out too many it won’t matter how far ahead they got you. You’ll fall behind when they get too heavy. You’ll be a six figure team living in a basement.

Coverage, Syntax, And Chef

In Why I Don’t Track Test Coverage I explained why I don’t think coverage measures the quality of my tests. There’s a counter argument, though: 100% coverage means every line of code runs when the tests run, so it’s impossible to break prod with a misplaced comma. I think this counter argument is wrong when you’re working in Chef (I think it’s wrong in Python too but I’ll cover that in a separate post).

When Chef runs it processes each recipe into a list of actions (the ‘compile’ phase) and then it does those actions on the system (the ‘converge’ phase). The compile phase will (usually) execute every line of every recipe that’s in the run_list or is included with include_recipe. Both ChefSpec and kitchen/Serverspec take Chef through its compile phase, so in simple cases a syntax error will make both fail before the system is touched.

There are three (anti-)patterns in Chef that I know of that can sneak changes to system state past the compiler even when there are syntax errors:

#1 Raw Ruby

Chef recipes are Ruby files. You can put any valid Ruby code in them. You could do this:

File.delete('/etc/my_app/config.ini')

Ruby deletes config.ini as soon as it hits this line, before the rest of the compile finishes. If there’s a syntax problem later in the code you’ll still get an error but you’ll already have edited the system. The fallout of incomplete Chef client runs can get ugly (more on that another time).

Imagine if the tests for a Jenkins cookbook deleted the Jenkins config file. Then, a side-effect like this could take down a build server that does the innocent task of running ChefSpecs (which are only supposed to simulate Chef’s actions). It’s also surprisingly easy to accidentally hide this from the tests using #2 or #3 from below, which can cause incomplete Chef runs in production.

If you have side-effects like this in your code, replace them with a Chef resource (file with the :delete action in this case), write a custom resource, extract them into a gem that’s run before Chef runs, etc. Chef shouldn’t touch the state of the system before its converge phase.

#2 Ruby Conditions

Foodcritic, the linter for Chef, warns you not to do this:

if node['foo'] == 'bar'
  service 'apache' do
    action :enable
  end
end

Their argument is that you should use Guards, the library’s built-in feature:

service 'apache' do
  action :enable
  only_if { node['foo'] == 'bar' }
end

That’s a great argument, but there’s one more: with a Ruby condition, the resource won’t be compiled unless node[‘foo’] == ‘bar’. That means that unless you have a test where this is set, the compiler will never touch this resource and a syntax error won’t make the tests fail.

If you follow foodcritic’s recommendation, conditional resources will always be compiled (but may not be converged) and syntax errors will fail early without you doing any work.

#3 Conditional Includes

These technically belong with the other Ruby conditions, but they’re extra-nasty so I’m dedicating a section to them.

If you do this:

if node['foo'] == 'bar'
  include_recipe 'my_cookbook::my_recipe'
end

The resources in my_recipe will only be compiled if foo is set to bar in the node object. This is like putting a Ruby condition around every resource in my_recipe.

It gets worse if your condition is processed in the converge phase. For example, you could do an include in a ruby_block:

block 'run_my_recipe' do
  if File.directory?('/etc/my_app')
    run_context.include_recipe 'my_cookbook::my_recipe'
  end
end

Even if /etc/my_app exists, my_recipe won’t be compiled until Chef enters the converge phase and reaches the run_my_recipe resource. I bet you that nobody reading your cookbook will realize that it changes Chef’s “compile then converge” order into “compile some stuff then converge some stuff then compile the rest then converge the rest”. This is likely to bite you. Plus, now you have to start looking at mocks to make sure the tests exercise all your recipes. My advice is to avoid this pattern. Maybe there’s some special situation I haven’t found, but the few cases of converge-time conditional includes that I’ve seen have been hacks.

Conditional includes are usually a symptom of using Chef for something it’s not designed for. Chef is designed to converge the host where it’s running to a specific state. Its resources do a great job of detecting if that state is already present and skipping their actions if it is. If you have lots of resources that aren’t safe to run multiple times and that Chef isn’t automatically skipping then you should take a step back and make sure Chef is the right tool. Your standard approach should be to include all the recipes that may need to run and write each recipe to guard its resources from running when they shouldn’t.

 

If you write your Chef cookbooks well then you get 100% syntax coverage for free if you’ve written even one test. You can focus on exercising the logic of your recipes. Leave it to the robots to catch those misplaced commas.

Thanks for reading!

Adam