The Service Checklists

One day I get a text from the illimitable Kai Davis. He’s had a Bad Moment.

Adam. I have terrible OpSec.

A former user had deleted a bunch of files. Luckily, he was able to recover them.

Teach me how to OpSec.

No worries buddy. I got you.

Kai is a power user, and on today’s Internet that means he subscribes to two dozen hosted services. How do you manage two dozen services and keep any kind of sanity? I do it with checklists (⬅️ go read Atul Gawande’s The Checklist Manifesto).

Before I show them to you, we need to cover one of the Big Important Things from Mr. Gawande’s book: checklists aren’t there to teach you the work, they’re there to keep you from missing steps you already know. Kai already knows how to manage his services. He just needs to make sure he hasn’t forgotten something important, like disabling access for former users.

I wrote Kai two checklists: one to use monthly to make sure nothing gets missed, and one to use when setting up new services to reduce the monthly work. I assume he has a master spreadsheet listing all his services. Kai’s Bad Moment falls under OpSec, but I didn’t limit these lists to that category.

Hopefully, these help you as well.

The Monthly Checklist

  • Can I cancel this service?
  • Should I delete users?
  • Should I change shared passwords?
  • Should I un-share anything?
  • Should I force-disconnect any devices?
  • Is the domain name about to expire?
  • Is the credit card about to expire?
  • Am I paying for more than I use?
  • Should I cancel auto-renewal?
  • Are there any messages from the provider in my account? (new!)
  • Is the last backup bigger than the one before it?

The Setup Checklist

  • Add row to master spreadsheet.
  • Save URL, account ID, username, password, email address, and secret questions in 1Password.
  • Sign up for paperless everything.
  • Enter phone number and mailing address into account profile.
  • Review privacy settings.
  • Enable MFA.
  • Send a hard copy of MFA backup codes offsite.
  • Set up recurring billing.
  • Set alarm to manually check the first auto-bill.
  • Set alarm to revisit billing choices.
  • Set schedule for backups.
  • Check that backups contain the really important data.
  • Create a user for my assistant.
  • Confirm my assistant has logged in.

Some Notes

Monthly

  • Can I cancel this service? I always ask “can I”, not “should I”. There’s always a reason to keep it, but I want a reason to nuke it.
  • Am I paying for more than I use? I look at current usage, not predicted usage. The number is often not actionable, but it’s a good lens.

Setup

  • Save URL, account ID, username, password, email address, and secret questions in 1Password. The URL matters because 1Password uses it to warn you about known vulnerabilities that you need to change your password to remediate. The email address and username may seem redundant, but having both has saved me a bunch of times. Same with secret questions.
  • Enter phone number and mailing address into account profile. These make recovery and support calls easier.
  • Review privacy settings. Remember, Kai already knows how to manage his services. He knows how to pick good privacy settings. But privacy settings are often hidden and it’s easy to forget them when signing up.
  • Enable MFA. I know it sucks, but the security landscape gets worse every day. Use this for anything expensive or private.
  • Send a hard copy of MFA backup codes offsite. I have watched people spend months on account recovery after their phones died and took their Google Authenticator codes with them.
  • Set alarm to manually check the first auto-bill. This saves me all the time. All. The. Time.
  • Set alarm to revisit billing choices. This has saved me thousands of dollars.
  • Set schedule for backups. Even if it’s an alarm to do a manual backup once a month.

Stay safe!

Adam

Credit Card Debt

Technical debt is like a new credit card: it often comes with a 0% introductory interest rate. In the short term, tech debt can look like a win. You get the new feature out on time, you automate a manual process, you patch the bug. Maybe the implementation wasn’t perfect, but dealing with a bit of funky code or living with a few bugs is better than missing the deadline.

The loan comes due right away (you have to live with what you wrote), but the interest comes later. In a month (or three, or six) something will happen that magnifies the impact of that funkiness or those bugs. You’ll need an unrelated feature, but because you monkey-patched in the config for the first feature, you’ll be forced to rewrite the config system before you can start, adding days to your timeline. You’ll need to install a zero-day security patch, but your runtime hack will force you to shut down before you can patch, causing an outage.

Like a credit card, tech debt is manageable. If you pay back the new card on the right schedule it can get you the new TV without making you miss a rent payment. If you clean up your runtime hack in the next couple weeks it’s unlikely that a zero-day patch will be released before you’re done. If you don’t pay it back or you take out too many new cards, you can end up like the guy who makes six figures but rents a basement because his credit cards cost him $3,000 every month. You’ll fall behind on new feature development because you can’t build anything without fixing three old hacks first.

Unlike a credit card’s, the introductory rates of tech debt are hard to predict. You don’t know how many interest-free months you get, and they may expire at the worst times. That zero-day patch might come out the week after you push your funky code to prod, and you’ll be stuck with an outage. You might gamble when you know you’ll still be within the month’s SLA, but if you’ve gambled on twenty things like that, the odds are good that several of those debts will blow up at a bad time.

Every win has to come with hard questions about the debt it sits on. How much of this implementation will we be forced to rewrite? Does this new feature really work, or does it just mostly work because we haven’t looked deeply enough to see the problems? Do the funky parts of this code overlap with upcoming work?

Loans can get you ahead, and they’re manageable if you’re careful, but if you take out too many it won’t matter how far ahead they got you. You’ll fall behind when they get too heavy. You’ll be a six-figure team living in a basement.

Coverage, Syntax, And Chef

In Why I Don’t Track Test Coverage I explained why I don’t think coverage measures the quality of my tests. There’s a counterargument, though: 100% coverage means every line of code runs when the tests run, so it’s impossible to break prod with a misplaced comma. I think this counterargument is wrong when you’re working in Chef (I think it’s wrong in Python too, but I’ll cover that in a separate post).

When Chef runs, it processes each recipe into a list of actions (the ‘compile’ phase) and then performs those actions on the system (the ‘converge’ phase). The compile phase will (usually) execute every line of every recipe that’s in the run_list or is included with include_recipe. Both ChefSpec and Test Kitchen/Serverspec take Chef through its compile phase, so in simple cases a syntax error will make both fail before the system is touched.
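Here’s a minimal ChefSpec sketch to make that concrete (the my_cookbook name is hypothetical). Converging the recipe in memory forces Chef through its compile phase, so a syntax error anywhere in the recipe fails this spec before any system state changes:

require 'chefspec'

describe 'my_cookbook::default' do
  # Converging in memory compiles every resource without touching the system.
  let(:chef_run) { ChefSpec::SoloRunner.new.converge(described_recipe) }

  it 'compiles and converges cleanly' do
    expect { chef_run }.to_not raise_error
  end
end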

There are three (anti-)patterns in Chef that I know of that can sneak changes to system state past the compiler even when there are syntax errors:

#1 Raw Ruby

Chef recipes are Ruby files. You can put any valid Ruby code in them. You could do this:

File.delete('/etc/my_app/config.ini')

Ruby deletes config.ini as soon as it hits this line, before the rest of the compile finishes. If there’s a syntax problem later in the code, you’ll still get an error, but you’ll already have edited the system. The fallout of incomplete Chef client runs can get ugly (more on that another time).

Imagine if the tests for a Jenkins cookbook deleted the Jenkins config file. A side effect like this could take down a build server whose innocent job is running ChefSpecs (which are only supposed to simulate Chef’s actions). It’s also surprisingly easy to accidentally hide this from the tests using #2 or #3 below, which can cause incomplete Chef runs in production.

If you have side effects like this in your code, replace them with a Chef resource (file with the :delete action in this case), write a custom resource, extract them into a gem that runs before Chef does, etc. Chef shouldn’t touch the state of the system before its converge phase.
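The resource version of the raw delete above is tiny, and it doesn’t run until converge:

file '/etc/my_app/config.ini' do
  action :delete
end

ChefSpec can now simulate the delete without ever removing the real file.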

#2 Ruby Conditions

Foodcritic, the linter for Chef, warns you not to do this:

if node['foo'] == 'bar'
  service 'apache' do
    action :enable
  end
end

Foodcritic’s argument is that you should use guards, Chef’s built-in conditionals:

service 'apache' do
  action :enable
  only_if { node['foo'] == 'bar' }
end

That’s a great argument, but there’s one more: with a Ruby condition, the resource won’t be compiled unless node['foo'] == 'bar'. That means that unless you have a test where that attribute is set, the compiler will never touch this resource and a syntax error won’t make the tests fail.

If you follow Foodcritic’s recommendation, conditional resources will always be compiled (but not necessarily converged), and syntax errors will fail early without any extra work from you.
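And when you do want to test the converge-time behavior, ChefSpec can set the attribute for you. A sketch, assuming the guarded resource above lives in a hypothetical my_cookbook::default:

describe 'my_cookbook::default' do
  let(:chef_run) do
    # Set the attribute so the guard passes during the in-memory converge.
    ChefSpec::SoloRunner.new do |node|
      node.normal['foo'] = 'bar'
    end.converge(described_recipe)
  end

  it 'enables apache when foo is bar' do
    expect(chef_run).to enable_service('apache')
  end
end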

#3 Conditional Includes

These technically belong with the other Ruby conditions, but they’re extra-nasty so I’m dedicating a section to them.

If you do this:

if node['foo'] == 'bar'
  include_recipe 'my_cookbook::my_recipe'
end

The resources in my_recipe will only be compiled if foo is set to bar in the node object. This is like putting a Ruby condition around every resource in my_recipe.

It gets worse if your condition is processed in the converge phase. For example, you could do an include in a ruby_block:

ruby_block 'run_my_recipe' do
  block do
    if File.directory?('/etc/my_app')
      run_context.include_recipe 'my_cookbook::my_recipe'
    end
  end
end

Even if /etc/my_app exists, my_recipe won’t be compiled until Chef enters the converge phase and reaches the run_my_recipe resource. I bet nobody reading your cookbook will realize that it changes Chef’s “compile then converge” order into “compile some stuff, converge some stuff, then compile the rest and converge the rest”. This is likely to bite you. Plus, now you have to start looking at mocks to make sure the tests exercise all your recipes. My advice is to avoid this pattern. Maybe there’s some special situation I haven’t found, but the few cases of converge-time conditional includes that I’ve seen have been hacks.

Conditional includes are usually a symptom of using Chef for something it’s not designed for. Chef is designed to converge the host where it’s running to a specific state. Its resources do a great job of detecting if that state is already present and skipping their actions if it is. If you have lots of resources that aren’t safe to run multiple times and that Chef isn’t automatically skipping then you should take a step back and make sure Chef is the right tool. Your standard approach should be to include all the recipes that may need to run and write each recipe to guard its resources from running when they shouldn’t.
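A sketch of that standard approach, with hypothetical names. The include is unconditional, so everything compiles, and each resource guards itself at converge time:

# In the parent recipe: always include, so every resource compiles.
include_recipe 'my_cookbook::my_recipe'

# Inside my_cookbook::my_recipe: the guard decides at converge time.
service 'my_app' do
  action :restart
  only_if { ::File.directory?('/etc/my_app') }
end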

 

If you write your Chef cookbooks well, then even a single test gets you 100% syntax coverage for free. You can focus on exercising the logic of your recipes and leave it to the robots to catch those misplaced commas.

Thanks for reading!

Adam

Why I Don’t Track Test Coverage

Last year I went through my programming habits looking for things to improve. All my projects had coverage reports, and my coverage was 95-100%. Looking deeper, I found that developing that coverage had actually hurt my projects. I decided to turn off coverage reports, and a year later I have no reason to turn them back on. Here’s why:

#1 Coverage is distracting.

I littered my code with markers telling the coverage engine to skip hard-to-test lines, like Python’s protection against import side effects:

if __name__ == "__main__": # pragma: no cover
    main()

In some projects I had a test_coverage.py test module just for the tests that tricked the argument handling code into running. Most of those tests barely did more than assert that core libraries worked.

I also went down rabbit trails trying to mock enough of the boilerplate, like module loaders, to get a few more lines to run. Those were often fiddly corners of the language, and the rabbit trails were surprisingly long.

#2 Coverage earns undeserved confidence

While cleaning up old code written by somebody else, I wrote a test suite to protect me from regressions. It had 98% coverage. It didn’t protect me from anything. The code was full of stuff like this:

main.py

import dbs

def helper_function():
    tables = dbs.get_tables()
    # ... other db-related stuff.

dbs.py

DBS = ['db1.local', 'db2.local']
TABLES = list()
for db in DBS:
    # Bunch of code that generates table names.

This is terrible code, but I was stuck with it. One of its problems is that dbs.py runs on import as a side effect; import dbs causes all the code in that module to execute. To write a simple test of helper_function I had to import from main.py, which imported the dbs module, which ran every line in it. A test of a five-line function took me from 0% coverage to over 50%.
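For illustration, a sketch of one of those tests. Just importing main executes every line of dbs.py, so coverage soars before the test asserts anything:

import main  # this import runs all of dbs.py as a side effect

def test_helper_function_runs():
    # The only thing this really asserts is “it doesn’t crash”.
    main.helper_function()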

When I hit 98% coverage I stopped writing tests, but I was still plagued by regressions during my refactors. The McCabe complexity of the code was over 12, and asserting the behavior buried in those lines needed two or three times the number of tests I’d written. Most tests ran the same lines over and over because of the import side effects, but each test worked the code in a different way.

 

I considered revising my test coverage habits: excluding broader sections of code from coverage reports so I didn’t have to mock out things like argument handling, reducing my coverage threshold from 95% to 75%, or treating legacy code as a special case and just turning off coverage there. But if I did all those things, the tests that were left would be the tests I’d have written whether or not I was thinking about coverage.

Today, I don’t think about covering the code, I think about exercising it. I ask myself questions like these:

  • Is there anything I check by hand after running the tests? Write a test for that.
  • Will it work if it gets junk input? Use factories or mocks to create that input.
  • If I pass it the flag to simulate, does it touch the database? Write a test with a mocked DB connector to make sure it doesn’t (see the sketch below).
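Here’s a sketch of a test for that last question. The names sync_tables and get_connection are hypothetical stand-ins for whatever entry point and connector your code actually has:

from unittest import mock

import main

def test_simulate_flag_skips_the_database():
    # Replace the dbs module that main imported with a mock that records calls.
    with mock.patch('main.dbs') as fake_dbs:
        main.sync_tables(simulate=True)
        # If simulate mode is honored, nothing should reach the connector.
        fake_dbs.get_connection.assert_not_called()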

Your tests shouldn’t be there to run lines of code. They should be there to assert that the code does what it’s supposed to do.

Happy Thanksgiving!

Adam

The Paper Shredder

Four years ago, I was a System Administrator fixing wifi and printers. I learned a lot about diagnostics from those jobs. Like this: You don’t fix problems, you understand them. Often, you’re done ten minutes after you figure out what’s happening.

A couple folks in the office told me their wifi would stop working for a few minutes randomly throughout the day. But not every day. Some days it’d break a few times, other days no problems.

Errors logged in the access points? None. One dead access point that devices usually ignored because it had low signal most of the time? Nope. Weird errors in the laptops’ system logs? None. DNS or other non-wifi network problems? No symptoms.

I couldn’t figure it out and ended up telling them to text me next time it happened so I could walk over and do some testing while it was broken. Turns out I wouldn’t need to.

The next morning I was walking in and passed an office supply room where one of the wifi access points was installed. I noticed it had been moved off its shelf. Someone had set it on a paper shredder.

Paper shredders have electric motors in them, and motors throw off electromagnetic interference. Every time someone turned the shredder on, it shredded the signal and anyone connected to that access point got disconnected. A few folks weren’t in range of any other access points, so they lost their network connection. After the shredding stopped, they reconnected. I moved the access point back. Problem solved.

There is art in solving problems (and sometimes luck), but it’s often not in your solution. The magic is in finding new places to look for causes when you’ve already looked everywhere.

Thanks for reading!

Adam

P.S. I thought about telling this as a story about best practices instead of diagnostics. If the access point had been mounted to the ceiling instead of sitting on a shelf, nobody could have moved it. What do you think, should I do a follow-up post?