Simple Python boto3 Logging

Hello!

If you’re writing a lambda function, check out this article instead.

The best way to log output from boto3 is with Python’s logging library. The core docs have a nice tutorial (⬅️ check this out if you need more details on things like what log levels are and how to add timestamps to messages).

Good libraries, like boto, use Python’s logging library internally. If you set up a logger using the same library, boto’s logs will output automatically (along with your own logs). If you use print() statements, the only boto output you’ll see is what you capture and print yourself.

Here’s what I do for (simple) boto scripts:

import logging
import boto3

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger()
    logger.info('Connecting to EC2...')
    ec2 = boto3.client('ec2')

That’s it! The basicConfig() function sets up the root logger for you at whatever level you choose. Anything you output in info() lines will appear. Anything boto logged at the info level (or higher) will appear.
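
If you want timestamps on every line, basicConfig() also takes a format string. A minimal sketch (the outputs below use the default format, so they won’t show these prefixes):

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s %(message)s',
)
logging.getLogger(__name__).info('Connecting to EC2...')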

If you set the level to INFO, you’ll get limited output:

INFO:root:Connecting to EC2...
INFO:botocore.credentials:Found credentials in environment variables.
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    ec2 = boto3.client('ec2')
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/boto3/__init__.py", line 91, in client
    return _get_default_session().client(*args, **kwargs)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/boto3/session.py", line 263, in client
    aws_session_token=aws_session_token, config=config)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/session.py", line 838, in create_client
    client_config=config, api_version=api_version)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 86, in create_client
    verify, credentials, scoped_config, client_config, endpoint_bridge)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 328, in _get_client_args
    verify, credentials, scoped_config, client_config, endpoint_bridge)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/args.py", line 47, in get_client_args
    endpoint_url, is_secure, scoped_config)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/args.py", line 117, in compute_client_args
    service_name, region_name, endpoint_url, is_secure)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 402, in resolve
    service_name, region_name)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/regions.py", line 122, in construct_endpoint
    partition, service_name, region_name)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/regions.py", line 135, in _endpoint_for_partition
    raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.

If you turn the level down to DEBUG, boto will tell you everything:

INFO:root:Connecting to EC2...
DEBUG:botocore.hooks:Changing event name from creating-client-class.iot-data to creating-client-class.iot-data-plane
DEBUG:botocore.hooks:Changing event name from before-call.apigateway to before-call.api-gateway
DEBUG:botocore.hooks:Changing event name from request-created.machinelearning.Predict to request-created.machine-learning.Predict
DEBUG:botocore.hooks:Changing event name from before-parameter-build.autoscaling.CreateLaunchConfiguration to before-parameter-build.auto-scaling.CreateLaunchConfiguration
DEBUG:botocore.hooks:Changing event name from before-parameter-build.route53 to before-parameter-build.route-53
DEBUG:botocore.hooks:Changing event name from request-created.cloudsearchdomain.Search to request-created.cloudsearch-domain.Search
DEBUG:botocore.hooks:Changing event name from docs.*.autoscaling.CreateLaunchConfiguration.complete-section to docs.*.auto-scaling.CreateLaunchConfiguration.complete-section
DEBUG:botocore.hooks:Changing event name from before-parameter-build.logs.CreateExportTask to before-parameter-build.cloudwatch-logs.CreateExportTask
DEBUG:botocore.hooks:Changing event name from docs.*.logs.CreateExportTask.complete-section to docs.*.cloudwatch-logs.CreateExportTask.complete-section
DEBUG:botocore.hooks:Changing event name from before-parameter-build.cloudsearchdomain.Search to before-parameter-build.cloudsearch-domain.Search
DEBUG:botocore.hooks:Changing event name from docs.*.cloudsearchdomain.Search.complete-section to docs.*.cloudsearch-domain.Search.complete-section
DEBUG:botocore.credentials:Looking for credentials via: env
INFO:botocore.credentials:Found credentials in environment variables.
DEBUG:botocore.loaders:Loading JSON file: /Users/adam/opt/env3/lib/python3.7/site-packages/botocore/data/endpoints.json
DEBUG:botocore.hooks:Event choose-service-name: calling handler <function handle_service_name_alias at 0x10cfb3048>
DEBUG:botocore.loaders:Loading JSON file: /Users/adam/opt/env3/lib/python3.7/site-packages/botocore/data/ec2/2016-11-15/service-2.json
DEBUG:botocore.hooks:Event creating-client-class.ec2: calling handler <function add_generate_presigned_url at 0x10cf70ae8>
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    ec2 = boto3.client('ec2')
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/boto3/__init__.py", line 91, in client
    return _get_default_session().client(*args, **kwargs)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/boto3/session.py", line 263, in client
    aws_session_token=aws_session_token, config=config)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/session.py", line 838, in create_client
    client_config=config, api_version=api_version)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 86, in create_client
    verify, credentials, scoped_config, client_config, endpoint_bridge)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 328, in _get_client_args
    verify, credentials, scoped_config, client_config, endpoint_bridge)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/args.py", line 47, in get_client_args
    endpoint_url, is_secure, scoped_config)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/args.py", line 117, in compute_client_args
    service_name, region_name, endpoint_url, is_secure)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 402, in resolve
    service_name, region_name)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/regions.py", line 122, in construct_endpoint
    partition, service_name, region_name)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/regions.py", line 135, in _endpoint_for_partition
    raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.

See how it started saying where it found AWS credentials? Imagine you’re trying to figure out why your script worked locally but didn’t work on an EC2 instance; knowing where it found keys is huge. Maybe there are some hardcoded ones you didn’t know about that it’s picking up instead of the IAM role you attached to the instance. In DEBUG mode that’s easy to figure out. With print you’d have to hack it out yourself.
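
If full DEBUG output is more than you need, you don’t have to turn it on globally. Loggers are named hierarchically (botocore.credentials, botocore.hooks, and so on, as you can see in the output above), so you can turn up just the one you care about. Here’s a minimal sketch; the region argument is only there so the example doesn’t hit the NoRegionError shown above:

import logging
import boto3

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger()
    # Turn up only the credential-resolution logs; everything else stays at INFO.
    logging.getLogger('botocore.credentials').setLevel(logging.DEBUG)
    logger.info('Connecting to EC2...')
    ec2 = boto3.client('ec2', region_name='us-west-2')  # example region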

This is great for simple scripts, but for something you’re going to run in production I recommend this pattern.

Happy automating!

Adam

Lambda: Filter boto3’s Logs into CloudWatch


Good morning!

If you’re writing a script with boto (i.e. not a lambda function), check out this article instead.

For those custom cases that don’t fit into Terraform or CloudFormation, a little bit of Python and some boto3 in a lambda function can save you. Lambda captures the output of both print() and logging.Logger calls into CloudWatch so it’s easy to log information about what your code is doing. When things go wrong, though, I often find that just the output I wrote doesn’t give me enough to diagnose the problem. In those cases, it’s helpful to see the log output both for your code and boto3. Here’s how you do that.

Use the logging library. It’s a Python core library that provides standard features like timestamped prefixes and support for levels (e.g. INFO or DEBUG). For simple deployment helpers this is usually all you need:

logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.info('Message at the INFO level.')
logger.debug('Message at the DEBUG level.')

This sets the root logger (which sees all log messages) to the INFO level. Normally you’d also have to attach a handler and formatter to the root logger, but lambda does that automatically (which is actually annoying if you need to change the formatter, but that’s for another post). Now, logger.info() calls will show up in the logs and logger.debug() calls won’t. If you lower the level to DEBUG you’ll see both.

Because logging is the standard Python way to handle log output, maintainers of libraries like boto3 use it throughout their code to show what the library is doing (and they’re usually smart about choosing what to log at each level). By setting a level on the root logger, you’re choosing which of your output to capture and which of boto3’s output to capture. Powerful when you’re diagnosing a failure.

Here’s a demo function to show how the output looks. You might notice that it puts the logger setup calls inside the handler even though the AWS docs tell you to put them just below the imports. Function calls made directly in a module (i.e. not inside functions declared within that module) run as import side effects, and import side effects are an anti-pattern. I put the calls in the handler so they only run when the handler is called. This isn’t likely to matter much in a lambda function, but I like to stick to good patterns.

import logging

import boto3

def lambda_handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    client = boto3.client('sts')
    account_id = client.get_caller_identity()['Account']

    logger.info('Getting account ID...')
    logger.debug('Account ID: {}'.format(account_id))
    return account_id

This is the output when run at the INFO level:

START RequestId: a61471fe-c3fd-11e8-9f43-bdb22e22a203 Version: $LATEST
[INFO]	2018-09-29T15:38:01.882Z	a61471fe-c3fd-11e8-9f43-bdb22e22a203	Found credentials in environment variables.
[INFO]	2018-09-29T15:38:02.83Z	a61471fe-c3fd-11e8-9f43-bdb22e22a203	Starting new HTTPS connection (1): sts.amazonaws.com
[INFO]	2018-09-29T15:38:02.531Z	a61471fe-c3fd-11e8-9f43-bdb22e22a203	Getting account ID...
END RequestId: a61471fe-c3fd-11e8-9f43-bdb22e22a203
REPORT RequestId: a61471fe-c3fd-11e8-9f43-bdb22e22a203	Duration: 734.96 ms	Billed Duration: 800 ms Memory Size: 128 MB	Max Memory Used: 29 MB

This is the output when run at the DEBUG level:

START RequestId: 9ea3bbef-c3fe-11e8-8eb1-730a799b5405 Version: $LATEST
[DEBUG]	2018-09-29T15:44:58.850Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.880Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable config_file from defaults.
[DEBUG]	2018-09-29T15:44:58.881Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable credentials_file from defaults.
[DEBUG]	2018-09-29T15:44:58.881Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable data_path from defaults.
[DEBUG]	2018-09-29T15:44:58.881Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable region from environment with value 'us-west-2'.
[DEBUG]	2018-09-29T15:44:58.900Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.900Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable ca_bundle from defaults.
[DEBUG]	2018-09-29T15:44:58.900Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.900Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable api_versions from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable credentials_file from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable config_file from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable metadata_service_timeout from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable metadata_service_num_attempts from defaults.
[DEBUG]	2018-09-29T15:44:58.942Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.960Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Looking for credentials via: env
[INFO]	2018-09-29T15:44:58.960Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Found credentials in environment variables.
[DEBUG]	2018-09-29T15:44:58.961Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading JSON file: /var/runtime/botocore/data/endpoints.json
[DEBUG]	2018-09-29T15:44:59.1Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:59.20Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event choose-service-name: calling handler
[DEBUG]	2018-09-29T15:44:59.60Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading JSON file: /var/runtime/botocore/data/sts/2011-06-15/service-2.json
[DEBUG]	2018-09-29T15:44:59.82Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event creating-client-class.sts: calling handler
[DEBUG]	2018-09-29T15:44:59.100Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	The s3 config key is not a dictionary type, ignoring its value of: None
[DEBUG]	2018-09-29T15:44:59.103Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Setting sts timeout as (60, 60)
[DEBUG]	2018-09-29T15:44:59.141Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading JSON file: /var/runtime/botocore/data/_retry.json
[DEBUG]	2018-09-29T15:44:59.141Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Registering retry handlers for service: sts
[DEBUG]	2018-09-29T15:44:59.160Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event before-parameter-build.sts.GetCallerIdentity: calling handler
[DEBUG]	2018-09-29T15:44:59.161Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Making request for OperationModel(name=GetCallerIdentity) (verify_ssl=True) with params: {'url_path': '/', 'query_string': '', 'method': 'POST', 'headers': {'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', 'User-Agent': 'Boto3/1.7.74 Python/3.6.1 Linux/4.14.62-65.117.amzn1.x86_64 exec-env/AWS_Lambda_python3.6 Botocore/1.10.74'}, 'body': {'Action': 'GetCallerIdentity', 'Version': '2011-06-15'}, 'url': 'https://sts.amazonaws.com/', 'context': {'client_region': 'us-west-2', 'client_config': , 'has_streaming_input': False, 'auth_type': None}}
[DEBUG]	2018-09-29T15:44:59.161Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event request-created.sts.GetCallerIdentity: calling handler
[DEBUG]	2018-09-29T15:44:59.161Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event choose-signer.sts.GetCallerIdentity: calling handler
[DEBUG]	2018-09-29T15:44:59.162Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Calculating signature using v4 auth.
[DEBUG]	2018-09-29T15:44:59.180Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	CanonicalRequest:
POST
/

content-type:application/x-www-form-urlencoded; charset=utf-8
host:sts.amazonaws.com
x-amz-date:20180929T154459Z
x-amz-security-token:FQoGZXIvYXdzEKn//////////wEaDOOlIItIhtRakeAyfCLrAWPZXQJFkNrDZNa4Bny102eGKJ5KWD0F+ixFqZaW+A9mgadICpLRxBG4JGUzMtPTDeqxPoLT1qnS6bI/jVmXXUxjVPPMRiXdIlP+li0eFyB/xOK+PN/DOiByee0eu6bjQmkjoC3P5MREvxeanPY7hpgXNO52jSBPo8LMIdAcjCJxyRF7GHZjtZGAMARQWng6DJa9RAiIbxOmXpSbNGpABBVg/TUt8XMUT+p9Lm2Txi10P0ueu1n5rcuxJdBV8Jr/PUF3nZY+/k7MzOPCnzZNqVgpDAQbwby+AVIQcvVwaKsXePqubCqBTHxoh/Mo0ay+3QU=

content-type;host;x-amz-date;x-amz-security-token
ab821ae955788b0e33ebd34c208442ccfc2d406e2edc5e7a39bd6458fbb4f843
[DEBUG]	2018-09-29T15:44:59.181Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	StringToSign:
AWS4-HMAC-SHA256
20180929T154459Z
20180929/us-east-1/sts/aws4_request
7cf0af0e8f55fb1b9c0009104aa8f141097f00fea428ddf1654321e7054a920d
[DEBUG]	2018-09-29T15:44:59.181Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Signature:
c00de0a12c9ee0fce348df452f2833749b854915db58f8d106e3166545a70c43
[DEBUG]	2018-09-29T15:44:59.183Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Sending http request:
[INFO]	2018-09-29T15:44:59.201Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Starting new HTTPS connection (1): sts.amazonaws.com
[DEBUG]	2018-09-29T15:44:59.628Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	"POST / HTTP/1.1" 200 461
[DEBUG]	2018-09-29T15:44:59.628Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Response headers: {'x-amzn-requestid': '9f421e56-c3fe-11e8-b622-2d5da14a8dc9', 'content-type': 'text/xml', 'content-length': '461', 'date': 'Sat, 29 Sep 2018 15:44:58 GMT'}
[DEBUG]	2018-09-29T15:44:59.640Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Response body:
b'\n \n arn:aws:sts::268133297303:assumed-role/demo-boto3-logging/demo-boto3-logging\n AROAITTVSA67NGZPH2QZI:demo-boto3-logging\n 268133297303\n \n \n 9f421e56-c3fe-11e8-b622-2d5da14a8dc9\n \n\n'
[DEBUG]	2018-09-29T15:44:59.640Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event needs-retry.sts.GetCallerIdentity: calling handler
[DEBUG]	2018-09-29T15:44:59.641Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	No retry needed.
[INFO]	2018-09-29T15:44:59.641Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Getting account ID...
[DEBUG]	2018-09-29T15:44:59.641Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Account ID: 268133297303
END RequestId: 9ea3bbef-c3fe-11e8-8eb1-730a799b5405
REPORT RequestId: 9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Duration: 813.73 ms	Billed Duration: 900 ms Memory Size: 128 MB	Max Memory Used: 29 MB

boto3 can be very verbose in DEBUG so I recommend staying at INFO unless you’re actively troubleshooting.
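
If you want your own DEBUG messages without drowning in boto3’s, one middle ground is to set the root logger to DEBUG and cap the library loggers at INFO. A rough sketch (boto3 and botocore are the top-level logger names the library uses; other libraries have their own names you can cap the same way):

import logging

import boto3

def lambda_handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    # Cap the chatty library loggers so only their INFO-and-above messages get through.
    for name in ('boto3', 'botocore'):
        logging.getLogger(name).setLevel(logging.INFO)

    client = boto3.client('sts')
    account_id = client.get_caller_identity()['Account']
    logger.debug('Account ID: {}'.format(account_id))
    return account_id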

Happy debugging!

Adam

3 Tools to Validate CloudFormation

Hello!

Note: If you just want the script and don’t need the background, go to the gist.

If you found this page, SEO means you probably already found the AWS page on validating CloudFormation templates. If you haven’t, read that first. It’s a better starting place.

I run three tools before applying CF templates.

#1 AWS CLI’s validator

This is the native tool. It’s ok. It’s really only a syntax checker; there are plenty of errors you won’t see until you apply a template to a stack. Still, it’s fast and catches some things.

aws cloudformation validate-template --template-body file://./my_template.yaml

Notes:

  • The CLI has to be configured with access keys or it won’t run the validator.
  • If the template is JSON, this will ignore some requirements (e.g. it’ll allow trailing commas). However, the CF service ignores the same things.

#2 Python’s JSON library

Because the AWS CLI validator ignores some JSON requirements, I like to pass JSON templates through Python’s parser to make sure they’re valid. In the past, I’ve had to do things like load and search templates for unused parameters, etc. That’s not ideal but it’s happened a couple times while doing cleanup and refactoring of legacy code. It’s easier if the JSON is valid JSON.

It’s fiddly to run this in a shell script. I do it with a heredoc so I don’t have to write multiple scripts to the filesystem:

python - <<END
import json
with open('my_template.json') as f:
    json.load(f)
END

Notes:

  • I use Python for this because it’s a dependency of the AWS CLI so I know it’s already installed. You could use jq or another tool, though.
  • I don’t do the YAML equivalent of this because it errors on CF-specific syntax like !Ref (there’s a workaround sketched below if you want one).
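
If you do want a YAML syntax check, one workaround is to register a catch-all constructor for the ! tags so the CF shorthand parses instead of erroring. This is only a sketch, and it assumes PyYAML is available (it’s typically pulled in by the AWS CLI, but that’s not guaranteed):

python - <<END
import yaml

def ignore_cfn_tags(loader, tag_suffix, node):
    # We only care whether the template parses, not what the tags mean.
    return None

yaml.SafeLoader.add_multi_constructor('!', ignore_cfn_tags)

with open('my_template.yaml') as f:
    yaml.safe_load(f)
END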

#3 cfn-nag

This is a linter for CloudFormation. It’s not perfect. I’ve seen it generate false positives like “don’t use * in IAM policy resources” even when * is the only option because it’s all that’s supported by the service I’m writing a policy for. Still, it’s one more way to catch things before you deploy, and it catches some good stuff.

cfn_nag_scan --input-path my_template.yaml

Notes:

  • Annoyingly, this is a Ruby gem so you need a new dependency chain to install it. I highly recommend setting up RVM and creating a gemset to isolate this from your system and other projects (just like you’d do with a Python venv).

Happy automating!

Adam

Python on Mac OS X: One of the Good Ways

Good morning!

When I start Python development on a new Apple, I immediately hit two problems:

  1. I need a version of Python that’s not installed.
  2. I need to install a bunch of packages from PyPI for ProjectA and a different bunch for ProjectB.

Virtualenv is not the answer! That’s the first tool you’ll hear about but it only partially solves one of these problems. You need more. There are a ton of tools and a ton of different ways to use them. Here’s how I do it on Apple’s Mac OS X.

If you’re asking questions like, “Why do you need multiple versions installed? Isn’t latest enough?” or “Why not just pip install all the packages for ProjectA and ProjectB?” then this article probably isn’t where you should start. Great answers to those questions have already been written. This is just a disambiguation page that shows you which tools to use for which problems and how to use them.

Installing Python Versions

I use pyenv, which is available in homebrew. It allows me to install arbitrary versions of Python and switch between them without replacing what’s included with the OS.

Note: You can use homebrew to install other versions of Python, but only a single version of Python 2 and a single version of Python 3 at a time. You can’t easily switch between two projects each frozen at 3.4 and 3.6 (for example). There’s also a limited list of versions available.

Install pyenv:

$ brew update
$ brew install pyenv

Ensure pyenv loads when you log in by adding this line to ~/.profile:

eval "$(pyenv init -)"

Activate pyenv now by either closing and re-opening Terminal or running:

$ source ~/.profile

List which versions are available and install one:

$ pyenv install --list
$ pyenv install 3.6.4

If the version you wanted was missing, update pyenv via homebrew:

$ brew update && brew upgrade pyenv

If you get weird errors about missing gcc or zlib, install the XCode Command Line Tools and try again:

$ xcode-select --install

I always set my global (aka default) version to the latest 3:

$ pyenv global 3.6.4

Update 2018-10-23: If I need several versions available, for example to run tests in tox:

$ pyenv global 3.6.4 3.7.0

Setting these makes versioned Python commands available:

$ python3.6 --version
$ python3.7 --version

Pyenv has lots of great features, like support for setting a different version whenever you’re in a specific directory. Check out its commands reference.

Installing PyPI Packages

In the old days, virtualenv was always the right solution. Today, it depends on the version of Python you’re using.

Python < 3.3 (Including 2)

This is legacy Python, when environment management wasn’t native. In these ancient times, you needed a third party tool called virtualenv.

$ pyenv global 2.7.14
$ pip install virtualenv
$ virtualenv ~/my_env
$ source ~/my_env/bin/activate
(my_env) $ pip install <package>

This installs the virtualenv Python package into the root environment for the legacy version of Python I need, then creates a virtual Python environment where I can install project-specific dependencies.

Python >= 3.3

With PEP 405 (Python 3.3), an environment manager called venv was added to the core library. It works pretty much like virtualenv.

Note: Virtualenv works with newer versions of Python, but it’s better to use a core library than to add a dependency. I only use the third party tool when I have to.

$ pyenv global 3.6.4
$ python -m venv ~/my_env
$ source ~/my_env/bin/activate
(my_env) $ pip install <package>

Happy programming!

Adam

EC2: Stateful Statelessness


Hello!

While studying for an AWS certification, I rediscovered a fiddly networking detail: although ICMP’s ping is stateless, EC2 security groups will pass return ping traffic even when only one direction is defined in their rules. I wanted to see this in action, so I built a lab.

If you just asked, “Wat❓”, keep reading. Skip to the next section if you just want the code.

Background

Network hosts using stateful protocols (like TCP) distinguish between packets that are part of an established connection and packets that are new. For example, when I SSH (which runs on TCP) from A to B:

  1. A asks B to start a new connection.
  2. B agrees.
  3. A and B exchange bunches of packets that are part of the connection they agreed to.
  4. A and B agree to close the connection.

There’s a difference between a new packet and a packet that’s part of an ongoing connection. That means the connection, and its packets, have state (e.g. new vs established). Stateful firewalls (vs stateless) are aware of this:

  1. A asks B to start a new connection.
  2. Firewalls in between allow these packets if there is an explicit rule allowing traffic from A to B.
  3. A and B exchange bunches of packets.
  4. Firewalls in between allow the packets from A to B because of the explicit rule above. However, they allow the return traffic from B to A even if there is no explicit rule to allow it. Since B agreed to the connection the firewall assumes that packets in that connection should be allowed.

In EC2, this is why you only need an outgoing rule on A’s Security Group (SG) and an incoming rule on B’s Security Group to SSH from A to B. EC2 SGs are stateful, and allow the return traffic implicitly.

Ok, here’s the gnarly bit. ICMP (the protocol behind ping) is stateless. Hosts don’t have a negotiation phase where they agree to establish a connection. They just send packets and hope. So, doesn’t that mean I need to write explicit firewall rules in the SGs to allow the return traffic? If the firewall can’t see the state of the connection, it won’t be able to implicitly figure out that it should allow that traffic, right?

Nope, they infer state based on timeouts and packet types. ICMP pings are ECHO requests answered by ECHO replies. If the SG has seen a request within the timeout, it makes the educated guess that replies are essentially part of “established” connections and allows them. This is what I wanted to see in action.

The Lab

I set up a VPC with two hosts, A (10.0.1.70) and B (10.0.2.35). They’re in different subnets but the ACLs allow all traffic so they don’t influence the test. Here are the SG rules for A (10.0.0.0/16 covers the entire VPC):

[Screenshots: inbound and outbound Security Group rules for instance A]

And the rules for B:

[Screenshots: inbound and outbound Security Group rules for instance B]

A allows outgoing ICMP to B, and B allows incoming ICMP from A. The return traffic is not allowed by any rules.
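
In case the screenshots don’t come through, here’s a rough boto3 sketch of equivalent rules. The group IDs and region are placeholders, it assumes the default allow-all egress rule has already been removed from A’s group, and the SSH rules I used to get into the instances are left out:

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')  # example region

SG_A = 'sg-aaaaaaaa'  # placeholder ID for A's security group
SG_B = 'sg-bbbbbbbb'  # placeholder ID for B's security group

all_icmp_in_vpc = [{'IpProtocol': 'icmp', 'FromPort': -1, 'ToPort': -1,
                    'IpRanges': [{'CidrIp': '10.0.0.0/16'}]}]

# A: allow outgoing ICMP into the VPC.
ec2.authorize_security_group_egress(GroupId=SG_A, IpPermissions=all_icmp_in_vpc)

# B: allow incoming ICMP from the VPC.
ec2.authorize_security_group_ingress(GroupId=SG_B, IpPermissions=all_icmp_in_vpc)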

The Test Script

I didn’t find a way to send just replies without requests in Linux, so I bodged together a Python script:

"""
This is a stripped-down version of ping that allows you to send a reply without a preceding request. This was needed
to test the details of how Amazon EC2 security groups handle state with ICMP traffic. You shouldn't use this for normal
pings.

The ping implementation was based on Samuel Stauffer's python-ping: https://github.com/samuel/python-ping (which only
works with Python 2).

This must be run as root.
You must tell the Linux kernel to ignore ICMP before you run this or it'll eat some of the traffic:
    echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_all
"""

import argparse, socket, struct, time

def get_arguments():
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--send-request', metavar='IP_ADDRESS', type=str, help='IP address to send ECHO request.')
    parser.add_argument('--receive', action='store_true', help='Wait for a reply.')
    parser.add_argument('--send-reply', metavar='IP_ADDRESS', type=str, help='IP address to send ECHO reply.')
    return parser.parse_args()

def receive(my_socket):
    while True:
        recPacket, addr = my_socket.recvfrom(1024)
        icmpHeader = recPacket[20:28]
        icmp_type, code, checksum, packetID, sequence = struct.unpack("bbHHh", icmpHeader)
        print('Received type {}.'.format(icmp_type))

def ping(my_socket, dest_addr, icmp_type):
    dest_addr = socket.gethostbyname(dest_addr)
    bytesInDouble = struct.calcsize("d")
    data = (192 - bytesInDouble) * "Q"
    data = struct.pack("d", time.time()) + data
    dummy_checksum = 1 & 0xffff
    dummy_id = 1 & 0xFFFF
    # Header is type (8), code (8), checksum (16), id (16), sequence (16)
    header = struct.pack("bbHHh", icmp_type, 0, socket.htons(dummy_checksum), dummy_id, 1)
    packet = header + data
    my_socket.sendto(packet, (dest_addr, 1))

if __name__ == '__main__':
    args = get_arguments()
    icmp = socket.getprotobyname("icmp")
    my_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
    if args.send_request:
        ping(my_socket, args.send_request, icmp_type=8)  # Type 8 is ECHO request.
    if args.receive:
        receive(my_socket)
    if args.send_reply:
        ping(my_socket, args.send_reply, icmp_type=0)  # Type 0 is ECHO reply.

You can skip reading the code, the important thing is that we can individually choose to listen for packets, send ECHO requests, or send ECHO replies:

python ping.py --help
usage: ping.py [-h] [--send-request IP_ADDRESS] [--receive]
               [--send-reply IP_ADDRESS]

optional arguments:
  -h, --help            show this help message and exit
  --send-request IP_ADDRESS
                        IP address to send ECHO request. (default: None)
  --receive             Wait for a reply. (default: False)
  --send-reply IP_ADDRESS
                        IP address to send ECHO reply. (default: None)

The Experiment

SSH to each host and tell Linux to ignore ICMP traffic so I can use the script to capture it (see docstring in the script above):

sudo su -
echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_all
exit

Normal Ping

I send a request from A to B and expect the reply from B to A to be allowed. Here’s what happened:

Remember A is 10.0.1.70 and B is 10.0.2.35.

[Screenshots: ping.py output on A and B for the normal ping test]

Ok, nothing surprising. I sent a request from A to B and started listening on A; a little later I sent a reply from B to A and it was allowed. You can do this test with the normal Linux ping command (but not until you tell the kernel to stop ignoring ICMP traffic). This test just validates that my bodged Python actually works.
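
In case those screenshots are missing, the invocations (from the root shells set up earlier) were roughly:

# On A (10.0.1.70): send the request and wait for the reply.
python ping.py --send-request 10.0.2.35 --receive

# On B (10.0.2.35): send the reply back to A.
python ping.py --send-reply 10.0.1.70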

Reply Only

First we wait a bit. The previous test sent a request from A to B, which started a timer in the SG. Until that timer expires, reply traffic will be allowed. We need to wait for that expiration before this next test is valid.

[Screenshots: ping.py output on A and B for the reply-only test]

Boom! I start listening on A, without sending a request. On B I send a reply to A but it never arrives. The Security Group didn’t allow it. This demonstrates that EC2 Security Groups are inferring the state of ICMP pings by reading their type.
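
Again, roughly what was run behind those screenshots:

# On A (10.0.1.70): just listen, don't send a request.
python ping.py --receive

# On B (10.0.2.35): send a bare reply to A.
python ping.py --send-reply 10.0.1.70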

Other Tests

I also tried a couple other things that I’ll leave to you to reproduce in your own lab if you want to see them.

  • Start out like the normal test. Send a request from A to B and start listening on A. Then send several replies from B to A. They’re all allowed. This shows that the SG isn’t counting to ensure it only allows one reply for each request; if it has seen just one request within the timeout it allows replies even if there are multiple.
  • Edit the script above to set the hardcoded ID to be different on A than it is on B. Then nothing works at all. I’m not actually sure what causes this. Could be that the SG is looking at more than just the type, but it could also be in the kernel or the network drivers or somewhere else I haven’t thought of. If you figure it out, message me!

Conclusion

I had free time over the holidays this year! Realistically, understanding this demo isn’t a priority for doing good work on AWS. I just enjoy unwrapping black boxes to see how the parts move.

Be well!

Adam

Production-ready Scripts and Python

Production is hard. Even a simple script that looks up queue state and sends it to an API gets complex in prod. Without tests, the divide by zero case you missed will mask queue overloads. Someone won’t see that required argument you didn’t enforce and break everything when they accidentally publish a null value. You’ll forget to timestamp one of your output lines and then when the queue goes down you won’t be able to correlate queue status to network events.

Python can help! Out of box it can give you essential but often-skipped features, like these:

  • Automated tests for multiple platforms.
  • A --simulate option.
  • Command line sanity like a --help option and enforcement of required arguments.
  • Informative log output.
  • An easy way to build and package.
  • An easy way to install a build without a git clone.
  • A command that you can just run like any other command. No weird shell setup or invocation required.

It can be a little tricky, though, if you haven’t already done it. So I wrote a project that demonstrates it for you. It includes an example of a script that isn’t ready for prod.
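
To give a taste of a couple of those items: argparse gives you --help and required-argument enforcement, and --simulate is just another flag your code checks before doing anything real. This is only a sketch with made-up names, not the full project:

import argparse
import logging

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description='Report queue state to an API.')
    parser.add_argument('--queue-name', required=True, help='Name of the queue to inspect.')
    parser.add_argument('--simulate', action='store_true',
                        help='Log what would be sent instead of calling the API.')
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')
    logging.info('Checking queue %s', args.queue_name)
    if args.simulate:
        logging.info('Simulate mode: skipping the API call.')
        return
    # The real queue lookup and API call would go here.

if __name__ == '__main__':
    main()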

Hopefully this will save you from some of the many totally avoidable, horrible problems that bad scripts have caused in my prods.

Thanks for reading!

Adam

Credit Card Debt

Technical debt is like a new credit card: it often comes with a 0% introductory interest rate. In the short term tech debt can look like a win; you get the new feature on time, you automate a manual process, you patch the bug. Maybe the implementation wasn’t perfect, but dealing with a bit of funky code or living with a few bugs is better than missing the deadline.

That loan comes due right away (you have to live with what you wrote), but the interest comes later. In a month (or three or six) something will happen that magnifies the impact of that funkiness or those bugs. You’ll need an unrelated feature but because you monkey-patched in the config for the first feature you’ll be forced to rewrite the config system before you can start, adding days to your timeline. You’ll need to install a zero-day security patch but your runtime hack will force you to shut down before you can patch, causing an outage.

Like a credit card, tech debt is manageable. If you pay back the new card on the right schedule it can get you the new TV without making you miss a rent payment. If you clean up your runtime hack in the next couple weeks it’s unlikely that a zero-day patch will be released before you’re done. If you don’t pay it back or you take out too many new cards, you can end up like the guy who makes six figures but rents a basement because his credit cards cost him $3,000 every month. You’ll fall behind on new feature development because you can’t build anything without fixing three old hacks first.

Unlike a credit card, the introductory rates of tech debt are hard to predict. You don’t know how many months of freedom from interest you get, and they may expire at the worst times. That zero-day patch might come out the week after you push your funky code to prod and you’ll be stuck with an outage. You might gamble if you know you’ll still be within the month’s SLA, but if you’ve gambled on twenty things like that you’ve got great odds that the bill on several debts will blow up at a bad time.

Every win has to come with hard questions about the debt it sits on. How much of this implementation will we be forced to rewrite? Does this new feature really work or does it just mostly work but we haven’t looked deep enough to see the problems? Do the funky parts of this code overlap with upcoming work?

Loans can get you ahead, and are manageable if you’re careful, but if you win by taking out too many it won’t matter how far ahead they got you. You’ll fall behind when they get too heavy. You’ll be a six figure team living in a basement.

Coverage, Syntax, And Chef

In Why I Don’t Track Test Coverage I explained why I don’t think coverage measures the quality of my tests. There’s a counter argument, though: 100% coverage means every line of code runs when the tests run, so it’s impossible to break prod with a misplaced comma. I think this counter argument is wrong when you’re working in Chef (I think it’s wrong in Python too but I’ll cover that in a separate post).

When Chef runs it processes each recipe into a list of actions (the ‘compile’ phase) and then it does those actions on the system (the ‘converge’ phase). The compile phase will (usually) execute every line of every recipe that’s in the run_list or is included with include_recipe. Both ChefSpec and kitchen/Serverspec take Chef through its compile phase, so in simple cases a syntax error will make both fail before the system is touched.

There are three (anti-)patterns in Chef that I know of that can sneak changes to system state past the compiler even when there are syntax errors:

#1 Raw Ruby

Chef recipes are Ruby files. You can put any valid Ruby code in them. You could do this:

File.delete('/etc/my_app/config.ini')

Ruby deletes config.ini as soon as it hits this line, before the rest of the compile finishes. If there’s a syntax problem later in the code you’ll still get an error but you’ll already have edited the system. The fallout of incomplete Chef client runs can get ugly (more on that another time).

Imagine if the tests for a Jenkins cookbook deleted the Jenkins config file. Then, a side-effect like this could take down a build server that does the innocent task of running ChefSpecs (which are only supposed to simulate Chef’s actions). It’s also surprisingly easy to accidentally hide this from the tests using #2 or #3 from below, which can cause incomplete Chef runs in production.

If you have side-effects like this in your code, replace them with a Chef resource (file with the :delete action in this case), write a custom resource, extract them into a gem that’s run before Chef runs, etc. Chef shouldn’t touch the state of the system before its converge phase.

#2 Ruby Conditions

Foodcritic, the linter for Chef, warns you not to do this:

if node['foo'] == 'bar'
  service 'apache' do
    action :enable
  end
end

Their argument is that you should use Guards, the library’s built-in feature:

service 'apache' do
  action :enable
  only_if { node['foo'] == 'bar' }
end

That’s a great argument, but there’s one more: with a Ruby condition, the resource won’t be compiled unless node['foo'] == 'bar'. That means that unless you have a test where this is set, the compiler will never touch this resource and a syntax error won’t make the tests fail.

If you follow foodcritic’s recommendation, conditional resources will always be compiled (but may not be converged) and syntax errors will fail early without you doing any work.

#3 Conditional Includes

These technically belong with the other Ruby conditions, but they’re extra-nasty so I’m dedicating a section to them.

If you do this:

if node['foo'] == 'bar'
  include_recipe 'my_cookbook::my_recipe'
end

The resources in my_recipe will only be compiled if foo is set to bar in the node object. This is like putting a Ruby condition around every resource in my_recipe.

It gets worse if your condition is processed in the converge phase. For example, you could do an include in a ruby_block:

ruby_block 'run_my_recipe' do
  block do
    if File.directory?('/etc/my_app')
      run_context.include_recipe 'my_cookbook::my_recipe'
    end
  end
end

Even if /etc/my_app exists, my_recipe won’t be compiled until Chef enters the converge phase and reaches the run_my_recipe resource. I bet you that nobody reading your cookbook will realize that it changes Chef’s “compile then converge” order into “compile some stuff then converge some stuff then compile the rest then converge the rest”. This is likely to bite you. Plus, now you have to start looking at mocks to make sure the tests exercise all your recipes. My advice is to avoid this pattern. Maybe there’s some special situation I haven’t found, but the few cases of converge-time conditional includes that I’ve seen have been hacks.

Conditional includes are usually a symptom of using Chef for something it’s not designed for. Chef is designed to converge the host where it’s running to a specific state. Its resources do a great job of detecting if that state is already present and skipping their actions if it is. If you have lots of resources that aren’t safe to run multiple times and that Chef isn’t automatically skipping then you should take a step back and make sure Chef is the right tool. Your standard approach should be to include all the recipes that may need to run and write each recipe to guard its resources from running when they shouldn’t.

 

If you write your Chef cookbooks well then you get 100% syntax coverage for free if you’ve written even one test. You can focus on exercising the logic of your recipes. Leave it to the robots to catch those misplaced commas.

Thanks for reading!

Adam

Why I Don’t Track Test Coverage

Last year I went through my programming habits looking for things to improve. All my projects had coverage reports, and my coverage was 95-100%. Looking deeper, I found that developing that coverage had actually hurt my projects. I decided to turn off coverage reports, and a year later I have no reason to turn them back on. Here’s why:

#1 Coverage is distracting.

I littered my code with markers telling the coverage engine to skip hard-to-test lines like Python’s protection against import side effects:

if __name__ == "__main__": # pragma: no cover
    main()

In some projects I had a test_coverage.py test module just for the tests that tricked the argument handling code into running. Most of those tests barely did more than assert that core libraries worked.

I also went down rabbit trails trying to find ways to mock enough boilerplate code, like module loaders, to get a few more lines to run. Those were often fiddly areas of the language and the rabbit trails were surprisingly long.

#2 Coverage earns undeserved confidence

While cleaning up old code written by somebody else I wrote a test suite to protect me from regressions. It had 98% coverage. It didn’t protect me from anything. The code was full of stuff like this:

main.py

import dbs

def helper_function():
    tables = dbs.get_tables()
    # ... other db-related stuff.

dbs.py

DBS = ['db1.local', 'db2.local']
TABLES = list()
for db in DBS:
    # Bunch of code that generates table names.

This is terrible code, but I was stuck with it. One of its problems is that importing dbs has side effects; ‘import dbs’ causes the module-level code to execute. To write a simple test of helper_function I had to import main.py, which imported the dbs module, which ran all the lines in that module. A test of a five-line function took me from 0% coverage to over 50%.

When I hit 98% coverage I stopped writing tests, but I was still plagued by regressions during my refactors. The McCabe complexity of the code was over 12 and asserting the behavior buried in those lines needed two or three times the number of tests I’d written. Most tests would run the same lines over and over because of the import side-effects, but each test would work the code in a different way.

 

I considered revising my test coverage habits. Excluding broader sections of code from coverage reports so I didn’t have to mock out things like argument handling. Reducing my threshold of coverage from 95% to 75%. Treating legacy code as a special case and just turning off coverage there. But if I did all those things, the tests that were left were the tests I’d have written whether or not I was thinking about coverage.

Today, I don’t think about covering the code, I think about exercising it. I ask myself questions like these:

  • Is there anything I check by hand after running the tests? Write a test for that.
  • Will it work if it gets junk input? Use factories or mocks to create that input.
  • If I pass it the flag to simulate, does it touch the database? Write a test with a mocked DB connector to make sure it doesn’t (there’s a sketch of that one after this list).
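
For example, the last one might look something like this sketch (queue_report and get_db_connection are made-up names standing in for the real script and its database helper):

from unittest import mock

import queue_report  # hypothetical module under test

def test_simulate_does_not_touch_the_database():
    with mock.patch('queue_report.get_db_connection') as fake_connect:
        queue_report.main(['--queue-name', 'jobs', '--simulate'])
        fake_connect.assert_not_called()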

Your tests shouldn’t be there to run lines of code. They should be there to assert that the code does what it’s supposed to do.

Happy Thanksgiving!

Adam