Articles

The New Project Manager’s Glossary: Cloud and DevOps

I often meet Project Managers who are new to the cloud or DevOps or sometimes new to software altogether. There’s plenty of jargon in these spaces, and often definitions are hard to find. Quite a few folks have asked me to help define the jargon, so I decided to write it up.

This is an opinionated list. It’s also a simplification. It summarizes what I’ve personally learned in my years in these spaces. There are other definitions, but these should get you close enough to work within the context of conversations.

This list starts with the boring terms that you’re most likely to already know and builds them up into the more esoteric ones. Sort of. There’s a lot of interconnection. If you see a term you don’t know, try looking farther down in the list.

I’ve written these definitions around an example case: a company that makes a golfing website where golf enthusiasts can buy golf stuff. The product is the golfing website. The customers are the golfers.

Code: A synonym of software and of program. You “write” or “develop” code/software/programs. Code is the informal term, software is the formal one. Program is an older word that nobody says anymore.

Development: The process of writing code. A synonym of coding and of programming. Coding is the informal term, developing is the more formal one. Programming is still in use, but it’s less common. “Coders developing software” means the same thing as “software engineers writing code” means the same thing as “coders coding”. Technically those are the same as “programming programs”, but nobody would say it that way.

Application Development: The same as development, but specifically the development of the golfing website. This distinction matters because DevOps engineers are also developers who write software, but their software never gets used by customers.

End User: The customers who actually use the final product. Golfers who buy golf stuff from your golfing website. They’re the people at the “end” of the whole system of technology that makes the website work.

Server: A computer that runs the golfing website. Similar to a laptop running Netflix. Fundamentally, servers are the same type of thing as laptops, they’re just used for different purposes.

Compute Resource: A server, but in the cloud. This is one of the biggest simplifications in this list, but it’s good enough to get the context of most conversations. Engineers mostly say “server” even when they technically mean “compute resource”. See “serverless” below.

Infrastructure: A bunch of servers all hooked together. Infrastructure includes all the connecting bits (like the networks that they use to communicate). Individual servers aren’t good for much without the infrastructure they live in. Modern golfing websites run on complex infrastructures, not on individual servers. Infrastructure comes in endless varieties.

Serverless: Technically a better way to say this is “serverless platform”, but a lot of people just say “serverless”. A type of compute resource that doesn’t require you to manage your own servers. That reduces the amount of deployment automation that DevOps engineers have to write. Today, not all products are compatible with serverless. Serverless platforms are services sold as part of clouds, and each one is different. If your application works in one serverless platform it may not work in another. It’s common to say “going serverless” when you mean “assigning our application developers to make our product compatible with Amazon’s lambda serverless platform (because we’re tired of managing servers)”.

Containers: Containers allow engineers to create mini-servers for their products that can be easily started and stopped on whatever infrastructure needs them. This simplifies deploying the same product to different infrastructures (e.g. you might sell it as a product that multiple customers would each want to run in their own infrastructure). It can also simplify adding and removing capacity because it’s easy to add and remove more copies of the same container.

I’m going to pause the list here and note that servers, compute resources, serverless platforms, and containers are all interconnected concepts that can combine and overlap in endless varieties. A lot of the work done by DevOps engineers today is around deciding which patterns of these to use.

Deployment: The golfing website runs on infrastructure. To run, it has to be deployed. Code has to be copied over, configuration entered, commands run. Similar to how you have to install the Netflix app on a laptop before you can stream video. Together, the outcome of these actions is the deployment.

Deployment Automation: Software that deploys other software to infrastructure. It’s cheaper and more reliable to build a tool to deploy your product than to let an error-prone human do it by hand. Today, most golfing websites have two major components: the actual product code and the deployment automation code that manages its infrastructure.

Deployment Pipeline: Tooling built around deployment automation that delivers the golfing website to infrastructure. Like any software, deployment automation has to actually run somewhere (e.g. on compute resources). The deployment pipeline is that somewhere. You might ask, “what runs the deployment pipeline?” A fair question with no easy answer. This is a chicken-and-egg situation and the implementations vary a lot. Typically the pipeline and the deployment automation are part of the same code, but that’s not something that matters much outside of an engineer’s world.

Build Pipeline: This is beyond the scope of a cloud/DevOps list, but it’s worth distinguishing from deployment pipelines. Build pipelines are the tools that deliver the golfing website code to deployment automation. They’ll do things like run tests to see if there are bugs, do some formatting to make it easier to deploy, etc.

Build: A packaged version of the golfing website that’s ready to deploy. Typically this is the output of a build pipeline. It’s possible to deploy software that hasn’t been “built”, but that’s generally considered a bad practice. The details here vary a lot, but it’s usually good enough to know that a build is the outcome of application development and is also the thing that is deployed to infrastructure.

Release: A version of the golfing website. There is usually a “build” of a “release”. The distinction isn’t important in very many non-technical conversations. This can also be a verb: “we’re going to release the latest version of the golfing website on Thursday”.

The Cloud: A misnomer. There isn’t a cloud, there are many clouds. Clouds are products owned by corporations. Clouds provide infrastructure where you can run golfing websites. Each cloud is different, and if you build a product on one it won’t (easily) work on another. Typically, clouds let you increase and decrease what you use (and pay for) day to day. Historically, you’d have to buy enough servers to handle your busiest day even if that meant most of them sat idle on your quietest day. Clouds have grown far beyond that one benefit and provide all kinds of ancillary services, but at their core the value is on-demand pricing. You pay for what you’re using right now, not what you might need to use tomorrow.

AWS: Amazon Web Services. A cloud. Owned by Amazon. Distinct from amazon.com. Amazon.com is an e-commerce product that is deployed to AWS. If someone says they’re going to “the cloud”, they likely mean AWS. At time of writing, AWS had the largest market share of all the clouds.

Azure: A cloud. Owned by Microsoft.

Google Cloud: A cloud. Owned by Google. Distinct from the Google search engine.

Application Developer: An engineer who writes the golfing website code.

System Administrator: Also called a sysadmin. An engineer who manually deploys the golfing website to infrastructure. These roles have been mostly replaced by DevOps.

Operator: A technician who monitors running infrastructure and responds if there are problems (so if golfers report that they can’t get to the golfing site, an operator will be the first person to do something about it). In environments without automation, operators are also typically responsible for deploying code to infrastructure. Increasingly these roles are being replaced by automation developed by DevOps Engineers.

DevOps Engineer: An engineer who writes deployment automation. So if you want your golfing website deployed to the AWS cloud, you’d need a DevOps engineer to write automation to do that. DevOps roles often include other responsibilities, but this is the core.

SRE: Site Reliability Engineer. Usually this is the same role as DevOps engineer, just under a different name. ⬅️ This definition will start fights with a lot of people. I recommend never saying this. It’s enough to know that SREs typically have very similar jobs to DevOps engineers.

I hope this helped! Happy project managing,

Adam

PowerShell Help Commands For Linux Users

Hello!

Microsoft’s Azure keeps growing. Azure isn’t all Windows, but Windows is a great reason to use Azure and a lot of Windows workloads get run there. In today’s Windows, PowerShell is the engineer’s interface to the OS and is often the language behind automation. I decided I needed to know PowerShell to keep up with the industry.

You can run PowerShell on OSX! There are a few differences (e.g. it’s still running inside of the Terminal app so things like ctrl+u work, but in native PowerShell on Windows they don’t), but the basic system is the same. I’ve been running it day to day so I can get used to it.

In bash, there are a bunch of commands I use when I’m trying to figure things out. Like searching the shell aliases or the man pages database. I found plenty of great guides on the basics of PowerShell, but it took some fiddling to find all the parallel pwsh commands for my “figuring things out” commands. I also found a couple handy commands that don’t exist directly in bash (that I know of). Here’s a table:

  • Bash: man cd
    PowerShell: Get-Help cd
  • Bash: man -k 'search string'
    PowerShell: Get-Help 'search string'
  • Bash: alias
    PowerShell: Get-Alias
  • Bash: no direct equivalent. Searches for aliases of a specific cmdlet (‘out-host’ in this example):
    PowerShell: Get-Alias -Definition 'out-host'
  • Bash: alias | grep move
    PowerShell: Get-Alias | Out-String -stream | Select-String -Pattern 'move'
  • Bash: env
    PowerShell: Get-ChildItem env:
  • Bash: env | grep PATH
    PowerShell: Get-ChildItem env: | Out-String -stream | sls -Pattern 'PATH'
  • Bash: no direct equivalent. Wildcard-matches environment variable names but not values (‘PATH’ in this example):
    PowerShell: Get-ChildItem env:*PATH*

One other tip is that PowerShell has great tab-completion for commands. It also follows a consistent Verb-Noun convention, so you can try to guess the verb and tab-complete your way to the command. Like if you wanted to know the time, try “get” and hit tab twice and you’ll get a list that has “Get-Date” in it.

Hope that saves you some time!

Adam

Simple Python boto3 Logging

Hello!

If you’re writing a lambda function, check out this article instead.

The best way to log output from boto3 is with Python’s logging library. The core docs have a nice tutorial (⬅️ check this out if you need more details on things like what log levels are and how to add timestamps to messages).

Good libraries, like boto, use Python’s logging library internally. If you set up a logger using the same library, boto’s logs will output automatically (along with your own logs). If you use print() statements, the only boto output you’ll see is what you capture and print yourself.

Here’s what I do for (simple) boto scripts:

import logging
import boto3

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger()
    logger.info('Connecting to EC2...')
    ec2 = boto3.client('ec2')

That’s it! The basicConfig() function sets up the root logger for you at whatever level you choose. Anything you output in info() lines will appear. Anything boto logged at the info level (or higher) will appear.

If you set the level to INFO, you’ll get limited output:

INFO:root:Connecting to EC2...
INFO:botocore.credentials:Found credentials in environment variables.
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    ec2 = boto3.client('ec2')
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/boto3/__init__.py", line 91, in client
    return _get_default_session().client(*args, **kwargs)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/boto3/session.py", line 263, in client
    aws_session_token=aws_session_token, config=config)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/session.py", line 838, in create_client
    client_config=config, api_version=api_version)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 86, in create_client
    verify, credentials, scoped_config, client_config, endpoint_bridge)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 328, in _get_client_args
    verify, credentials, scoped_config, client_config, endpoint_bridge)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/args.py", line 47, in get_client_args
    endpoint_url, is_secure, scoped_config)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/args.py", line 117, in compute_client_args
    service_name, region_name, endpoint_url, is_secure)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 402, in resolve
    service_name, region_name)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/regions.py", line 122, in construct_endpoint
    partition, service_name, region_name)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/regions.py", line 135, in _endpoint_for_partition
    raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.

If you turn the level down to DEBUG, boto will tell you everything:

INFO:root:Connecting to EC2...
DEBUG:botocore.hooks:Changing event name from creating-client-class.iot-data to creating-client-class.iot-data-plane
DEBUG:botocore.hooks:Changing event name from before-call.apigateway to before-call.api-gateway
DEBUG:botocore.hooks:Changing event name from request-created.machinelearning.Predict to request-created.machine-learning.Predict
DEBUG:botocore.hooks:Changing event name from before-parameter-build.autoscaling.CreateLaunchConfiguration to before-parameter-build.auto-scaling.CreateLaunchConfiguration
DEBUG:botocore.hooks:Changing event name from before-parameter-build.route53 to before-parameter-build.route-53
DEBUG:botocore.hooks:Changing event name from request-created.cloudsearchdomain.Search to request-created.cloudsearch-domain.Search
DEBUG:botocore.hooks:Changing event name from docs.*.autoscaling.CreateLaunchConfiguration.complete-section to docs.*.auto-scaling.CreateLaunchConfiguration.complete-section
DEBUG:botocore.hooks:Changing event name from before-parameter-build.logs.CreateExportTask to before-parameter-build.cloudwatch-logs.CreateExportTask
DEBUG:botocore.hooks:Changing event name from docs.*.logs.CreateExportTask.complete-section to docs.*.cloudwatch-logs.CreateExportTask.complete-section
DEBUG:botocore.hooks:Changing event name from before-parameter-build.cloudsearchdomain.Search to before-parameter-build.cloudsearch-domain.Search
DEBUG:botocore.hooks:Changing event name from docs.*.cloudsearchdomain.Search.complete-section to docs.*.cloudsearch-domain.Search.complete-section
DEBUG:botocore.credentials:Looking for credentials via: env
INFO:botocore.credentials:Found credentials in environment variables.
DEBUG:botocore.loaders:Loading JSON file: /Users/adam/opt/env3/lib/python3.7/site-packages/botocore/data/endpoints.json
DEBUG:botocore.hooks:Event choose-service-name: calling handler <function handle_service_name_alias at 0x10cfb3048>
DEBUG:botocore.loaders:Loading JSON file: /Users/adam/opt/env3/lib/python3.7/site-packages/botocore/data/ec2/2016-11-15/service-2.json
DEBUG:botocore.hooks:Event creating-client-class.ec2: calling handler <function add_generate_presigned_url at 0x10cf70ae8>
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    ec2 = boto3.client('ec2')
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/boto3/__init__.py", line 91, in client
    return _get_default_session().client(*args, **kwargs)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/boto3/session.py", line 263, in client
    aws_session_token=aws_session_token, config=config)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/session.py", line 838, in create_client
    client_config=config, api_version=api_version)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 86, in create_client
    verify, credentials, scoped_config, client_config, endpoint_bridge)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 328, in _get_client_args
    verify, credentials, scoped_config, client_config, endpoint_bridge)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/args.py", line 47, in get_client_args
    endpoint_url, is_secure, scoped_config)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/args.py", line 117, in compute_client_args
    service_name, region_name, endpoint_url, is_secure)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 402, in resolve
    service_name, region_name)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/regions.py", line 122, in construct_endpoint
    partition, service_name, region_name)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/regions.py", line 135, in _endpoint_for_partition
    raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.

See how it started saying where it found AWS credentials? Imagine you’re trying to figure out why your script worked locally but didn’t work on an EC2 instance; knowing where it found keys is huge. Maybe there are some hardcoded ones you didn’t know about that it’s picking up instead of the IAM role you attached to the instance. In DEBUG mode that’s easy to figure out. With print you’d have to hack it out yourself.
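
If you want that credential detail without the rest of boto’s DEBUG firehose, you can leave the root logger at INFO and turn up just the logger boto uses for credential resolution. A minimal sketch (the 'botocore.credentials' logger name matches the output above; treat internal logger names as something to double-check against your botocore version):

import logging

import boto3

if __name__ == '__main__':
    # Your messages and boto's messages at INFO and above...
    logging.basicConfig(level=logging.INFO)
    # ...plus full detail from credential resolution only.
    logging.getLogger('botocore.credentials').setLevel(logging.DEBUG)

    logger = logging.getLogger()
    logger.info('Connecting to EC2...')
    ec2 = boto3.client('ec2')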

This is great for simple scripts, but for something you’re going to run in production I recommend this pattern.

Happy automating!

Adam

DevOps FMEA Shortlist

Update June 2019: Added cache server failures.

Failure Modes and Effects Analysis (FMEA) is a formal practice of breaking individual pieces of your system and watching how each failure affects the system as a whole (this is the definition I’ve seen used most in tech). Stop one server in a pool and capacity drops. No problem. Stop one server in a pool and users can’t log in anymore? Problem. FMEA helps ensure that downstream effects of failures aren’t bigger than you expect.

Real FMEA is a Big Deal and is often out of reach. But, even doing small FMEA during routine development can save your life. I love a good list, and over the years I’ve assembled one from real world failures that had bigger consequences than expected. Testing those failures has saved me many times, so I thought I’d share the list.

Some failures might look the same, like rebooting and terminating, but the differences matter. I’ve seen a lot of infrastructures go down because deployment automation started app processes with nohup and didn’t create an init config to ensure they restarted on reboot.

Failure → Expected Outcome

  • Clocks are slow (e.g. ntpd died yesterday) → Alerts to operators who can respond
  • Config that prevents the app from starting is deployed → Clear failure message from deploys (but no outage in the running system)
  • External service goes down (e.g. PyPI) → Clear failure message from deploys (but no outage in the running system)
  • Database is out of space → Alerts to operators who can respond
  • App process is killed by the Linux OOM → Alerts to operators and a short drop in capacity that recovers automatically
  • That singleton host you haven’t gotten rid of terminates → Brief outage that recovers automatically
  • App host reboots → Short drop in capacity that recovers automatically
  • App host terminates → Short drop in capacity that recovers automatically
  • App service stops on a host → Short drop in capacity that recovers automatically
  • All app hosts terminate → Brief outage that recovers automatically
  • Worker host reboots → Short spike in queue length
  • Worker host terminates → Short spike in queue length
  • Worker service stops on a host → Short spike in queue length
  • All worker hosts terminate → Slightly longer spike in queue length
  • Cache host reboots → Temporarily increased response times, some users are logged out
  • Cache host terminates → Temporarily increased response times, some users are logged out
  • Cache service stops on a host → Temporarily increased response times, some users are logged out

There’s a lot more you’d have to test to fully exercise your system; this is just a lightweight list of the failures I’ve found most valuable to test. I usually don’t run every simulation when developing every feature, just the ones closely connected to what I’m building. It’s also a good practice to simulate several failures at once. What happens if an external service like PyPI is down and a host is terminated?
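
To give a sense of how small these simulations can be, here’s a sketch of the “App host terminates” row using boto3. The Auto Scaling group name is hypothetical, and you’d want to try something like this in a test environment before pointing it anywhere important:

import random

import boto3

def terminate_random_app_host(asg_name):
    """Simulate 'App host terminates' and let the ASG replace it."""
    autoscaling = boto3.client('autoscaling')
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )['AutoScalingGroups'][0]
    instance_id = random.choice(group['Instances'])['InstanceId']
    # Keep desired capacity the same so the group launches a replacement.
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=False,
    )
    return instance_id

if __name__ == '__main__':
    # 'golf-website-app' is a made-up Auto Scaling group name.
    print(terminate_random_app_host('golf-website-app'))

The simulation itself is just the terminate call; the real work is watching your alerts and dashboards to confirm the expected outcome from the table (a short drop in capacity that recovers automatically) and nothing worse.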

The full details on why each of these is important are out of scope today, but if you’re interested in them let me know and I’ll look at covering them in future articles.

If you’re already thinking of modifications you’d need to make for this list to work for you, that’s good! Maybe you don’t depend on time sync (although you likely do, so many base components do that it’s hard to avoid). Maybe you have a legacy queue system that has unique failure cases (like that time the instance type changed which changed the processor count which made it run out of licenses). Maybe you don’t depend on singletons (huzzah!). Adjust as needed.

Happy automating!

Adam

CF Custom Resources: Avoiding the Two Hour Exception Timeout


There’s a gotcha when writing CloudFormation Custom Resources that’s easy to miss, and if you miss it your stack can get stuck, ignoring its timeout setting. It’ll fail on its own after an hour, but if it tries to roll back you have to wait a second hour. Here’s how to avoid that.

This post assumes you’re already working with Custom Resources and that yours are backed by lambda.

Here’s an empty custom resource:

import logging
import cfnresponse

def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    if event['RequestType'] == 'Delete':
        logger.info('Deleted!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
        return

    logger.info('It worked!')
    cfnresponse.send(event, context, cfnresponse.SUCCESS, {})

It’s a successful no-op:

[Screenshot: successful no-op]

Now let’s add an exception:

import logging
import cfnresponse

def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    if event['RequestType'] == 'Delete':
        logger.info('Deleted!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
        return

    raise Exception
    logger.info('It worked!')
    cfnresponse.send(event, context, cfnresponse.SUCCESS, {})

We can see the exception in the logs:

[Screenshot: the exception and three retries in the logs]

But, then the stack gets stuck because the cfnresponse callback never happened and CF doesn’t know there was a problem:

[Screenshot: failure timeouts]

It took exactly an hour to fail, which suggests CF hit some internal, fallback timeout. My stack timeout was set to five minutes. We can see it retry the lambda function once a minute for three minutes, but then it never tries again in the remaining 57 minutes. I got the same delays in reverse when it tried to roll back (which is really just another update to the previous state). And, since the rollback failed, I had to manually edit the lambda function code and remove the exception to get it to finish rolling back.

Maybe this is a bug? Either way, there’s a workaround.

You should usually only catch specific errors that you know how to handle. It’s an anti-pattern to use except Exception. But, in this case we need to guarantee that the callback always happens. In this one situation (not in general) we need to catch all exceptions:

import logging
import cfnresponse

def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    try:
        if event['RequestType'] == 'Delete':
            logger.info('Deleted!')
            cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
            return

        raise Exception
        logger.info('It worked!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except Exception:
        logger.exception('Signaling failure to CloudFormation.')
        cfnresponse.send(event, context, cfnresponse.FAILED, {})

Now, the failure is visible to CF and it doesn’t wait:

ExceptionHandled.png

You should use this pattern in every Custom Resource: catch all exceptions and return a FAILED result to CF. You can still catch more specific exceptions inside the catchall try/except, ones specific to the feature you’re implementing, but you need that catchall to ensure the result returns when the unexpected happens.
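
For example, if your resource calls other AWS APIs, you can still handle the errors you expect in their own except clauses and let the catchall backstop everything else. Here’s a sketch of that shape (do_the_actual_work is a made-up placeholder for your resource’s real logic):

import logging

import botocore.exceptions
import cfnresponse

def do_the_actual_work(event):
    # Placeholder for whatever your Custom Resource actually does.
    pass

def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    try:
        do_the_actual_work(event)
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except botocore.exceptions.ClientError:
        # An error we expect and can log with more context.
        logger.exception('AWS API call failed.')
        cfnresponse.send(event, context, cfnresponse.FAILED, {})
    except Exception:
        # The backstop: whatever happens, CloudFormation gets an answer.
        logger.exception('Signaling failure to CloudFormation.')
        cfnresponse.send(event, context, cfnresponse.FAILED, {})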

Happy automating!

Adam

Lambda: Filter boto3’s Logs into CloudWatch


Good morning!

If you’re writing a script with boto (i.e. not a lambda function), check out this article instead.

For those custom cases that don’t fit into Terraform or CloudFormation, a little bit of Python and some boto3 in a lambda function can save you. Lambda captures the output of both print() and logging.Logger calls into CloudWatch so it’s easy to log information about what your code is doing. When things go wrong, though, I often find that just the output I wrote doesn’t give me enough to diagnose the problem. In those cases, it’s helpful to see the log output both for your code and boto3. Here’s how you do that.

Use the logging library. It’s a Python core library that provides standard features like timestamped prefixes and support for levels (e.g. INFO or DEBUG). For simple deployment helpers this is usually all you need:

logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.info('Message at the INFO level.')
logger.debug('Message at the DEBUG level.')

This sets the root logger (which sees all log messages) to the INFO level. Normally you’d have to configure the root logger, but lambda does that automatically (which is actually annoying if you need to change your formatter, but that’s for another post). Now, logger.info() calls will show up in the logs and logger.debug() calls won’t. If you increase the level to DEBUG you’ll see both.

Because logging is the standard Python way to handle log output, maintainers of libraries like boto3 use it throughout their code to show what the library is doing (and they’re usually smart about choosing what to log at each level). By setting a level on the root logger, you’re choosing which of your output to capture and which of boto3’s output to capture. Powerful when you’re diagnosing a failure.

Here’s a demo function to show how the output looks. You might notice that it puts the logger setup calls inside the handler even though the AWS docs tell you to put them under the import. Function calls made directly in modules (i.e. not inside functions declared within the module) are import side effects, and import side effects are an anti-pattern. I put the calls in the handler so they only run when the handler is called. This isn’t likely to matter much in a lambda function, but I like to stick to good patterns.

import logging

import boto3

def lambda_handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    client = boto3.client('sts')
    account_id = client.get_caller_identity()['Account']

    logger.info('Getting account ID...')
    logger.debug('Account ID: {}'.format(account_id))
    return account_id

This is the output when run at the INFO level:

START RequestId: a61471fe-c3fd-11e8-9f43-bdb22e22a203 Version: $LATEST
[INFO]	2018-09-29T15:38:01.882Z	a61471fe-c3fd-11e8-9f43-bdb22e22a203	Found credentials in environment variables.
[INFO]	2018-09-29T15:38:02.83Z	a61471fe-c3fd-11e8-9f43-bdb22e22a203	Starting new HTTPS connection (1): sts.amazonaws.com
[INFO]	2018-09-29T15:38:02.531Z	a61471fe-c3fd-11e8-9f43-bdb22e22a203	Getting account ID...
END RequestId: a61471fe-c3fd-11e8-9f43-bdb22e22a203
REPORT RequestId: a61471fe-c3fd-11e8-9f43-bdb22e22a203	Duration: 734.96 ms	Billed Duration: 800 ms Memory Size: 128 MB	Max Memory Used: 29 MB

This is the output when run at the DEBUG level:

START RequestId: 9ea3bbef-c3fe-11e8-8eb1-730a799b5405 Version: $LATEST
[DEBUG]	2018-09-29T15:44:58.850Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.880Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable config_file from defaults.
[DEBUG]	2018-09-29T15:44:58.881Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable credentials_file from defaults.
[DEBUG]	2018-09-29T15:44:58.881Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable data_path from defaults.
[DEBUG]	2018-09-29T15:44:58.881Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable region from environment with value 'us-west-2'.
[DEBUG]	2018-09-29T15:44:58.900Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.900Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable ca_bundle from defaults.
[DEBUG]	2018-09-29T15:44:58.900Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.900Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable api_versions from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable credentials_file from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable config_file from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable metadata_service_timeout from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable metadata_service_num_attempts from defaults.
[DEBUG]	2018-09-29T15:44:58.942Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.960Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Looking for credentials via: env
[INFO]	2018-09-29T15:44:58.960Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Found credentials in environment variables.
[DEBUG]	2018-09-29T15:44:58.961Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading JSON file: /var/runtime/botocore/data/endpoints.json
[DEBUG]	2018-09-29T15:44:59.1Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:59.20Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event choose-service-name: calling handler
[DEBUG]	2018-09-29T15:44:59.60Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading JSON file: /var/runtime/botocore/data/sts/2011-06-15/service-2.json
[DEBUG]	2018-09-29T15:44:59.82Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event creating-client-class.sts: calling handler
[DEBUG]	2018-09-29T15:44:59.100Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	The s3 config key is not a dictionary type, ignoring its value of: None
[DEBUG]	2018-09-29T15:44:59.103Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Setting sts timeout as (60, 60)
[DEBUG]	2018-09-29T15:44:59.141Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading JSON file: /var/runtime/botocore/data/_retry.json
[DEBUG]	2018-09-29T15:44:59.141Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Registering retry handlers for service: sts
[DEBUG]	2018-09-29T15:44:59.160Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event before-parameter-build.sts.GetCallerIdentity: calling handler
[DEBUG]	2018-09-29T15:44:59.161Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Making request for OperationModel(name=GetCallerIdentity) (verify_ssl=True) with params: {'url_path': '/', 'query_string': '', 'method': 'POST', 'headers': {'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', 'User-Agent': 'Boto3/1.7.74 Python/3.6.1 Linux/4.14.62-65.117.amzn1.x86_64 exec-env/AWS_Lambda_python3.6 Botocore/1.10.74'}, 'body': {'Action': 'GetCallerIdentity', 'Version': '2011-06-15'}, 'url': 'https://sts.amazonaws.com/', 'context': {'client_region': 'us-west-2', 'client_config': , 'has_streaming_input': False, 'auth_type': None}}
[DEBUG]	2018-09-29T15:44:59.161Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event request-created.sts.GetCallerIdentity: calling handler
[DEBUG]	2018-09-29T15:44:59.161Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event choose-signer.sts.GetCallerIdentity: calling handler
[DEBUG]	2018-09-29T15:44:59.162Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Calculating signature using v4 auth.
[DEBUG]	2018-09-29T15:44:59.180Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	CanonicalRequest:
POST
/

content-type:application/x-www-form-urlencoded; charset=utf-8
host:sts.amazonaws.com
x-amz-date:20180929T154459Z
x-amz-security-token:FQoGZXIvYXdzEKn//////////wEaDOOlIItIhtRakeAyfCLrAWPZXQJFkNrDZNa4Bny102eGKJ5KWD0F+ixFqZaW+A9mgadICpLRxBG4JGUzMtPTDeqxPoLT1qnS6bI/jVmXXUxjVPPMRiXdIlP+li0eFyB/xOK+PN/DOiByee0eu6bjQmkjoC3P5MREvxeanPY7hpgXNO52jSBPo8LMIdAcjCJxyRF7GHZjtZGAMARQWng6DJa9RAiIbxOmXpSbNGpABBVg/TUt8XMUT+p9Lm2Txi10P0ueu1n5rcuxJdBV8Jr/PUF3nZY+/k7MzOPCnzZNqVgpDAQbwby+AVIQcvVwaKsXePqubCqBTHxoh/Mo0ay+3QU=

content-type;host;x-amz-date;x-amz-security-token
ab821ae955788b0e33ebd34c208442ccfc2d406e2edc5e7a39bd6458fbb4f843
[DEBUG]	2018-09-29T15:44:59.181Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	StringToSign:
AWS4-HMAC-SHA256
20180929T154459Z
20180929/us-east-1/sts/aws4_request
7cf0af0e8f55fb1b9c0009104aa8f141097f00fea428ddf1654321e7054a920d
[DEBUG]	2018-09-29T15:44:59.181Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Signature:
c00de0a12c9ee0fce348df452f2833749b854915db58f8d106e3166545a70c43
[DEBUG]	2018-09-29T15:44:59.183Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Sending http request:
[INFO]	2018-09-29T15:44:59.201Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Starting new HTTPS connection (1): sts.amazonaws.com
[DEBUG]	2018-09-29T15:44:59.628Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	"POST / HTTP/1.1" 200 461
[DEBUG]	2018-09-29T15:44:59.628Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Response headers: {'x-amzn-requestid': '9f421e56-c3fe-11e8-b622-2d5da14a8dc9', 'content-type': 'text/xml', 'content-length': '461', 'date': 'Sat, 29 Sep 2018 15:44:58 GMT'}
[DEBUG]	2018-09-29T15:44:59.640Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Response body:
b'\n \n arn:aws:sts::268133297303:assumed-role/demo-boto3-logging/demo-boto3-logging\n AROAITTVSA67NGZPH2QZI:demo-boto3-logging\n 268133297303\n \n \n 9f421e56-c3fe-11e8-b622-2d5da14a8dc9\n \n\n'
[DEBUG]	2018-09-29T15:44:59.640Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event needs-retry.sts.GetCallerIdentity: calling handler
[DEBUG]	2018-09-29T15:44:59.641Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	No retry needed.
[INFO]	2018-09-29T15:44:59.641Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Getting account ID...
[DEBUG]	2018-09-29T15:44:59.641Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Account ID: 268133297303
END RequestId: 9ea3bbef-c3fe-11e8-8eb1-730a799b5405
REPORT RequestId: 9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Duration: 813.73 ms	Billed Duration: 900 ms Memory Size: 128 MB	Max Memory Used: 29 MB

boto3 can be very verbose in DEBUG so I recommend staying at INFO unless you’re actively troubleshooting.
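
If even boto3’s INFO messages are more than you want (the credential and connection lines above, for example), you can keep your own logger at INFO and push boto’s loggers up to WARNING. A small sketch of that variation on the handler above:

import logging

import boto3

def lambda_handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    # Quiet boto's INFO chatter; your own logger.info() calls still show up.
    logging.getLogger('boto3').setLevel(logging.WARNING)
    logging.getLogger('botocore').setLevel(logging.WARNING)

    client = boto3.client('sts')
    account_id = client.get_caller_identity()['Account']

    logger.info('Getting account ID...')
    return account_id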

Happy debugging!

Adam

Securing IAM Policies


Since the beginning, writing IAM policies with the minimum necessary permissions has been hard. Some services don’t have resource-level permissions (you have to grant to *), then later they add them. When a service has resource-level permissions, it may only be for some of its permissions (the rest still need *). Some services have their own Condition Operators (separate from the global ones) that may or may not help you tighten control. Et cetera. The details are documented differently for each service, and it’s a lot of hunting and testing to try to put together a tight policy.

Amazon made it easier! There’s new magic in the IAM UI to help you create policies. It has some limitations, but it’s a big improvement. Here are some of the things it can do that I used to have to do myself:

  • Knows which S3 permissions require the resource list to include a bucket name and which require the bucket name and an object path (there’s an example policy after this list).
  • Tries to group permissions and resources into statements when it results in equivalent access (but sometimes ends up granting extra access, see below).
  • Knows when a service doesn’t support resource-level permissions.
  • Knows about the Condition Operators specific to each service (not just the global ones).
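
As an example of that first point, here’s roughly the shape the S3 distinction takes in a finished policy (the bucket name and object path are hypothetical): list-style actions get the bucket ARN, object-style actions get the bucket ARN plus a path.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::example-golf-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::example-golf-bucket/uploads/*"
    }
  ]
}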

There are some limitations:

  • Doesn’t deduplicate. If you add permissions it doesn’t go back and put them into existing statements; it just adds new statements that may duplicate parts of old ones.
  • Only generates JSON, so if you’re writing a YAML CloudFormation template you’ll have to translate.
  • Seems to have limited form validation on Condition Operators. You can put in strings that will never match because the API calls for that service can’t contain what you entered (making the statement a no-op).
  • Can end up grouping permissions in a way that makes some resource restrictions meaningless and grants more access than might be expected.
  • Sometimes it messes up the syntax. This seems to happen if you don’t put exactly what it expects into the forms.

 

So there are a few problems, but this is still way better than it was before! My plan is to use the visual editor to write policies, then go through and touch them up afterward. Based on what I’ve seen so far, this cuts the time it takes me to develop policies by about 30%.

Happy securing,

Adam

Beating EC2 Security Groups

 


Today I’ll show you how to pass traffic through an EC2 Security Group that’s configured not to allow that traffic.

This isn’t esoteric hacking, it’s a detail in the difference between config and state that’s easy to miss when you’re operating an infrastructure.

Like I showed in a previous post, EC2 Security Groups are stateful. They know the difference between the first packet of a new connection and packets that are part of connections that are already established.

This statefulness is why you can let host A SSH to host B just by allowing outgoing SSH on A’s SG and incoming SSH on B’s SG. B doesn’t need to allow outgoing SSH because it knows the return traffic is part of a connection that was already allowed. Similarly for A and incoming SSH.

Here’s the detail of today’s post: if the Security Group sees traffic as part of an established connection, it’ll allow it even if its rules say not to. Ok now let’s break a Security Group.

The Lab

Two hosts, testa and testb. One SG for each, both allowing all outgoing traffic. Testb’s SG allows incoming TCP on port 4321 (an arbitrary high port I picked for this test):

[Screenshot: traffic allowed]

To test traffic flow, I’m going to use nc. It’s a common Linux utility that sends and receives TCP traffic:

  • Listen: nc -l [port]
  • Send: nc [host] [port]

Test Steps:

(screenshots of shell output below)

  1. Listen on port 4321 on testb.
  2. Start a connection from testa to port 4321 on testb.
  3. Send a message. It’s delivered, as expected.
  4. Remove testb’s SG rule allowing port 4321: [Screenshot: traffic denied]
  5. Send another message through the connection. It will get through! There’s no rule to allow it, but it still gets through.

WAT.

To show nothing else was going on, let’s redo the test with the security group as it is now (no rule allowing 4321).

  1. Quit nc on testa to close the connection. You’ll see it also close on testb.
  2. Listen on port 4321 on testb.
  3. Start a connection from testa to port 4321 on testb.
  4. Send a message. Not delivered. This time there was no established connection, so the traffic was compared to the SG’s rules. There was no rule to allow it, so it was denied.

Testb Output

(where we listened)

[Screenshot: testb shell output]

Only two messages got through.

Testa Output

(where we sent)

[Screenshot: testa shell output]

We sent three messages. The last two were sent while the SG had the same rules (no rule allowing 4321), but the first of those was allowed and the second was denied.

Beware!

The rules in EC2 Security Groups don’t apply to open (established) TCP connections. If you need to ensure traffic isn’t flowing between two instances you can’t just remove rules from your SGs. You have to close all open connections.

Happy securing,

Adam

The Fallacy of Rest

Hello!

A while back I made a bad scheduling mistake. I knew about the anti-pattern that caused it, but didn’t see myself using it. It forced me to push out dates, which cost me some money.

Later I looked back to see what went wrong. It was exactly what I have advised others not to do. It’s easy to miss! I’m writing this article to re-expose the anti-pattern I used.

The project was Move to a New City. I would be taking my job with me. This is the schedule I wrote:

  • Week 1
    • Pack
    • Work
  • Week 2
    • Weekdays
      • Pack
      • Work
      • Clean
    • Weekend
      • Clean
      • Say goodbye to friends
  • Week 3
    • Monday (Vacation Day)
      • Exercise and rest
      • Say goodbye to friends
    • Tuesday (Vacation Day)
      • Return keys
      • Drive to new city (5 hours on the road)
      • Check in to AirBnB
      • Hang out with friend who lives in new city
    • Wednesday through Friday
      • Work
      • Look at new housing

Seems fine! I even budgeted time to exercise.

Tuesday of week 3. 100% on schedule. It’s bedtime and I’m watching an episode of The Dick Van Dyke Show on my laptop and laughing myself to sleep with Mary Tyler Moore’s performance. I feel awesome. I sleep like I’ve just run a marathon.

Wednesday. Mild headache (whatever – I’m an engineer, we get headaches). I catch up on work, message about a couple rentals, and attend the morning meetings. As the meetings are wrapping up I get a reply on a rental with a proposed time to view it. I can just barely make it, so I head out.

See the mistake yet? I still hadn’t. Wednesday was a busy day and I felt rushed, but I’ve had lots of busy days. I just kept going. I didn’t make the mistake on Wednesday.

That afternoon I got one more email about a rental. It was a wafer-thin mint (see Monty Python’s The Meaning of Life ⬅️ this is how I am making the post about Python). Suddenly getting through the rest of my inbox felt like climbing a mountain. I was burnt out.

The mistake happened when I first wrote the schedule. Here’s the fallacy I used:

People are like horses. Rest them two hours a day and one full day every week or so and they’re fine. Feed and water three times a day.

People are not like horses. They can’t sustain themselves on periodic rest intervals.

Here’s how people work:

Productive workers have a budget of hours per week. When those hours are spent they spend themselves to keep going. Once too much of themselves is gone, they stop producing.

I wrote a schedule in the mindset of making sure I had rest intervals, but I should have figured out the hours needed and divided that by my sustainable weekly hours (a number I’ve learned during two decades of working). That would be the total weeks really needed to complete the move.

Going back over the hours I spent I found I had scheduled 200% of my sustainable capacity and had expected to sustain that for most of a month. (╯°□°)╯︵ ┻━┻
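
With made-up but representative numbers, the arithmetic I should have done up front looks something like this:

import math

# Hypothetical numbers, just to show the calculation.
hours_of_real_work = 300          # packing, cleaning, goodbyes, the job, the housing search
sustainable_hours_per_week = 50   # known from my own history of work

weeks_needed = math.ceil(hours_of_real_work / sustainable_hours_per_week)
print(weeks_needed)  # 6 weeks of real capacity, not the 3 I scheduled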

Another way to look at my mistake is that I didn’t count saying goodbye to friends as work (just like I sometimes forget to count attending meetings as work). In the context of human capacity, leaving behind your friends is absolutely work (just like sitting in a frustrating meeting is). It drains your budget of hours. If you do too much of it, you exhaust.

To write a schedule that workers can reliably complete, budget based on what workers can do per week and make sure you get that amount from their real history of work. Don’t make it up, look back at the past and compute it.

I’m going to bed. Happy scheduling!

Adam

3 Tools to Validate CloudFormation

Hello!

Note: If you just want the script and don’t need the background, go to the gist.

If you found this page, SEO means you probably already found the AWS page on validating CloudFormation templates. If you haven’t, read that first. It’s a better starting place.

I run three tools before applying CF templates.

#1 AWS CLI’s validator

This is the native tool. It’s ok. It’s really only a syntax checker; there are plenty of errors you won’t see until you apply a template to a stack. Still, it’s fast and catches some things.

aws cloudformation validate-template --template-body file://./my_template.yaml

Notes:

  • The CLI has to be configured with access keys or it won’t run the validator.
  • If the template is JSON, this will ignore some requirements (e.g. it’ll allow trailing commas). However, the CF service ignores the same things.

#2 Python’s JSON library

Because the AWS CLI validator ignores some JSON requirements, I like to pass JSON templates through Python’s parser to make sure they’re valid. In the past, I’ve had to do things like load and search templates for unused parameters, etc. That’s not ideal but it’s happened a couple times while doing cleanup and refactoring of legacy code. It’s easier if the JSON is valid JSON.

It’s fiddly to run this in a shell script. I do it with a heredoc so I don’t have to write multiple scripts to the filesystem:

python - <<END
import json
with open('my_template.json') as f:
    json.load(f)
END

Notes:

  • I use Python for this because it’s a dependency of the AWS CLI so I know it’s already installed. You could use jq or another tool, though.
  • I don’t do the YAML equivalent of this because it errors on CF-specific syntax like !Ref.

#3 cfn-nag

This is a linter for CloudFormation. It’s not perfect. I’ve seen it generate false positives like “don’t use * in IAM policy resources” even when * is the only option because it’s all that’s supported by the service I’m writing a policy for. Still, it’s one more way to catch things before you deploy, and it catches some good stuff.

cfn_nag_scan --input-path my_template.yaml

Notes:

  • Annoyingly, this is a Ruby gem so you need a new dependency chain to install it. I highly recommend setting up RVM and creating a gemset to isolate this from your system and other projects (just like you’d do with a Python venv).

Happy automating!

Adam