CF Custom Resources: Avoiding the Two Hour Exception Timeout

ManagementTools_GRAYSCALE_AWSCloudFormation

There’s a gotcha when writing CloudFormation Custom Resources that’s easy to miss and if you miss it your stack can get stuck, ignoring its timeout setting. It’ll fail on its own after an hour, but if it tries to roll back you have to wait a second hour. Here’s how to avoid that.

This post assumes you’re already working with Custom Resources and that yours are backed by lambda.

Here’s an empty custom resource:

import logging
import cfnresponse

def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    if event['RequestType'] == 'Delete':
        logger.info('Deleted!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
        return

    logger.info('It worked!')
    cfnresponse.send(event, context, cfnresponse.SUCCESS, {})

It’s a successful no-op:

SuccessfulNoOp

Now let’s add an exception:

import logging
import cfnresponse

def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    if event['RequestType'] == 'Delete':
        logger.info('Deleted!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
        return

    raise Exception
    logger.info('It worked!')
    cfnresponse.send(event, context, cfnresponse.SUCCESS, {})

We can see the exception in the logs:

ExceptionThreeRetries

But, then the stack gets stuck because the cfnresponse callback never happened and CF doesn’t know there was a problem:

FailureTimeouts

It took exactly an hour to fail, which suggests CF hit some internal, fallback timeout. My stack timeout was set to five minutes. We can see it retry the lambda function once a minute for three minutes, but then it never tries again in the remaining 57 minutes. I got the same delays in reverse when it tried to roll back (which is really just another update to the previous state). And, since the rollback failed, I had to manually edit the lambda function code and remove the exception to get it to finish rolling back.

Maybe this is a bug? Either way, there’s a workaround.

You should usually only catch specific errors that you know how to handle. It’s an anti-pattern to use except Exception. But, in this case we need to guarantee that the callback always happens. In this one situation (not in general) we need to catch all exceptions:

import logging
import cfnresponse

def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    try:
        if event['RequestType'] == 'Delete':
            logger.info('Deleted!')
            cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
            return

        raise Exception
        logger.info('It worked!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except Exception:
        logger.exception('Signaling failure to CloudFormation.')
        cfnresponse.send(event, context, cfnresponse.FAILED, {})

Now, the failure is visible to CF and it doesn’t wait:

ExceptionHandled.png

You should use this pattern in every Custom Resource: catch all exceptions and return a FAILED result to CF. You can still catch more specific exceptions inside the catchall try/except, ones specific to the feature you’re implementing, but you need that catchall to ensure the result returns when the unexpected happens.

Happy automating!

Adam

Lamba: Filter boto3’s Logs into CloudWatch

Compute_GRAYSCALE_AWSLambda

Good morning!

For those custom cases that don’t fit into Terraform or CloudFormation, a little bit of Python and some boto3 in a lambda function can save you. Lambda captures the output of both print() and logging.Logger calls into CloudWatch so it’s easy to log information about what your code is doing. When things go wrong, though, I often find that just the output I wrote doesn’t give me enough to diagnose the problem. In those cases, it’s helpful to see the log output both for your code and boto3. Here’s how you do that.

Use the logging library. It’s a Python core library that provides standard features like timestamped prefixes and support for levels (e.g. INFO or DEBUG). For simple deployment helpers this is usually all you need:

logger = logging.getLogger(logging.INFO)
logger.info('Message at the INFO level.')
logger.debug('Message at the DEBUG level.')

This sets the root logger (which sees all log messages) to the INFO level. Normally you’d have to configure the root logger, but lambda does that automatically (which is actually annoying if you need to change your formatter, but that’s for another post). Now, logger.info() calls will show up in the logs and logger.debug() calls won’t. If you increase the level to DEBUG you’ll see both.

Because logging is the standard Python way to handle log output, maintainers of libraries like boto3 use it throughout their code to show what the library is doing (and they’re usually smart about choosing what to log at each level). By setting a level on the root logger, you’re choosing which of your output to capture and which of boto3’s output to capture. Powerful when you’re diagnosing a failure.

Here’s a demo function to show how the output looks. You might notice that it puts the logger setup calls inside the handler even though the AWS docs tell you to put them under the import. Function calls made directly in modules (e.g. not inside functions declared within the module) are import-side effects and import side-effects are an anti-pattern. I put the calls in the handler so they only run when the handler is called. This isn’t likely to matter much in a lambda function, but I like to stick to good patterns.

import logging

import boto3


def lambda_handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    client = boto3.client('sts')
    account_id = client.get_caller_identity()['Account']

    logger.info('Getting account ID...')
    logger.debug('Account ID: {}'.format(account_id))
    return account_id

This is the output when run at the INFO level:

START RequestId: a61471fe-c3fd-11e8-9f43-bdb22e22a203 Version: $LATEST
[INFO]	2018-09-29T15:38:01.882Z	a61471fe-c3fd-11e8-9f43-bdb22e22a203	Found credentials in environment variables.
[INFO]	2018-09-29T15:38:02.83Z	a61471fe-c3fd-11e8-9f43-bdb22e22a203	Starting new HTTPS connection (1): sts.amazonaws.com
[INFO]	2018-09-29T15:38:02.531Z	a61471fe-c3fd-11e8-9f43-bdb22e22a203	Getting account ID...
END RequestId: a61471fe-c3fd-11e8-9f43-bdb22e22a203
REPORT RequestId: a61471fe-c3fd-11e8-9f43-bdb22e22a203	Duration: 734.96 ms	Billed Duration: 800 ms Memory Size: 128 MB	Max Memory Used: 29 MB

This is the output when run at the DEBUG level:

START RequestId: 9ea3bbef-c3fe-11e8-8eb1-730a799b5405 Version: $LATEST
[DEBUG]	2018-09-29T15:44:58.850Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.880Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable config_file from defaults.
[DEBUG]	2018-09-29T15:44:58.881Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable credentials_file from defaults.
[DEBUG]	2018-09-29T15:44:58.881Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable data_path from defaults.
[DEBUG]	2018-09-29T15:44:58.881Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable region from environment with value 'us-west-2'.
[DEBUG]	2018-09-29T15:44:58.900Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.900Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable ca_bundle from defaults.
[DEBUG]	2018-09-29T15:44:58.900Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.900Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable api_versions from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable credentials_file from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable config_file from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable metadata_service_timeout from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.901Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable metadata_service_num_attempts from defaults.
[DEBUG]	2018-09-29T15:44:58.942Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:58.960Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Looking for credentials via: env
[INFO]	2018-09-29T15:44:58.960Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Found credentials in environment variables.
[DEBUG]	2018-09-29T15:44:58.961Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading JSON file: /var/runtime/botocore/data/endpoints.json
[DEBUG]	2018-09-29T15:44:59.1Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading variable profile from defaults.
[DEBUG]	2018-09-29T15:44:59.20Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event choose-service-name: calling handler 
[DEBUG]	2018-09-29T15:44:59.60Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading JSON file: /var/runtime/botocore/data/sts/2011-06-15/service-2.json
[DEBUG]	2018-09-29T15:44:59.82Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event creating-client-class.sts: calling handler 
[DEBUG]	2018-09-29T15:44:59.100Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	The s3 config key is not a dictionary type, ignoring its value of: None
[DEBUG]	2018-09-29T15:44:59.103Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Setting sts timeout as (60, 60)
[DEBUG]	2018-09-29T15:44:59.141Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Loading JSON file: /var/runtime/botocore/data/_retry.json
[DEBUG]	2018-09-29T15:44:59.141Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Registering retry handlers for service: sts
[DEBUG]	2018-09-29T15:44:59.160Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event before-parameter-build.sts.GetCallerIdentity: calling handler 
[DEBUG]	2018-09-29T15:44:59.161Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Making request for OperationModel(name=GetCallerIdentity) (verify_ssl=True) with params: {'url_path': '/', 'query_string': '', 'method': 'POST', 'headers': {'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', 'User-Agent': 'Boto3/1.7.74 Python/3.6.1 Linux/4.14.62-65.117.amzn1.x86_64 exec-env/AWS_Lambda_python3.6 Botocore/1.10.74'}, 'body': {'Action': 'GetCallerIdentity', 'Version': '2011-06-15'}, 'url': 'https://sts.amazonaws.com/', 'context': {'client_region': 'us-west-2', 'client_config': , 'has_streaming_input': False, 'auth_type': None}}
[DEBUG]	2018-09-29T15:44:59.161Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event request-created.sts.GetCallerIdentity: calling handler 
[DEBUG]	2018-09-29T15:44:59.161Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event choose-signer.sts.GetCallerIdentity: calling handler 
[DEBUG]	2018-09-29T15:44:59.162Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Calculating signature using v4 auth.
[DEBUG]	2018-09-29T15:44:59.180Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	CanonicalRequest:
POST
/

content-type:application/x-www-form-urlencoded; charset=utf-8
host:sts.amazonaws.com
x-amz-date:20180929T154459Z
x-amz-security-token:FQoGZXIvYXdzEKn//////////wEaDOOlIItIhtRakeAyfCLrAWPZXQJFkNrDZNa4Bny102eGKJ5KWD0F+ixFqZaW+A9mgadICpLRxBG4JGUzMtPTDeqxPoLT1qnS6bI/jVmXXUxjVPPMRiXdIlP+li0eFyB/xOK+PN/DOiByee0eu6bjQmkjoC3P5MREvxeanPY7hpgXNO52jSBPo8LMIdAcjCJxyRF7GHZjtZGAMARQWng6DJa9RAiIbxOmXpSbNGpABBVg/TUt8XMUT+p9Lm2Txi10P0ueu1n5rcuxJdBV8Jr/PUF3nZY+/k7MzOPCnzZNqVgpDAQbwby+AVIQcvVwaKsXePqubCqBTHxoh/Mo0ay+3QU=

content-type;host;x-amz-date;x-amz-security-token
ab821ae955788b0e33ebd34c208442ccfc2d406e2edc5e7a39bd6458fbb4f843
[DEBUG]	2018-09-29T15:44:59.181Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	StringToSign:
AWS4-HMAC-SHA256
20180929T154459Z
20180929/us-east-1/sts/aws4_request
7cf0af0e8f55fb1b9c0009104aa8f141097f00fea428ddf1654321e7054a920d
[DEBUG]	2018-09-29T15:44:59.181Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Signature:
c00de0a12c9ee0fce348df452f2833749b854915db58f8d106e3166545a70c43
[DEBUG]	2018-09-29T15:44:59.183Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Sending http request: 
[INFO]	2018-09-29T15:44:59.201Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Starting new HTTPS connection (1): sts.amazonaws.com
[DEBUG]	2018-09-29T15:44:59.628Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	"POST / HTTP/1.1" 200 461
[DEBUG]	2018-09-29T15:44:59.628Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Response headers: {'x-amzn-requestid': '9f421e56-c3fe-11e8-b622-2d5da14a8dc9', 'content-type': 'text/xml', 'content-length': '461', 'date': 'Sat, 29 Sep 2018 15:44:58 GMT'}
[DEBUG]	2018-09-29T15:44:59.640Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Response body:
b'\n \n arn:aws:sts::268133297303:assumed-role/demo-boto3-logging/demo-boto3-logging\n AROAITTVSA67NGZPH2QZI:demo-boto3-logging\n 268133297303\n \n \n 9f421e56-c3fe-11e8-b622-2d5da14a8dc9\n \n\n'
[DEBUG]	2018-09-29T15:44:59.640Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Event needs-retry.sts.GetCallerIdentity: calling handler 
[DEBUG]	2018-09-29T15:44:59.641Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	No retry needed.
[INFO]	2018-09-29T15:44:59.641Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Getting account ID...
[DEBUG]	2018-09-29T15:44:59.641Z	9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Account ID: 268133297303
END RequestId: 9ea3bbef-c3fe-11e8-8eb1-730a799b5405
REPORT RequestId: 9ea3bbef-c3fe-11e8-8eb1-730a799b5405	Duration: 813.73 ms	Billed Duration: 900 ms Memory Size: 128 MB	Max Memory Used: 29 MB

boto3 can be very verbose in DEBUG so I recommend staying at INFO unless you’re actively troubleshooting.

Happy debugging!

Adam

Securing IAM Policies

SecurityIdentityCompliance_GRAYSCALE_IAM

Since the beginning, writing IAM policies with the minimum necessary permissions has been hard. Some services don’t have resource-level permissions (you have to grant to *), but then later they do. When a service has resource-level permissions, it may only be for some of its permissions (the rest still need *). Some services have their own Condition Operators (separate from the global ones) that may or may not help you tighten control. Et cetera. The details are documented differently for each service and it’s a lot of hunting and testing to try to put together a tight policy.

Amazon made it easier! There’s new magic in the IAM UI to help you create policies. It has some limitations, but it’s a big improvement. Here are some of the things it can do that I used to have to do myself:

  • Knows which S3 permissions require the resource list to include a bucket name and which require the bucket name and an object path.StatementSplitting
  • Tries to group permissions and resources into statements when it results in equivalent access (but sometimes ends up granting extra access, see below).StatementGrouping
  • Knows when a service doesn’t support resource-level permissions.ResourceSpecificPermissionsDetection
  • Knows about the Condition Operators specific to each service (not just the global ones).ConditionOperators

There are some limitations:

  • Doesn’t deduplicate. If you add permissions it doesn’t go back and put them into existing statements, it just adds new statements that may duplicate parts of old ones.
  • Only generates JSON, so if you’re writing a YAML CloudFormation template you should translate.
  • Seems to have limited form validation on Condition Operators. You can put in strings that will never match because the API calls for that service can’t contain what you entered (making the statement a no-op).
  • Can end up grouping permissions in a way that makes some resource restrictions meaningless and grants more access than might be expected.TooMuchPermission
  • Sometimes it messes up the syntax. Seems to happen if you don’t put exactly what it expects into the forms.Bug

 

So there are a few problems, but this is still way better than it was before! My plan is to use the visual editor to write policies, then go through and touch it up afterward. Based on what I’ve seen so far, this cuts the time it takes me to develop policies by about 30%.

Happy securing,

Adam

Beating EC2 Security Groups

 

NetworkingContentDelivery_GRAYSCALE_AmazonVPC

Today I’ll show you how to pass traffic through an EC2 Security Group that’s configured not to allow that traffic.

This isn’t esoteric hacking, it’s a detail in the difference between config and state that’s easy to miss when you’re operating an infrastructure.

Like I showed in a previous post, EC2 Security Groups are stateful. They know the difference between the first packet of a new connection and packets that are part of connections that are already established.

This statefulness is why you can let host A SSH to host B just by allowing outgoing SSH on A’s SG and incoming SSH on B’s SG. B doesn’t need to allow outgoing SSH because it knows the return traffic is part of a connection that was already allowed. Similarly for A and incoming SSH.

Here’s the detail of today’s post: if the Security Group sees traffic as part of an established connection, it’ll allow it even if its rules say not to. Ok now let’s break a Security Group.

The Lab

Two hosts, testa and testb. One SG for each, both allowing all outgoing traffic. Testb’s SG allows incoming TCP on port 4321 (a random ephemeral port I’m using for this test):

TrafficAllowed

To test traffic flow, I’m going to use nc. It’s a common Linux utility that sends and receives TCP traffic:

  • Listen: nc -l [port]
  • Send: nc [host] [port]

Test Steps:

(screenshots of shell output below)

  1. Listen on port 4321 on testb.
  2. Start a connection from testa to port 4321 on testb.
  3. Send a message. It’s delivered, as expected.
  4. Remove testb’s SG rule allowing port 4321:TrafficDenied
  5. Send another message through the connection. It will get through! There’s no rule to allow it, but it still gets through.

WAT.

To show nothing else was going on, let’s redo the test with the security group as it is now (no rule allowing 4321).

  1. Quit nc on testa to close the connection. You’ll see it also close on testb.
  2. Listen on port 4321 on testb.
  3. Start a connection from tests a to port 4321 on testb.
  4. Send a message. Not delivered. This time there was no established connection so the traffic was compared to the SGs rules. There was no rule to allow it, so it was denied.

Testb Output

(where we listened)

testb

Only two messages got through.

Testa Output

(where we sent)

testa

We sent three messages. The last two were sent while the SG had the same rules, but the first message was allowed and the second was denied.

Beware!

The rules in EC2 Security Groups don’t apply to open (established) TCP connections. If you need to ensure traffic isn’t flowing between two instances you can’t just remove rules from your SGs. You have to close all open connections.

Happy securing,

Adam

3 Tools to Validate CloudFormation

Hello!

Note: If you just want the script and don’t need the background, go to the gist.

If you found this page, SEO means you probably already found the AWS page on validating CloudFormation templates. If you haven’t, read that first. It’s a better starting place.

I run three tools before applying CF templates.

#1 AWS CLI’s validator

This is the native tool. It’s ok. It’s really only a syntax checker, there are plenty of errors you won’t see until you apply a template to a stack. Still, it’s fast and catches some things.

aws cloudformation validate-template --template-body file://./my_template.yaml

Notes:

  • The CLI has to be configured with access keys or it won’t run the validator.
  • If the template is JSON, this will ignore some requirements (e.g. it’ll allow trailing commas). However, the CF service ignores the same things.

#2 Python’s JSON library

Because the AWS CLI validator ignores some JSON requirements, I like to pass JSON templates through Python’s parser to make sure they’re valid. In the past, I’ve had to do things like load and search templates for unused parameters, etc. That’s not ideal but it’s happened a couple times while doing cleanup and refactoring of legacy code. It’s easier if the JSON is valid JSON.

It’s fiddly to run this in a shell script. I do it with a heredoc so I don’t have to write multiple scripts to the filesystem:

python - <<END
import json
with open('my_template.json') as f:
    json.load(f)
END

Notes:

  • I use Python for this because it’s a dependency of the AWS CLI so I know it’s already installed. You could use jq or another tool, though.
  • I don’t do the YAML equivalent of this because it errors on CF-specific syntax like !Ref.

#3 cfn-nag

This is a linter for CloudFormation. It’s not perfect. I’ve seen it generate false positives like “don’t use * in IAM policy resources” even when * is the only option because it’s all that’s supported by the service I’m writing a policy for. Still, it’s one more way to catch things before you deploy, and it catches some good stuff.

cfn_nag_scan --input-path my_template.yaml

Notes:

  • Annoyingly, this is a Ruby gem so you need a new dependency chain to install it. I highly recommend setting up RVM and creating a gemset to isolate this from your system and other projects (just like you’d do with a Python venv).

Happy automating!

Adam

Terraform: Get Data with Python

Update 2017-12-26: There’s now a more complete, step-by-step example of how to use terraform’s data resource, pip, and this decorator in the source.

Good morning!

Sometimes I have data I need to assemble during terraform’s apply phase, and I like to use Python helper scripts to do that. Awesomely, terraform natively supports using Python to populate the data resource:

data "external" "cars_count" {
  program = ["python", "${path.module}/get_cool_data.py"]

  query = {
    thing_to_count = "cars"
  }
}

output "cars_count" {
  value = "${data.external.cars_count.result.cars}"
}

A slick, easy way to drop out of terraform and use Python to grab what you need (although it can get you in to trouble if you abuse it).

The Python script has to follow a protocol that defines formats, error handling, etc. It’s minimal but it’s fiddly, plus if you need more than one external data script it’s better to modularize than copy and paste, so I wrote a pip-installable package with a decorator that implements the protocol for you:

from terraform_external_data import terraform_external_data

@terraform_external_data
def get_cool_data(query):
    return {query['thing_to_count']: '3'}

if __name__ == '__main__':
    get_cool_data()

And it’s available on PyPI! Just pip install terraform_external_data. Here’s the source.

Happy terraforming,

Adam

EC2: Stateful Statelessness

networkingcontentdelivery_grayscale_amazonvpc.png

Hello!

While studying for an AWS certification, I rediscovered a fiddly networking detail: although ICMP’s ping is stateless, EC2 security groups will pass return ping traffic even when only one direction is defined in their rules. I wanted to see this in action, so I built a lab.

If you just asked, “Wat❓”, keep reading. Skip to the next section if you just want the code.

Background

Network hosts using stateful protocols (like TCP) distinguish between packets that are part of an established connection and packets that are new. For example, when I SSH (which runs on TCP) from A to B:

  1. A asks B to start a new connection.
  2. B agrees.
  3. A and B exchange bunches of packets that are part of the connection they agreed to.
  4. A and B agree to close the connection.

There’s a difference between a new packet and a packet that’s part of an ongoing connection. That means the connection, and its packets, have state (e.g. new vs established). Stateful firewalls (vs stateless) are aware of this:

  1. A ask B to start a new connection.
  2. Firewalls in between allow these packets if there is an explicit rule allowing traffic from A to B.
  3. A and B exchange bunches of packets.
  4. Firewalls in between allow the packets from A to B because of the explicit rule above. However, they allow the return traffic from B to A even if there is no explicit rule to allow it. Since B agreed to the connection the firewall assumes that packets in that connection should be allowed.

In EC2, this is why you only need an outgoing rule on A’s Security Group (SG) and an incoming rule on B’s Security Group to SSH from A to B. EC2 SGs are stateful, and allow the return traffic implicitly.

Ok, here’s the gnarly bit. ICMP (the protocol behind ping) is stateless. Hosts don’t have a negotiation phase where the agree to establish a connection. They just send packets and hope. So, doesn’t that mean I need to write explicit firewall rules in the SGs to allow the return traffic? If the firewall can’t see the state of the connection, it won’t be able to implicitly figure out to allow that traffic, right?

Nope, they infer state based on timeouts and packet types. ICMP pings are ECHO requests answered by ECHO replies. If the SG has seen a request within the timeout, it makes the educated guess that replies are essentially part of “established” connections and allows them. This is what I wanted to see in action.

The Lab

I setup a VPC with two hosts, A (10.0.1.70) and B (10.0.2.35). They’re in different subnets but the ACLs allow all traffic so they don’t influence the test. Here are the SG rules for A (10.0.0.0/16 covers the entire VPC):

test_a_inboundtest_a_outbound

And the rules for B:

test_b_inboundtest_b_outbound

A allows outgoing ICMP to B, and B allows incoming ICMP from A. The return traffic is not allowed by any rules.

The Test Script

I didn’t find a way to send just replies without requests in Linux, so I bodged together a Python script:

"""
This is a stripped-down version of ping that allows you to send a reply without responding a request. This was needed
to test the details of how Amazon EC2 security groups handle state with ICMP traffic. You shouldn't use this for normal
pings.

The ping implementation was based on Samuel Stauffer's python-ping: https://github.com/samuel/python-ping (which only
works with Python 2).

This must be run as root.
You must tell the Linux kernel to ignore ICMP before you run this or it'll eat some of the traffic:
    echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_all
"""

import argparse, socket, struct, time

def get_arguments():
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--send-request', metavar='IP_ADDRESS', type=str, help='IP address to send ECHO request.')
    parser.add_argument('--receive', action='store_true', help='Wait for a reply.')
    parser.add_argument('--send-reply', metavar='IP_ADDRESS', type=str, help='IP address to send ECHO reply.')
    return parser.parse_args()

def receive(my_socket):
    while True:
        recPacket, addr = my_socket.recvfrom(1024)
        icmpHeader = recPacket[20:28]
        icmp_type, code, checksum, packetID, sequence = struct.unpack("bbHHh", icmpHeader)
        print('Received type {}.'.format(icmp_type))

def ping(my_socket, dest_addr, icmp_type):
    dest_addr = socket.gethostbyname(dest_addr)
    bytesInDouble = struct.calcsize("d")
    data = (192 - bytesInDouble) * "Q"
    data = struct.pack("d", time.time()) + data
    dummy_checksum = 1 & 0xffff
    dummy_id = 1 & 0xFFFF
    # Header is type (8), code (8), checksum (16), id (16), sequence (16)
    header = struct.pack("bbHHh", icmp_type, 0, socket.htons(dummy_checksum), dummy_id, 1)
    packet = header + data
    my_socket.sendto(packet, (dest_addr, 1))

if __name__ == '__main__':
    args = get_arguments()
    icmp = socket.getprotobyname("icmp")
    my_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
    if args.send_request:
        ping(my_socket, args.send_request, icmp_type=8)  # Type 8 is ECHO request.
    if args.receive:
        receive(my_socket)
    if args.send_reply:
        ping(my_socket, args.send_reply, icmp_type=0)  # Type 0 is ECHO reply.

You can skip reading the code, the important thing is that we can individually choose to listen for packets, send ECHO requests, or send ECHO replies:

python ping.py --help
usage: ping.py [-h] [--send-request IP_ADDRESS] [--receive]
               [--send-reply IP_ADDRESS]

optional arguments:
  -h, --help            show this help message and exit
  --send-request IP_ADDRESS
                        IP address to send ECHO request. (default: None)
  --receive             Wait for a reply. (default: False)
  --send-reply IP_ADDRESS
                        IP address to send ECHO reply. (default: None)<span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			></span>

The Experiment

SSH to each host and tell Linux to ignore ICMP traffic so I can use the script to capture it (see docstring in the script above):

sudo su -
echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_all
exit

Normal Ping

I send a request from A to B and expect the reply from B to A to be allowed. Here’s what happened:

Remember A is 10.0.1.70 and B is 10.0.2.35.

normal_ping_test_anormal_ping_test_b

Ok, nothing surprising. I sent a request from A to B and started listening on A, a little later I send a reply from B to A and it was allowed. You can do this test with the normal Linux ping command (but not until you tell the kernel to stop ignoring ICMP traffic). This test just validates that my bodged Python actually works.

Reply Only

First we wait a bit. The previous test sent a request from A to B, which started a timer in the SG. Until that timer expires, reply traffic will be allowed. We need to wait for that expiration before this next test is valid.

reply_only_ping_test_areply_only_ping_test_b

Boom! I start listening on A, without sending a request. On B I send a reply to A but it never arrives. The Security Group didn’t allow it. This demonstrates that EC2 Security Groups are inferring the state of ICMP pings by reading their type.

Other Tests

I also tried a couple other things that I’ll leave to you to reproduce in your own lab if you want to see them.

  • Start out like the normal test. Send a request from A to B and start listening on A. Then send several replies from B to A. They’re all allowed. This shows that the SG isn’t counting to ensure it only allows one reply for each request; if it has seen just one request within the timeout it allows replies even if there are multiple.
  • Edit the script above to set the hardcoded ID to be different on A than it is on B. Then nothing works at all. I’m not actually sure what causes this. Could be that the SG is looking at more than just the type, but it could also be in the kernel or the network drivers or somewhere else I haven’t thought of. If you figure it out, message me!

Conclusion

I had free time over the holidays this year! Realistically, understanding this demo isn’t a priority for doing good work on AWS. I just enjoy unwrapping black boxes to see how the parts move.

Be well!

Adam