CF Custom Resources: Avoiding the Two-Hour Exception Timeout

There’s a gotcha when writing CloudFormation Custom Resources that’s easy to miss, and if you miss it your stack can get stuck, ignoring its timeout setting. It’ll fail on its own after an hour, but if it then tries to roll back you have to wait a second hour. Here’s how to avoid that.

This post assumes you’re already working with Custom Resources and that yours are backed by Lambda.

Here’s an empty custom resource:

import logging
import cfnresponse

def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    if event['RequestType'] == 'Delete':
        logger.info('Deleted!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
        return

    logger.info('It worked!')
    cfnresponse.send(event, context, cfnresponse.SUCCESS, {})

It’s a successful no-op:

[Screenshot: SuccessfulNoOp]

Now let’s add an exception:

import logging
import cfnresponse

def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    if event['RequestType'] == 'Delete':
        logger.info('Deleted!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
        return

    # Simulate an unexpected failure; the lines below are never reached.
    raise Exception
    logger.info('It worked!')
    cfnresponse.send(event, context, cfnresponse.SUCCESS, {})

We can see the exception in the logs:

[Screenshot: ExceptionThreeRetries]

But then the stack gets stuck because the cfnresponse callback never happened, so CF doesn’t know there was a problem:

[Screenshot: FailureTimeouts]

It took exactly an hour to fail, which suggests CF hit some internal, fallback timeout. My stack timeout was set to five minutes. We can see it retry the Lambda function once a minute for three minutes, but then it never tries again in the remaining 57 minutes. I got the same delays in reverse when it tried to roll back (which is really just another update, back to the previous state). And since the rollback failed, I had to manually edit the Lambda function code and remove the exception to get it to finish rolling back.

Maybe this is a bug? Either way, there’s a workaround.

You should usually only catch specific errors that you know how to handle. It’s an anti-pattern to use except Exception. But in this case we need to guarantee that the callback always happens. In this one situation (not in general) we need to catch all exceptions:

import logging
import cfnresponse

def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    try:
        if event['RequestType'] == 'Delete':
            logger.info('Deleted!')
            cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
            return

        # Simulate an unexpected failure; the catchall below reports it.
        raise Exception
        logger.info('It worked!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except Exception:
        logger.exception('Signaling failure to CloudFormation.')
        cfnresponse.send(event, context, cfnresponse.FAILED, {})

Now, the failure is visible to CF and it doesn’t wait:

[Screenshot: ExceptionHandled]

You should use this pattern in every Custom Resource: catch all exceptions and return a FAILED result to CF. You can still catch more specific exceptions inside the catchall try/except, ones specific to the feature you’re implementing, but you need that catchall to ensure the result returns when the unexpected happens.
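If you have several Custom Resources, one way to avoid repeating the catchall is a decorator. This is only a sketch under some assumptions: always_respond is a made-up name, and the send function is passed in as an argument so the example runs outside Lambda. In Lambda it would be cfnresponse.send, and the 'SUCCESS' and 'FAILED' strings stand in for cfnresponse.SUCCESS and cfnresponse.FAILED.

```python
import functools
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def always_respond(send):
    """Hypothetical helper: wrap a handler so a result is always sent.

    `send` is assumed to take (event, context, status, data), the same
    leading arguments cfnresponse.send takes.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(event, context):
            try:
                # The wrapped function does the real work and may return
                # a dict of response data (or None).
                data = func(event, context) or {}
                send(event, context, 'SUCCESS', data)
            except Exception:
                # Same catchall as above: log it, then tell CF we failed.
                logger.exception('Signaling failure to CloudFormation.')
                send(event, context, 'FAILED', {})
        return wrapper
    return decorator
```

With something like this, you'd decorate handler with @always_respond(cfnresponse.send) and drop the explicit send calls from the function body. The tradeoff is that the callback is less visible when reading the handler, so it may not be worth it for a single resource.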

Happy automating!

Adam

Securing IAM Policies

Since the beginning, writing IAM policies with the minimum necessary permissions has been hard. Some services don’t have resource-level permissions (you have to grant to *), but then later they do. When a service has resource-level permissions, it may only be for some of its permissions (the rest still need *). Some services have their own Condition Operators (separate from the global ones) that may or may not help you tighten control. Et cetera. The details are documented differently for each service and it’s a lot of hunting and testing to try to put together a tight policy.

Amazon made it easier! There’s new magic in the IAM UI to help you create policies. It has some limitations, but it’s a big improvement. Here are some of the things it can do that I used to have to do myself:

  • Knows which S3 permissions require the resource list to include a bucket name and which require the bucket name and an object path.
  • Tries to group permissions and resources into statements when it results in equivalent access (but sometimes ends up granting extra access, see below).
  • Knows when a service doesn’t support resource-level permissions.
  • Knows about the Condition Operators specific to each service (not just the global ones).

There are some limitations:

  • Doesn’t deduplicate. If you add permissions, it doesn’t go back and put them into existing statements; it just adds new statements that may duplicate parts of old ones.
  • Only generates JSON, so if you’re writing a YAML CloudFormation template you should translate.
  • Seems to have limited form validation on Condition Operators. You can put in strings that will never match because the API calls for that service can’t contain what you entered (making the statement a no-op).
  • Can end up grouping permissions in a way that makes some resource restrictions meaningless and grants more access than might be expected.
  • Sometimes it messes up the syntax. Seems to happen if you don’t put exactly what it expects into the forms.

So there are a few problems, but this is still way better than it was before! My plan is to use the visual editor to write policies, then go through and touch them up afterward. Based on what I’ve seen so far, this cuts the time it takes me to develop policies by about 30%.
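Part of that touch-up can be scripted. Here's a rough sketch of how I'd collapse the duplicate statements the editor leaves behind. It's not anything the IAM UI provides: merge_statements is a made-up helper, it assumes Statement is a list, and it only merges statements whose Effect, Resource, and Condition are identical (it unions their actions, which grants the same access). Anything using other keys, like NotAction or Principal, is passed through untouched, and Sids on merged statements are dropped.

```python
import json

# Only statements made up of these keys are candidates for merging.
MERGEABLE_KEYS = {"Sid", "Effect", "Action", "Resource", "Condition"}
REQUIRED_KEYS = {"Effect", "Action", "Resource"}


def _as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    return [value] if isinstance(value, str) else list(value)


def merge_statements(policy):
    """Return a copy of the policy with equivalent statements combined."""
    merged = {}       # grouping key -> combined statement
    passthrough = []  # statements this sketch does not try to merge
    for stmt in policy["Statement"]:
        keys = set(stmt)
        if not keys <= MERGEABLE_KEYS or not REQUIRED_KEYS <= keys:
            passthrough.append(stmt)
            continue
        resources = sorted(set(_as_list(stmt["Resource"])))
        key = (
            stmt["Effect"],
            tuple(resources),
            json.dumps(stmt.get("Condition"), sort_keys=True),
        )
        if key not in merged:
            merged[key] = {
                "Effect": stmt["Effect"],
                "Action": set(),
                "Resource": resources,
            }
            if "Condition" in stmt:
                merged[key]["Condition"] = stmt["Condition"]
        merged[key]["Action"].update(_as_list(stmt["Action"]))

    statements = []
    for stmt in merged.values():
        stmt["Action"] = sorted(stmt["Action"])  # sets are not JSON
        statements.append(stmt)
    return {**policy, "Statement": statements + passthrough}
```

This only removes exact overlaps. It won't catch the over-broad grouping problem from the list above, so the result still needs a read-through.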

Happy securing,

Adam

3 Tools to Validate CloudFormation

Hello!

Note: If you just want the script and don’t need the background, go to the gist.

If you found this page, SEO means you probably already found the AWS page on validating CloudFormation templates. If you haven’t, read that first. It’s a better starting place.

I run three tools before applying CF templates.

#1 AWS CLI’s validator

This is the native tool. It’s ok. It’s really only a syntax checker; there are plenty of errors you won’t see until you apply a template to a stack. Still, it’s fast and catches some things.

aws cloudformation validate-template --template-body file://./my_template.yaml

Notes:

  • The CLI has to be configured with access keys or it won’t run the validator.
  • If the template is JSON, this will ignore some requirements (e.g. it’ll allow trailing commas). However, the CF service ignores the same things.

#2 Python’s JSON library

Because the AWS CLI validator ignores some JSON requirements, I like to pass JSON templates through Python’s parser to make sure they’re valid. In the past, I’ve had to do things like load and search templates for unused parameters, etc. That’s not ideal but it’s happened a couple times while doing cleanup and refactoring of legacy code. It’s easier if the JSON is valid JSON.

It’s fiddly to run this in a shell script. I do it with a heredoc so I don’t have to write multiple scripts to the filesystem:

python - <<END
import json
with open('my_template.json') as f:
    json.load(f)
END

Notes:

  • I use Python for this because it’s a dependency of the AWS CLI so I know it’s already installed. You could use jq or another tool, though.
  • I don’t do the YAML equivalent of this because a generic YAML parser errors on CF-specific tags like !Ref.
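Once the JSON is valid, searching a template gets simple. As an illustration of the unused-parameter cleanup mentioned above, here's a minimal sketch. find_unused_parameters is a made-up name, and it assumes parameters are only referenced through Ref or ${Name} inside Fn::Sub, so treat its output as a list of candidates to check by hand.

```python
import json
import re

# Matches ${Name} in Fn::Sub strings, but not ${!Literal} or ${Res.Attr}.
SUB_VARIABLE = re.compile(r"\$\{([^!}.]+)\}")


def find_unused_parameters(template):
    """Return declared parameter names that nothing appears to reference."""
    declared = set(template.get("Parameters", {}))
    used = set()

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "Ref" and isinstance(value, str):
                    used.add(value)
                elif key == "Fn::Sub":
                    # Fn::Sub is either a string or [string, {variables}].
                    text = value[0] if isinstance(value, list) and value else value
                    if isinstance(text, str):
                        used.update(SUB_VARIABLE.findall(text))
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    # Search every section except the parameter declarations themselves.
    for section, body in template.items():
        if section != "Parameters":
            walk(body)
    return declared - used


def find_unused_in_file(path):
    """Load a JSON template from disk and check it."""
    with open(path) as f:
        return find_unused_parameters(json.load(f))
```

Calling find_unused_in_file('my_template.json') returns a set of names. Because json.load raises on invalid JSON, this doubles as the validity check from the heredoc.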

#3 cfn-nag

This is a linter for CloudFormation. It’s not perfect. I’ve seen it generate false positives like “don’t use * in IAM policy resources” even when * is the only option because it’s all that’s supported by the service I’m writing a policy for. Still, it’s one more way to catch things before you deploy, and it catches some good stuff.

cfn_nag_scan --input-path my_template.yaml

Notes:

  • Annoyingly, this is a Ruby gem so you need a new dependency chain to install it. I highly recommend setting up RVM and creating a gemset to isolate this from your system and other projects (just like you’d do with a Python venv).
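To tie the three tools together, here's one way the checks could be chained from Python. This is a sketch and not the script from the gist: build_checks and run_checks are made-up names, it assumes each tool exits non-zero when it finds a problem, and it only runs the JSON parse on files ending in .json.

```python
import subprocess
import sys


def build_checks(template_path):
    """Return the commands to run, in order, for one template file."""
    checks = [
        ["aws", "cloudformation", "validate-template",
         "--template-body", f"file://{template_path}"],
    ]
    if template_path.endswith(".json"):
        # Same idea as the heredoc: parse the file with Python's json.
        checks.append([
            sys.executable, "-c",
            "import json, sys; json.load(open(sys.argv[1]))",
            template_path,
        ])
    checks.append(["cfn_nag_scan", "--input-path", template_path])
    return checks


def run_checks(template_path):
    """Run each check and stop at the first one that fails."""
    for command in build_checks(template_path):
        if subprocess.run(command).returncode != 0:
            return False
    return True
```

run_checks('my_template.yaml') returns True only if every tool passes. If the AWS CLI or cfn-nag isn't installed, subprocess.run raises FileNotFoundError instead of returning a result.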

Happy automating!

Adam