CF Custom Resources: Avoiding the Two Hour Exception Timeout

ManagementTools_GRAYSCALE_AWSCloudFormation

There’s a gotcha when writing CloudFormation Custom Resources that’s easy to miss and if you miss it your stack can get stuck, ignoring its timeout setting. It’ll fail on its own after an hour, but if it tries to roll back you have to wait a second hour. Here’s how to avoid that.

This post assumes you’re already working with Custom Resources and that yours are backed by lambda.

Here’s an empty custom resource:

import logging
import cfnresponse

def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    if event['RequestType'] == 'Delete':
        logger.info('Deleted!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
        return

    logger.info('It worked!')
    cfnresponse.send(event, context, cfnresponse.SUCCESS, {})

It’s a successful no-op:

SuccessfulNoOp

Now let’s add an exception:

import logging
import cfnresponse

def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    if event['RequestType'] == 'Delete':
        logger.info('Deleted!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
        return

    raise Exception
    logger.info('It worked!')
    cfnresponse.send(event, context, cfnresponse.SUCCESS, {})

We can see the exception in the logs:

ExceptionThreeRetries

But, then the stack gets stuck because the cfnresponse callback never happened and CF doesn’t know there was a problem:

FailureTimeouts

It took exactly an hour to fail, which suggests CF hit some internal, fallback timeout. My stack timeout was set to five minutes. We can see it retry the lambda function once a minute for three minutes, but then it never tries again in the remaining 57 minutes. I got the same delays in reverse when it tried to roll back (which is really just another update to the previous state). And, since the rollback failed, I had to manually edit the lambda function code and remove the exception to get it to finish rolling back.

Maybe this is a bug? Either way, there’s a workaround.

You should usually only catch specific errors that you know how to handle. It’s an anti-pattern to use except Exception. But, in this case we need to guarantee that the callback always happens. In this one situation (not in general) we need to catch all exceptions:

import logging
import cfnresponse

def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    try:
        if event['RequestType'] == 'Delete':
            logger.info('Deleted!')
            cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
            return

        raise Exception
        logger.info('It worked!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except Exception:
        logger.exception('Signaling failure to CloudFormation.')
        cfnresponse.send(event, context, cfnresponse.FAILED, {})

Now, the failure is visible to CF and it doesn’t wait:

ExceptionHandled.png

You should use this pattern in every Custom Resource: catch all exceptions and return a FAILED result to CF. You can still catch more specific exceptions inside the catchall try/except, ones specific to the feature you’re implementing, but you need that catchall to ensure the result returns when the unexpected happens.

Happy automating!

Adam

Pear: A Better Way to Deploy to AWS

Update 29 September 2018: Since the release of AWS Step Functions, this pattern is out of date. Orchestrating deployment with a slim lambda function is still a solid pattern, but Step Functions could make the implementation simpler and could provide more features (like graphical display of deployment status). You can even see some of the ways this might work in AWS’s DevOps blog. Those implementation changes would drive some re-architecture, too. Because those are major changes, for now Pear is an interesting exploration of a pattern but isn’t ready for use in new deployments.

A while back I wanted to put a voice interface in front of my deployment automation. I think passwords on voice interfaces are annoying and aren’t secure, so I wanted an unauthenticated system. I didn’t want to lay awake at night worried about all the things people could break with it, so I set out to engineer a deployment infrastructure that I could put behind voice control and still feel was secure and reliable.

After a lot of learning (the journey towards this is where the Life Coach and the Better Alexa Quick Start came from) and several revisions, I had built my new infrastructure. For reasons I can’t remember, I called it Pear.

Pear is designed to make it easy to add slick features like voice interfaces while giving you enough control to stay secure and enough stability to operate in production. It’s meant for infrastructures too complex to fit in tools like Heroku or Elastic Beanstalk, for situations where you need to be able to turn all the knobs. I think it achieves those goals, so I decided to publish it. Check out the repo for an example implementation and more complete documentation of its features, but to give you a taste here’s a diagram of the basic architecture:

pear architecture

Cheers!

Adam