There’s a gotcha when writing CloudFormation Custom Resources that’s easy to miss, and if you miss it your stack can get stuck, ignoring its timeout setting. It will fail on its own after an hour, but if it then tries to roll back you have to wait a second hour. Here’s how to avoid that.
This post assumes you’re already working with Custom Resources and that yours are backed by Lambda.
Here’s an empty custom resource:
```python
import logging

import cfnresponse


def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    # Deletes get their own branch so the stack can always be torn down.
    if event['RequestType'] == 'Delete':
        logger.info('Deleted!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
        return

    logger.info('It worked!')
    # Tell CloudFormation the resource succeeded.
    cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
```
It’s a successful no-op.
Now let’s add an exception:
```python
import logging

import cfnresponse


def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    if event['RequestType'] == 'Delete':
        logger.info('Deleted!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
        return

    # Simulated failure: nothing below this line runs.
    raise Exception
    logger.info('It worked!')
    cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
```
The exception shows up in the logs. But then the stack gets stuck, because the cfnresponse callback never happened and CF doesn’t know there was a problem.
It took exactly an hour to fail, which suggests CF hit some internal, fallback timeout. My stack timeout was set to five minutes. We can see it retry the Lambda function once a minute for three minutes, but then it never tries again in the remaining 57 minutes. I got the same delays in reverse when it tried to roll back (which is really just another update to the previous state). And since the rollback failed too, I had to manually edit the Lambda function code and remove the exception to get it to finish rolling back.
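For context on what CF was waiting for during that hour: the callback is a small JSON document that gets sent with an HTTP PUT to the pre-signed URL in event['ResponseURL']. Here’s a simplified sketch of roughly the body that cfnresponse.send builds. It is not the module’s actual source, build_response_body is a made-up name, and the event and context values are invented for illustration.

```python
import json
from types import SimpleNamespace


def build_response_body(event, context, status, data, reason=None):
    """Build the JSON document CloudFormation waits for (simplified sketch)."""
    return json.dumps({
        'Status': status,  # 'SUCCESS' or 'FAILED'
        'Reason': reason or 'See CloudWatch log stream: ' + context.log_stream_name,
        'PhysicalResourceId': context.log_stream_name,
        # These three are echoed back from the request so CF can match
        # the response to the resource it is waiting on.
        'StackId': event['StackId'],
        'RequestId': event['RequestId'],
        'LogicalResourceId': event['LogicalResourceId'],
        'Data': data,
    })


# Made-up event and context values, for illustration only.
example_event = {
    'RequestType': 'Create',
    'ResponseURL': 'https://example.invalid/presigned-url',
    'StackId': 'example-stack-id',
    'RequestId': 'example-request-id',
    'LogicalResourceId': 'ExampleResource',
}
example_context = SimpleNamespace(log_stream_name='example-log-stream')

body = build_response_body(example_event, example_context, 'FAILED', {})
print(body)
```

If the function dies before that document is sent, CF has nothing to go on, which matches the hang described above.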
Maybe this is a bug? Either way, there’s a workaround.
You should usually only catch specific errors that you know how to handle. It’s an anti-pattern to use except Exception. But in this case we need to guarantee that the callback always happens. In this one situation (not in general) we need to catch all exceptions:
```python
import logging

import cfnresponse


def handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    try:
        if event['RequestType'] == 'Delete':
            logger.info('Deleted!')
            cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
            return

        # Simulated failure: the next two lines never run.
        raise Exception
        logger.info('It worked!')
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except Exception:
        # Catchall: log the traceback, then make sure CF hears about it.
        logger.exception('Signaling failure to CloudFormation.')
        cfnresponse.send(event, context, cfnresponse.FAILED, {})
```
Now the failure is visible to CF and it doesn’t wait.
You should use this pattern in every Custom Resource: catch all exceptions and return a FAILED result to CF. You can still catch more specific exceptions inside the catchall try/except, ones specific to the feature you’re implementing, but you need that catchall to ensure the result returns when the unexpected happens.
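If you have several Custom Resources, one way to avoid repeating the try/except is a small decorator. This is only a sketch under some assumptions: always_respond is a hypothetical helper (it isn’t part of cfnresponse), fake_send stands in for cfnresponse.send so the example runs outside Lambda, and the Size property is invented. It also shows a specific except nested inside the catchall.

```python
import functools
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def always_respond(send, success='SUCCESS', failed='FAILED'):
    """Hypothetical helper: make sure `send` is called for every invocation."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(event, context):
            try:
                data = func(event, context) or {}
            except Exception:
                # Catchall: anything unexpected becomes a FAILED callback.
                logger.exception('Signaling failure to CloudFormation.')
                send(event, context, failed, {})
                return
            send(event, context, success, data)
        return wrapper
    return decorator


# Stand-in for cfnresponse.send that just records what it was given.
calls = []


def fake_send(event, context, status, data):
    calls.append((status, data))


@always_respond(fake_send)
def handler(event, context):
    if event['RequestType'] == 'Delete':
        return {}
    try:
        size = int(event['ResourceProperties']['Size'])
    except KeyError:
        # A specific error we know how to handle: fall back to a default.
        size = 1
    return {'Size': size}


# Missing property: handled by the specific except, reported as a success.
handler({'RequestType': 'Create', 'ResourceProperties': {}}, None)
# Bad value: ValueError is not handled above, so the catchall reports failure.
handler({'RequestType': 'Create', 'ResourceProperties': {'Size': 'big'}}, None)
print(calls)
```

In a real function you’d pass cfnresponse.send instead of fake_send. The default status strings match cfnresponse.SUCCESS and cfnresponse.FAILED.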
Happy automating!
Adam