A Book from 2017: Stretch Goals and Prescriptions

Happy New Year!

Today’s post is a little outside my usual DevOps geekery, but it’s influenced my work and my career choices this year, so I wanted to share it.

For the record, I have zero connections to 3M.

In my teens, I noticed that whenever I bought something with the 3M logo it was noticeably better than the other brands. I didn’t know what 3M was, but this pattern kept repeating and I started to always choose them. Years later, deep inside a career in technology, I was still choosing 3M. I started to ask myself how they did it. Why were all their products better than everyone else’s?

I didn’t know anyone at 3M, so I found a book. The 3M Way to Innovation: Balancing People and Profit.

[Image: cover of The 3M Way to Innovation]

Balance? At work? And still better than everyone else? Bring it on.

The book approaches 3M through their innovations. They built hugely successful product lines in everything from sandpaper to projectors, and it turns out other companies have long looked to them as the top standard for the innovation that drives such diverse success. As I worked through the book, one thing really stuck with me: 3M’s definition of Stretch Goals.

I’ve seen a lot of managers ask their teams what can be accomplished in the next unit of time (sprint, quarter, etc.). Often, the team replies with a list that’s shorter than the manager would like. The manager then over-assigns the team by adding items as “stretch goals”. If the team works hard enough and accomplishes enough, they’ll have time to stretch themselves to meet these goals. The outcome I usually see is pressure for teams to work longer hours (with no extra pay) so they can deliver more product (at no extra cost to the company).

This book described 3M’s stretch goals very differently, which I’ll summarize in my own words because it’s characterized throughout the book and there’s no single quote that I think captures it. 3M sets these goals to stretch an aspect of the business that’s needed for it to remain a top competitor, and they’re deliberately ambitious. For example, one that 3M actually used: 30% of annual sales should come from products introduced in the last four years. Goals like these drive innovation because they’re too big to meet with the company’s current practices.

The key difference is that 3M isn’t trying to stretch the capacity of individuals. They’re not trying to increase Scrum points by pushing everyone to work late. They’re setting targets for the company that are impossible to meet unless the teams find new ways to work. They’re driving change by looking for things that can only be done with new approaches; things that can’t be done just by working longer hours. And after they set these goals, they send deeply committed managers out into the trenches to help their teams find and implement these changes. Most of the book is about what happens in those trenches. I highly recommend it.

There’s one other thing from the book I want to highlight: the process of innovation doesn’t simplify into management practices you can choose off a menu. There’s more magic to it than that. It takes skilled leaders and a delicate combination of freedom and pressure to build a company where the best engineers can do their best work, and trying to reduce that to a prescription doesn’t work. Here’s a quote from Dick Lidstad, one of the 3M leaders interviewed for the book, talking about staff from other companies who come to 3M looking to learn some of the innovation practices so they can implement them in their own teams:

They want to take away one or two things that will help them to innovate. … We say that maintaining a climate in which innovation flourishes may be the single biggest factor overall. As the conversation winds down, it becomes clear that what they want is something that is easily transferable. They want specific practices or policies, and get frustrated because they’d like to go away with a clear prescription.

I heard truth in that quote. Despite being a believer in the value of tools like Scrum, which are supposed to foster creativity and innovation, I’ve spent a lot of my career held back by the overhead of process that’s good in principle but applied with too little care to be effective. Ever spent an entire day in Scrum ceremonies? There’s more value in the experience of 3M’s teams overall than there is in any list of processes.

This book was written in 2000, but not only has 3M stock continued to perform well, I also found many parallels between the stories the author tells and my own experience in the modern tech world. It’s heavy with references and first-hand interviews, and I think it’s a valuable read for anyone in tech today.

If you read it, let me know what you think!

Adam

Terraform: Get Data with Python

Update 2017-12-26: There’s now a more complete, step-by-step example in the source showing how to use terraform’s data resource, pip, and this decorator.

Good morning!

Sometimes I have data I need to assemble during terraform’s apply phase, and I like to use Python helper scripts to do that. Awesomely, terraform natively supports using an external program, like a Python script, to populate a data resource:

data "external" "cars_count" {
  program = ["python", "${path.module}/get_cool_data.py"]

  query = {
    thing_to_count = "cars"
  }
}

output "cars_count" {
  value = "${data.external.cars_count.result.cars}"
}

A slick, easy way to drop out of terraform and use Python to grab what you need (although it can get you into trouble if you abuse it).

The Python script has to follow a protocol that defines formats, error handling, etc. It’s minimal but fiddly, and if you need more than one external data script it’s better to modularize than copy and paste. So I wrote a pip-installable package with a decorator that implements the protocol for you:

from terraform_external_data import terraform_external_data

@terraform_external_data
def get_cool_data(query):
    return {query['thing_to_count']: '3'}

if __name__ == '__main__':
    get_cool_data()

And it’s available on PyPI! Just pip install terraform_external_data. Here’s the source.
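If you’re curious what the decorator saves you from, here’s a rough sketch of the protocol implemented by hand. This is based on my reading of terraform’s external data source docs, not the package’s internals, so treat it as a sketch:

import json
import sys

def get_cool_data():
    # terraform passes the query block as a JSON object on stdin.
    query = json.load(sys.stdin)
    try:
        result = {query['thing_to_count']: '3'}
    except KeyError as error:
        # On failure: write a message to stderr and exit nonzero so
        # terraform surfaces the error instead of using bad data.
        sys.stderr.write('Missing query key: {}\n'.format(error))
        sys.exit(1)
    # On success: print a JSON object whose keys and values are all strings.
    json.dump(result, sys.stdout)

if __name__ == '__main__':
    get_cool_data()

You can try a script like this outside terraform by piping a query to it, something like echo '{"thing_to_count": "cars"}' | python get_cool_data.py.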

Happy terraforming,

Adam

EC2: Stateful Statelessness


Hello!

While studying for an AWS certification, I rediscovered a fiddly networking detail: although ICMP’s ping is stateless, EC2 security groups will pass return ping traffic even when only one direction is defined in their rules. I wanted to see this in action, so I built a lab.

If you just asked, “Wat❓”, keep reading. Skip to the next section if you just want the code.

Background

Network hosts using stateful protocols (like TCP) distinguish between packets that are part of an established connection and packets that are new. For example, when I SSH (which runs on TCP) from A to B:

  1. A asks B to start a new connection.
  2. B agrees.
  3. A and B exchange bunches of packets that are part of the connection they agreed to.
  4. A and B agree to close the connection.

There’s a difference between a new packet and a packet that’s part of an ongoing connection. That means the connection, and its packets, have state (e.g. new vs established). Stateful firewalls (vs stateless) are aware of this:

  1. A asks B to start a new connection.
  2. Firewalls in between allow these packets if there is an explicit rule allowing traffic from A to B.
  3. A and B exchange bunches of packets.
  4. Firewalls in between allow the packets from A to B because of the explicit rule above. However, they allow the return traffic from B to A even if there is no explicit rule to allow it. Since B agreed to the connection the firewall assumes that packets in that connection should be allowed.

In EC2, this is why you only need an outgoing rule on A’s Security Group (SG) and an incoming rule on B’s Security Group to SSH from A to B. EC2 SGs are stateful, and allow the return traffic implicitly.
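As a rough illustration (the group IDs and addresses here are made-up placeholders), the SSH case needs only these two rules, sketched here with boto3:

import boto3

ec2 = boto3.client('ec2')

# Hypothetical IDs and addresses; substitute your own.
sg_a = 'sg-11111111'   # A's security group
sg_b = 'sg-22222222'   # B's security group
a_ip = '10.0.1.10/32'  # A's private IP
b_ip = '10.0.2.10/32'  # B's private IP

# A: allow outgoing SSH to B. No incoming rule needed for the return traffic.
ec2.authorize_security_group_egress(
    GroupId=sg_a,
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 22,
        'ToPort': 22,
        'IpRanges': [{'CidrIp': b_ip}],
    }],
)

# B: allow incoming SSH from A. No outgoing rule needed for the return traffic.
ec2.authorize_security_group_ingress(
    GroupId=sg_b,
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 22,
        'ToPort': 22,
        'IpRanges': [{'CidrIp': a_ip}],
    }],
)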

Ok, here’s the gnarly bit. ICMP (the protocol behind ping) is stateless. Hosts don’t have a negotiation phase where they agree to establish a connection. They just send packets and hope. So, doesn’t that mean I need to write explicit firewall rules in the SGs to allow the return traffic? If the firewall can’t see the state of the connection, it won’t be able to implicitly figure out that it should allow that traffic, right?

Nope. Security Groups infer state based on timeouts and packet types. ICMP pings are ECHO requests answered by ECHO replies. If the SG has seen a request within the timeout, it makes the educated guess that replies are essentially part of “established” connections and allows them. This is what I wanted to see in action.

The Lab

I set up a VPC with two hosts, A (10.0.1.70) and B (10.0.2.35). They’re in different subnets, but the ACLs allow all traffic so they don’t influence the test. Here are the SG rules for A (10.0.0.0/16 covers the entire VPC):

[Screenshots: test_a inbound and outbound security group rules]

And the rules for B:

[Screenshots: test_b inbound and outbound security group rules]

A allows outgoing ICMP to B, and B allows incoming ICMP from A. The return traffic is not allowed by any rules.
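Expressed as boto3 calls, those ICMP rules look roughly like this (the group IDs are hypothetical, and I’ve left out the SSH access you’d also need to administer the hosts):

import boto3

ec2 = boto3.client('ec2')

# Hypothetical group IDs for A and B.
test_a_sg = 'sg-aaaa1111'
test_b_sg = 'sg-bbbb2222'

# A: outgoing ICMP toward B. FromPort/ToPort of -1 means all ICMP types/codes.
ec2.authorize_security_group_egress(
    GroupId=test_a_sg,
    IpPermissions=[{
        'IpProtocol': 'icmp',
        'FromPort': -1,
        'ToPort': -1,
        'IpRanges': [{'CidrIp': '10.0.2.35/32'}],
    }],
)

# B: incoming ICMP from A. Again, -1 means all ICMP types/codes.
ec2.authorize_security_group_ingress(
    GroupId=test_b_sg,
    IpPermissions=[{
        'IpProtocol': 'icmp',
        'FromPort': -1,
        'ToPort': -1,
        'IpRanges': [{'CidrIp': '10.0.1.70/32'}],
    }],
)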

The Test Script

I didn’t find a way to send just replies without requests in Linux, so I bodged together a Python script:

"""
This is a stripped-down version of ping that allows you to send a reply without responding to a request. This was needed
to test the details of how Amazon EC2 security groups handle state with ICMP traffic. You shouldn't use this for normal
pings.

The ping implementation was based on Samuel Stauffer's python-ping: https://github.com/samuel/python-ping (which only
works with Python 2).

This must be run as root.
You must tell the Linux kernel to ignore ICMP before you run this or it'll eat some of the traffic:
    echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_all
"""

import argparse, socket, struct, time

def get_arguments():
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--send-request', metavar='IP_ADDRESS', type=str, help='IP address to send ECHO request.')
    parser.add_argument('--receive', action='store_true', help='Wait for a reply.')
    parser.add_argument('--send-reply', metavar='IP_ADDRESS', type=str, help='IP address to send ECHO reply.')
    return parser.parse_args()

def receive(my_socket):
    while True:
        recPacket, addr = my_socket.recvfrom(1024)
        icmpHeader = recPacket[20:28]
        icmp_type, code, checksum, packetID, sequence = struct.unpack("bbHHh", icmpHeader)
        print('Received type {}.'.format(icmp_type))

def ping(my_socket, dest_addr, icmp_type):
    dest_addr = socket.gethostbyname(dest_addr)
    bytesInDouble = struct.calcsize("d")
    data = (192 - bytesInDouble) * b"Q"  # Bytes literal so the packet assembles under Python 3 as well as 2.
    data = struct.pack("d", time.time()) + data
    dummy_checksum = 1 & 0xffff
    dummy_id = 1 & 0xFFFF
    # Header is type (8), code (8), checksum (16), id (16), sequence (16)
    header = struct.pack("bbHHh", icmp_type, 0, socket.htons(dummy_checksum), dummy_id, 1)
    packet = header + data
    my_socket.sendto(packet, (dest_addr, 1))

if __name__ == '__main__':
    args = get_arguments()
    icmp = socket.getprotobyname("icmp")
    my_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
    if args.send_request:
        ping(my_socket, args.send_request, icmp_type=8)  # Type 8 is ECHO request.
    if args.receive:
        receive(my_socket)
    if args.send_reply:
        ping(my_socket, args.send_reply, icmp_type=0)  # Type 0 is ECHO reply.

You can skip reading the code; the important thing is that we can individually choose to listen for packets, send ECHO requests, or send ECHO replies:

python ping.py --help
usage: ping.py [-h] [--send-request IP_ADDRESS] [--receive]
               [--send-reply IP_ADDRESS]

optional arguments:
  -h, --help            show this help message and exit
  --send-request IP_ADDRESS
                        IP address to send ECHO request. (default: None)
  --receive             Wait for a reply. (default: False)
  --send-reply IP_ADDRESS
                        IP address to send ECHO reply. (default: None)

The Experiment

SSH to each host and tell Linux to ignore ICMP traffic so I can use the script to capture it (see docstring in the script above):

sudo su -
echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_all
exit

Normal Ping

I send a request from A to B and expect the reply from B to A to be allowed. Here’s what happened:

Remember A is 10.0.1.70 and B is 10.0.2.35.

[Screenshots: normal ping test, terminals on A and B]

Ok, nothing surprising. I sent a request from A to B and started listening on A; a little later I sent a reply from B to A and it was allowed. You can do this test with the normal Linux ping command (but not until you tell the kernel to stop ignoring ICMP traffic). This test just validates that my bodged Python actually works.
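If you want to reproduce it, the test boils down to roughly these commands (assuming the script is saved as ping.py on both hosts):

# On A (10.0.1.70): send a request to B, then stay listening for the reply.
sudo python ping.py --send-request 10.0.2.35 --receive

# Then on B (10.0.2.35): manually send a reply back to A.
sudo python ping.py --send-reply 10.0.1.70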

Reply Only

First we wait a bit. The previous test sent a request from A to B, which started a timer in the SG. Until that timer expires, reply traffic will be allowed. We need to wait for that expiration before this next test is valid.

[Screenshots: reply-only test, terminals on A and B]

Boom! I start listening on A, without sending a request. On B I send a reply to A but it never arrives. The Security Group didn’t allow it. This demonstrates that EC2 Security Groups are inferring the state of ICMP pings by reading their type.
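Roughly, the only difference from the commands above is that A never sends a request:

# On A: just listen.
sudo python ping.py --receive

# On B: send a reply to A anyway.
sudo python ping.py --send-reply 10.0.1.70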

Other Tests

I also tried a couple other things that I’ll leave to you to reproduce in your own lab if you want to see them.

  • Start out like the normal test. Send a request from A to B and start listening on A. Then send several replies from B to A. They’re all allowed. This shows that the SG isn’t counting to ensure it only allows one reply for each request; if it has seen just one request within the timeout it allows replies even if there are multiple.
  • Edit the script above to set the hardcoded ID to be different on A than it is on B. Then nothing works at all. I’m not actually sure what causes this. Could be that the SG is looking at more than just the type, but it could also be in the kernel or the network drivers or somewhere else I haven’t thought of. If you figure it out, message me!

Conclusion

I had free time over the holidays this year! Realistically, understanding this demo isn’t a priority for doing good work on AWS. I just enjoy unwrapping black boxes to see how the parts move.

Be well!

Adam

AWS Certification and Networking

Hello!

Recently I’ve been working on the AWS Certification exams, and I’ve found they require a much deeper understanding of networking on the platform than I had. For example, ICMP is a stateless protocol, so to ping between two servers do you need ingress and egress rules on both Security Groups? I knew from past experience with iptables that the answer varies by setup, but I didn’t know how it worked in EC2.

For me, gnarly networking is easiest to learn hands-on. Docs get me part of the way but I really need to engineer it myself before I’ll remember it. To prep for certification I ended up building a sandbox environment in my AWS account where I could play around. It took some doing; many AWS patterns come pre-baked with Security Groups, ACLs, etc. that make everything work, but I wanted everything turned off so I could verify what was really needed for different traffic flows. If I delete the egress rule on one side of a connection, does traffic still flow? Hard to validate if there are broad, generic rules in place. Easy to validate if only exactly what’s needed is present.

Since it was tricky, I published the automation for the sandbox I’ve been using. If you want to do your own deep dive of networking in AWS, hopefully this will help you out.

github.com/operatingops/aws_study

[Diagram: the sandbox environment]

Happy Operating!

Adam

Production-ready Scripts and Python

Production is hard. Even a simple script that looks up queue state and sends it to an API gets complex in prod. Without tests, the divide by zero case you missed will mask queue overloads. Someone won’t see the required argument you didn’t enforce and will break everything when they accidentally publish a null value. You’ll forget to timestamp one of your output lines, and then when the queue goes down you won’t be able to correlate queue status to network events.

Python can help! Out of the box it can give you essential but often-skipped features, like these (there’s a small sketch after the list):

  • Automated tests for multiple platforms.
  • A --simulate option.
  • Command line sanity like a --help option and enforcement of required arguments.
  • Informative log output.
  • An easy way to build and package.
  • An easy way to install a build without a git clone.
  • A command that you can just run like any other command. No weird shell setup or invocation required.
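Here’s a minimal sketch of the kind of script I mean, just to make the list concrete. The names (like --queue-name) are hypothetical, and it leaves out the tests and packaging pieces:

import argparse
import logging

def parse_args():
    parser = argparse.ArgumentParser(description='Report queue depth to an API.')
    # argparse enforces required arguments and generates --help for free.
    parser.add_argument('--queue-name', required=True, help='Queue to inspect.')
    parser.add_argument('--simulate', action='store_true',
                        help='Log what would be sent without calling the API.')
    return parser.parse_args()

def main():
    args = parse_args()
    # Timestamped log lines you can correlate with other events later.
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')
    depth = 0  # Placeholder: look up the real queue depth here.
    if args.simulate:
        logging.info('Simulate: would report depth %s for %s', depth, args.queue_name)
        return
    logging.info('Reporting depth %s for %s', depth, args.queue_name)
    # Placeholder: send the depth to your API here.

if __name__ == '__main__':
    main()

Forget --queue-name and argparse refuses to run; pass --help and you get usage output without writing any of it yourself.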

It can be a little tricky, though, if you haven’t done it before, so I wrote a project that demonstrates it for you. It also includes an example of a script that isn’t ready for prod.

Hopefully this will save you from some of the many totally avoidable, horrible problems that bad scripts have caused in my prods.

Thanks for reading!

Adam