Touching Production: Review and Change (Part 2)

Two weeks ago I wrote about touching production. I described how I prepared scripts and queries for a migration of image names. The images are stored in Cloud Storage and their object names are referenced in a relational database. I came up with three steps for the migration, all of which could be applied while the site continues to serve the images.

  • Copy old storage objects to new storage objects.
  • Update the table in the relational database to refer to the new name.
  • Remove old storage objects from Cloud Storage.

For the first and the third step I came up with shell scripts: basically hundreds of thousands of lines calling gsutil, the command-line utility for administering Cloud Storage. The second step was a file containing about 150k SQL UPDATE statements.
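
For illustration, a single object's entry in the copy script looked roughly like this (the object names here are made up; a real excerpt of the removal script follows further down):

~~~plain
gsutil cp 'gs://my-bucket/dir1/dir2/oldname' 'gs://my-bucket/dir1/dir2/newname'
~~~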

The Review

It is ~good~ required practice in my team that we review each other's work. The systems we manage are incredibly complex and every one of us has a different mental model of how our systems work. And then there is how the systems actually work. 🙃 Reviews are therefore essential to avoid the biggest disasters and keep things running smoothly.

Pushing a change of roughly a million lines through review needs good communication. It is not enough to just drop the files in a pull request and wait for someone to pick it up. So I explained to the reviewer how I came up with the files: what I believe the system looks like today, and how I would like it to look and behave tomorrow. This may be the most underappreciated part of conducting reviews: having a chance to synchronize mental models inside SRE and across teams. The commit message is often an executive summary of what has been done and what the overall goal of the change is. However, pairing up and walking someone through my thought process has not only been an extremely valuable feedback loop for me but has also led to better code in the end.

Back to the migration change: The reviewer came up with some additional test cases, and together we developed a plan for applying the migration scripts. We also had an interesting discussion about whether or not the shell scripts would be I/O bound.

The Shell Scripts: Trade-offs

The shell scripts each had ~450k lines calling gsutil. As far as I knew, gsutil had no batch mode, which left me with only two options:

  • Call gsutil, a thoroughly tested and trusted tool, again and again. This puts a lot of overhead on the kernel for spawning new processes and context switching between them.
  • Write a tool that repeatedly makes calls to the API, thus implementing the missing batch behavior. This tool would need to be tested thoroughly before being ready for showtime in production.

Our SRE team is small, which implies that engineering time is almost always more precious than computing time. That's why I decided to spend some compute resources rather than invest another two or three hours in a custom tool that we would use only once. But how much compute are we talking about here? And what is the bottleneck when we run the scripts? My reviewer suggested it might be I/O bound because gsutil operations often take up to a second, with most of the time spent waiting for Cloud Storage to return how the operation went. I was under the impression that whenever we would wait for a call to return, we could schedule another process to do its magic (for example, starting up).
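
One quick way to get a feeling for the overhead of the first option is to time a gsutil call that does no remote work at all and compare it against one that makes a single API round trip. A sketch (we did not run this as part of the migration; it just illustrates where the time goes, and the object name is made up):

~~~plain
$ time gsutil version                                 # interpreter start-up only, no API call
$ time gsutil stat 'gs://my-bucket/dir1/dir2/hytinj'  # start-up plus one Cloud Storage round trip
~~~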

To find out, I created an instance with 64 CPU cores and more than enough memory to fit the processes and data.
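
The instance itself is nothing special; a command along these lines would create it (the name, machine type, and zone are placeholders, not necessarily what we used):

~~~plain
$ gcloud compute instances create migration-runner \
    --machine-type=n1-standard-64 \
    --zone=europe-west1-b
~~~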

We’ll have a look at the step2-remove.sh script, but more or less the same applies to the other shell script, too.

The file’s content looked like this:

~~~plain
gsutil rm 'gs://my-bucket/dir1/dir2/hytinj'
gsutil rm 'gs://my-bucket/dir1/dir2/hytinj_b'
gsutil rm 'gs://my-bucket/dir1/dir2/hytinj_m'
~~~

In total, the file had 466,401 lines like that.
To distribute the workload across all 64 cores, I split the file into chunks of 7288 lines, that is 466,401 divided by 64 (rounded down) plus 1 to make up for the rounding error.
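
A quick way to double-check that arithmetic in the shell:

~~~plain
$ wc -l step2-remove.sh
466401 step2-remove.sh
$ echo $(( 466401 / 64 + 1 ))
7288
~~~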

~~~plain
$ split -l 7288 step2-remove.sh step2-remove-sharded.
~~~

That gave me 64 files of roughly the same length:

~~~plain
$ ls -lh step2-remove-sharded.*
-rw-r--r--  1 danrl  staff   351K Aug  3 10:35 step2-remove-sharded.aa
-rw-r--r--  1 danrl  staff   351K Aug  3 10:35 step2-remove-sharded.ab
-rw-r--r--  1 danrl  staff   351K Aug  3 10:35 step2-remove-sharded.ac
✂️
-rw-r--r--  1 danrl  staff   351K Aug  3 10:35 step2-remove-sharded.cl
~~~
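
At this point a cheap sanity check is to confirm that the split produced the expected number of shards and did not drop any lines, for example:

~~~plain
$ ls step2-remove-sharded.* | wc -l
64
$ cat step2-remove-sharded.* | wc -l
466401
~~~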

To run them in parallel, I looped over the shards and sent the processes to the background:

~~~plain
$ for FNAME in step2-remove-sharded.*; do sh $FNAME & done
~~~
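
One small note: the loop returns as soon as the last shard has been sent to the background. If the shell should block until every shard has finished, for example to chain a verification step, a trailing wait does that:

~~~plain
$ for FNAME in step2-remove-sharded.*; do sh "$FNAME" & done; wait
~~~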

Looking at htop and iftop, I had the feeling that the bottleneck really was the CPU here. The poor thing was context switching all the time between processes that were desperately waiting for I/O.

[Screenshot: htop]
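
The htop screenshot only gives a visual impression; had we wanted hard numbers on the context switching, something like vmstat would have done the job (a sketch, we did not record this at the time):

~~~plain
$ vmstat 1 5    # the 'cs' column reports context switches per second
~~~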

As expected, memory and bandwidth usage were rather low. The instance had tens of gigabytes of memory left unused and could easily have handled 10 Gbit/s of network I/O.

[Screenshot: iftop]

In total, the shell scripts ran for three hours, costing us a little less than USD 5. That is orders of magnitude cheaper than any investment in engineering time. Sometimes a trade-off means not building the fancy solution but rather throwing compute or memory at a one-time problem.

The SQL Script: Managing Risk

The more interesting, because more delicate, part of the migration was running the SQL statements on the live production database. Relational databases are a piece of work… Not necessarily a distributed system designer’s dream but that’s another story.

When the reviewer and I deployed the SQL change, we gradually took on more risk as we proceeded. We started with a single statement that we knew affected only an image belonging to a well-known account.

After executing this single statement, we ran some tests to see if everything worked as expected, including the caches. Since all tests were green, we went for ten statements. Then we tested again. We increased to 100 statements, then 1k, and finally settled on a chunk size of 10k statements for the rest of the migration.
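
Mechanically, this is the same split-and-apply trick as with the shell scripts. A sketch of how the 10k chunks could be applied, assuming made-up file names and a MySQL-style client (the actual database and client do not matter for the pattern):

~~~plain
$ split -l 10000 image-rename-updates.sql update-chunk.
$ mysql my_database < update-chunk.aa   # apply one chunk, then re-run the tests before the next
~~~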

This ramp-up of risk (every change carries some risk) is pretty common when we make changes to production. We like to be able to roll back small changes as early as possible so that only a few customers are affected. On the other hand, we like to get the job done eventually. We know that engineering time is precious and that we hate boring, repetitive work. We use this pattern of increasing by orders of magnitude all the time, from traffic management (e.g. 0.1% of users hitting a new release) to migrating storage objects or table rows.

Conclusion

With a hands-on approach and by making reasonable trade-offs, we were able to migrate the legacy image names unnoticed by our users. Once again we touched production without causing a disaster. As we say in my team whenever someone asks us what we do: We touch production, every day, all day long, and sometimes during the night.