Touching Production: What does that mean? (Part 1)

Sometimes people ask me what I do all day as an SRE. I usually reply that I touch production every day. But what does touching production even mean? Let me share a typical SRE task: preparing to touch production at eGym (the company I work for).

The Problem Statement

We have a product that allows users to create and upload images of a certain kind. Those images are stored in a bucket on Cloud Storage. The image name is a long randomized string (similar to a UUID) that is referenced in a relational database table. At some point in the past, however, we used very short image names of up to six characters. When we began making image names part of the external API, those legacy names had to go: longer, better-randomized names are harder to predict and increase security in case an attacker starts guessing image names. Some images were still using the old names, and my task was to migrate those legacy images to longer, hard-to-predict names while requests kept coming in.

Assessing The Scale And Impact

The second thing I usually do is assess the scale and the expected impact of a task. (The first thing is always making sure I understood the problem correctly by talking to the people who issued the request or who developed the systems I am about to touch.) The scale and the expected impact determine which tools I use and which approaches are feasible. Here I had to understand whether we were talking about a month-long data migration while all systems continued to serve traffic, or whether we could apply some database changes in a single transaction and be done within a minute.

I queried the read replica of the production database to get the number of rows that hold old-style image names (those with a length of six characters or less):

SELECT COUNT(id) FROM Image WHERE imageType='MY_TYPE' AND length(imageId) <= 6;

The result was something around 150k rows. That’s not much, and certainly a number I could easily handle in memory on my local machine. From the problem statement, I also knew that all new images had been getting much longer, randomized names for a long time. So the dataset we were talking about was stable and not going to change between migration planning and the actual migration. A dynamic approach was therefore not needed.

Preparing Metadata Migration

To start development, I wanted a handy copy of the dataset. I ran the select query again, but this time fetching every row and exporting the result into a CSV file:

SELECT id, imageId FROM Image WHERE imageType='MY_TYPE' AND length(imageId) <= 6;
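
The export itself is nothing fancy. With a MySQL-compatible CLI client, for example, batch mode plus a little tr is all it takes (host, credentials, and database name are placeholders, and your client of choice may differ):

$ mysql --host=<read-replica> --batch \
    -e "SELECT id, imageId FROM Image WHERE imageType='MY_TYPE' AND length(imageId) <= 6;" \
    <database> | tr '\t' ',' > dataset.csv

--batch prints tab-separated values including the header row; tr then turns the tabs into commas.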

I peeked into the resulting file to make sure I got the right thing:

$ head dataset.csv 
id,imageId
844365,hytinj
344614,hyt459
460974,hyt8is
834613,hytlf4
832009,hytmps
334627,hytug5
408177,hyt4c4
692956,hyt8u1
874342,hytb7g

I also wanted to make sure I got all the rows. So another sanity check was to count the lines of the CSV file:

$ wc -l < dataset.csv 
  155468

That looked good! Now I wanted a new image name, ideally a UUID, for every image. An easy way to get one for each line is to pipe the file through a small custom program that does exactly that. My favorite language is currently Golang, so guess which language I wrote the tool in?

package main

import (
  "bufio"
  "fmt"
  "log"
  "os"
  "github.com/google/uuid"
)
func main() {
  scanner := bufio.NewScanner(os.Stdin)
  for scanner.Scan() {
    fmt.Printf("%v,%v\n", scanner.Text(), uuid.New().String())
  }
  if err := scanner.Err(); err != nil {
    log.Fatal("scan: ", err)
  }
}

This program reads from standard input and appends a freshly generated UUID to each input line: a line reading foo,bar on standard input becomes foo,bar,1234-567-890 on standard output. This allowed me to create a new CSV file based on dataset.csv.

tail -n+2 dataset.csv | go run main.go > dataset-new.csv

Hint: tail -n+2 skips the CSV header line.

Peeking into the output gave me this:

$ head -n 3 dataset-new.csv 
844365,hytinj,cd616cba-52dd-4b81-b358-ed5e5672ae4c
344614,hyt459,88d1debe-4e9e-4482-9c06-b656efadfd62
460974,hyt8is,981d9276-2e93-47b7-962a-4ad35edf995a
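
One more bit of paranoia before moving on: the freshly generated names must not collide with each other. A UUID collision is astronomically unlikely, but checking costs a single command:

$ cut -d, -f3 dataset-new.csv | sort | uniq -d

No output means no duplicates.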

The file dataset-new.csv is now the source of truth for how the rows should look in the future. The only thing missing for the database part of this migration is a set of queries that we can apply. Sticking to my preference for small Golang tools, I modified the previously used program to look like this (the uuid import gives way to the standard library’s strings package):

func main() {
  scanner := bufio.NewScanner(os.Stdin)
  for scanner.Scan() {
    csv := strings.Split(scanner.Text(), ",")
    // csv[0] is the database id, csv[2] the freshly generated image name.
    fmt.Printf("UPDATE `Image` SET `imageId`='%v' WHERE `id`='%v';\n",
      csv[2], csv[0])
  }
  if err := scanner.Err(); err != nil {
    log.Fatal("scan: ", err)
  }
}

This would create SQL queries based on the data in the CSV. I saved the queries in a file for later use:

$ go run main.go < dataset-new.csv > migration.sql

And then I ran the usual sanity checks:

$ wc -l < migration.sql
  155467
$ head -n 3 migration.sql
UPDATE `Image` SET `imageId`='cd616cba-52dd-4b81-b358-ed5e5672ae4c' WHERE `id`='844365';
UPDATE `Image` SET `imageId`='88d1debe-4e9e-4482-9c06-b656efadfd62' WHERE `id`='344614';
UPDATE `Image` SET `imageId`='981d9276-2e93-47b7-962a-4ad35edf995a' WHERE `id`='460974';

That was looking good! The queries for updating the image metadata table in the relational database are done. But the actual storage objects also need to be renamed for the new references to be valid.

Preparing Storage Object Migration

Preparing the storage object migration turned out to be a bit more complicated. We not only store the original image data on Cloud Storage, we also store variations of each file. Those variations have object names that follow a particular pattern. So for an image named foo we store at least three objects in the bucket:

  • foo: The original
  • foo_b: A variation of the original
  • foo_m: Another type of variation

These variations are present for all objects I potentially had to touch. From the documentation, I could also see that there might be yet another variation, foo_l. However, it was not clear whether those were still in the bucket or already deprecated. I had to find out before I could continue.

I got myself the list of all items in the bucket using the gsutil command:

$ gsutil ls gs://my-bucket/dir1/dir2/ > objects.txt

That yielded a very long list of object paths:

$ head -n 3 objects.txt 
gs://my-bucket/dir1/dir2/<random string>
gs://my-bucket/dir1/dir2/<random string>_b
gs://my-bucket/dir1/dir2/<random string>_m

To skip non-variations, I used grep to match on the underscore (which we use in variation names only) and piped the result through sed to extract the variation type from each object path:

$ grep '_' < objects.txt | sed -E 's/^(.*)_(.+)$/\2/'
b
m
b
...

I got a long list of variations, way too many for a human to check by hand. Since I was only interested in the types of variations, not their number, I used the popular dream team of sort and uniq to minimize the dataset:

$ grep '_' < objects.txt |  sed -E 's/^(.*)_(.+)$/\2/' | sort | uniq
b
m

This is certainly not a very efficient way of doing it, but on a dataset as small as this one, the whole operation took only a couple of seconds. Luckily, the result showed that I only had to care about the b and m variations; these are the only ones currently in production. Cool!
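
Another cross-check that pays off before generating thousands of copy commands: every legacy name from the database should actually have an object in the bucket. A quick sketch using comm (the temporary file names are just suggestions):

$ tail -n+2 dataset.csv | cut -d, -f2 | sed 's|^|gs://my-bucket/dir1/dir2/|' | sort > expected.txt
$ grep -v '_' objects.txt | sort > existing.txt
$ comm -23 expected.txt existing.txt

comm -23 prints only lines that appear in expected.txt but not in existing.txt, so any output here would mean an image row without a matching object.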

One thing I had to keep in mind: if I changed the image names in the relational database, I also had to change them on Cloud Storage at the same time. But there is no such thing as “at the same time” in computing, so I needed a migration strategy that ensures consistency at all times. The strategy was rather simple, though:

  1. Copy all affected objects to their new names.
  2. Run the database transaction (see the sketch right after this list).
  3. Remove the old objects after a cool-down period (image names may be cached, we may want to roll back the transaction, you name it…).
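
Step 2 then boils down to wrapping migration.sql in a transaction when applying it, so that the rename is all-or-nothing. A sketch of what that could look like on migration day (assuming a MySQL-compatible database; host, credentials, and database name are placeholders):

$ { echo 'START TRANSACTION;'; cat migration.sql; echo 'COMMIT;'; } | mysql --host=<primary> <database>

Either all 155,467 rows get their new names, or none of them do.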

I had the SQL queries already. The two missing pieces were the bucket modifications for copying and removing objects. Since I wasn’t in a hurry, I chose to just generate myself a shell script that calls gsutil over and over again. Again, this is not a very efficient solution. In SRE, we choose efficiency over simplicity very selectively. As a rule of thumb, you could say: if it fits into memory, consider manipulating it there instead of introducing additional complexity.

Generating the migration scripts was as easy as changing a couple of lines in my little Golang helper program.

func main() {
  scanner := bufio.NewScanner(os.Stdin)
  for scanner.Scan() {
    csv := strings.Split(scanner.Text(), ",")
    // csv[1] is the old image name, csv[2] the new one.
    // Copy the original object and both of its variations.
    fmt.Printf("gsutil cp 'gs://my-bucket/dir1/dir2/%v'   "+
      "'gs://my-bucket/dir1/dir2/%v'\n",
      csv[1], csv[2])
    fmt.Printf("gsutil cp 'gs://my-bucket/dir1/dir2/%v_b' "+
      "'gs://my-bucket/dir1/dir2/%v_b'\n",
      csv[1], csv[2])
    fmt.Printf("gsutil cp 'gs://my-bucket/dir1/dir2/%v_m' "+
      "'gs://my-bucket/dir1/dir2/%v_m'\n",
      csv[1], csv[2])
  }
  if err := scanner.Err(); err != nil {
    log.Fatal("scan: ", err)
  }
}

I ran the program to generate a shell script.

$ go run main.go < dataset-new.csv > step1-copy.sh
$ head -n 3 step1-copy.sh
gsutil cp 'gs://my-bucket/dir1/dir2/hytinj'   'gs://my-bucket/dir1/dir2/cd616cba-52dd-4b81-b358-ed5e5672ae4c'
gsutil cp 'gs://my-bucket/dir1/dir2/hytinj_b' 'gs://my-bucket/dir1/dir2/cd616cba-52dd-4b81-b358-ed5e5672ae4c_b'
gsutil cp 'gs://my-bucket/dir1/dir2/hytinj_m' 'gs://my-bucket/dir1/dir2/cd616cba-52dd-4b81-b358-ed5e5672ae4c_m'
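
And the usual line count check: three gsutil calls per image means the script should have exactly 3 × 155,467 = 466,401 lines.

$ wc -l < step1-copy.sh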

This script can be run from the shell of a maintenance host with access to the production data. I needed the same for the deletion step. At this point you can probably predict what the code will be:

func main() {
  scanner := bufio.NewScanner(os.Stdin)
  for scanner.Scan() {
    csv := strings.Split(scanner.Text(), ",")
    fmt.Printf("gsutil rm 'gs://my-bucket/dir1/dir2/%v'\n", csv[1])
    fmt.Printf("gsutil rm 'gs://my-bucket/dir1/dir2/%v_b'\n", csv[1])
    fmt.Printf("gsutil rm 'gs://my-bucket/dir1/dir2/%v_m'\n", csv[1])
  }
  if err := scanner.Err(); err != nil {
    log.Fatal("scan: ", err)
  }
}

I’ll spare you the output, but it is a list of shell commands of the form gsutil rm <object>, saved as step2-remove.sh.
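
When the scripts eventually run on a maintenance host, I plan to pipe their output through tee so that there is a log of every single copy and delete (the log file names are just a suggestion):

$ bash step1-copy.sh 2>&1 | tee step1-copy.log
$ bash step2-remove.sh 2>&1 | tee step2-remove.log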

Due Diligence

Humans sometimes make mistakes. Humans who work on automation or script migrations sometimes create disasters. To avoid disasters (or at least the obvious ones), every piece of code that changes production has to go through a review process on my team. I submitted the files step1-copy.sh, migration.sql, and step2-remove.sh for review and can’t wait to see what mistakes my fellow engineers will find. They are the best at spotting those. 🧐 Only after scripts and transactions have been reviewed do we actually touch production.

I hope you enjoyed this little peek into how one of the many forms of touching production is prepared.