SREcon Europe (Report)

What an exhausting conference! So many learnings, valuable conversations, and interesting workshops.

Day One: Roundup

We all started the conference together, in the large ballroom. I learned from Theo Schlossnagle and Heinrich Hartmann that data ingestion at Circonus never stops and that they had to apply impressive engineering to handle the massive load that billions of time series produce.

Afterwards, Corey Quinn and John Looney entered the stage with their Song of Ice and TireFire. I’d rather not spoil this one for you. Suffice it to say, we had many laughs! It is a relaxing, popcorn-type must-see talk.

The third talk of the day, delivered by Simon McGarr, was about GDPR. Significantly less laughter in that one. Oops! The smell of metaphorical dead bodies filled the room. I can recommend the talk, and if even half of it applies to your company or product, you won’t be laughing for a while to come. Phew. I am still undecided about what to think of GDPR in general.

After the opening talks, I spent most of day one in workshops. Since I missed it in Santa Clara earlier this year, I joined the How to design a distributed system in 3 hours workshop held by Google folks. The workshop included an exercise in Non-abstract Large-scale System Design (see chapter 12 of The Site Reliability Workbook). This is where my SRE flash cards came in handy. I use them to stay in the game of system design because I have a hard time remembering all the numbers.

Day Two: Dealing with Dark Debt: Lessons Learnt at Goldman Sachs

I met Vanessa Yiu at the speakers’ reception on Monday. She was, like me, very excited because it was her first speaking engagement in the SRE community. Her talk was perfectly delivered and the slides were exceptionally good. I felt very happy for her, because clearly she had an awesome debut in the community. The talk itself surprised me, though.

One out of three employees at Goldman Sachs is an engineer, including SRE. Woot? Yes, Goldman Sachs is a very technical company. I did not know that. On Dark Debt: In contrast to tech debt, dark debt is not recognizable at the time of creation. So we probably won’t have a tech debt tracking ticket for it on our board. Dark debt is the unforeseen tech debt, if you will. The name is derived from dark matter. Dark matter has effects on its environment, but one cannot see it (because it neither emits nor reflects light). Similar to dark matter, which interacts with its environment, dark debt interacts with hardware and software in a distributed system in unforeseen ways.

Vanessa told us about tactics to manage Dark Debt:

  • Prevention
  • Insight into the environment
  • Detection
  • Diagnose
  • Culture

I took some notes on each.

Prevention

Build sustainable ecosystems. Easier said than done, right? Inhibit the creation of tech debt to begin with. Goldman Sachs has a thingy called SecDb, a central risk management platform. Basically an object-oriented dependency graph. It has its own IDE and its own securities programming language, called Slang, and it does risk assessment for financial products. A lot of that sounded awesome, but I happen to know that programmers do not necessarily love Slang. A source that does not want to be named told me that newbies are often given Slang tickets and gain mostly non-transferable skills from them. The same source told me that SecDb is not loved by everyone and that the change process is painfully slow. Please take this with a grain of salt, as I cannot verify any of it. 🙈🙊🐵

Back to topic: SecDb was created 25 years ago and is still evolving on a daily basis. Vanessa said SecDb is basically two weeks old. How did they do that? The development process was very transparent and collaborative from the beginning. Every developer can see the entire source code. Everyone can execute the whole thing locally. Everyone can fix bugs and develop shared tools around SecDb. There is a strong focus on re-use. There is also a form of gamification to improve the code base: Developers get points for adding or removing code. However, more points are awarded for removing code.

And get this: Bots go through the code and remove parts that have not been used for a while. Triggering unused code is such a risk in the securities field that they automated removing it. Wondering how that works in practice. 🤨 From the Q&A: The bot flags the code and a human then does the actual work. I got that wrong! See Vanessa’s message below. The bot does remove the code! How cool is that?

Insight

With proper insight into the production environment, dark debt is easier to detect. I took a lot of notes here, but when I went over the notes, I figured that this is more or less just a monitoring/instrumentation framework that you can find in every other large organisation. Goldman Sachs is a Java shop, by the way. So think central collection of JRE metrics.

Diagnose

Controlled chaos is good for you. Goldman Sachs does fault injection and chaos engineering. Of course, not everywhere. However, new systems do get chaos engineered from the very beginning. For example, in production (see clarification below) there is always a certain share of orders (e.g. buy/sell orders) that gets rejected at random to make sure the retry logic in all clients is in good shape. Wondering how that works with time-critical stock orders? Microsecond trading, anyone? I guess I have no idea what exactly Goldman Sachs does, to be honest.
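
To make the idea a bit more tangible: the gist of this kind of fault injection can be as small as a middleware that fails a fixed share of requests on purpose. Here is a minimal Go sketch of that idea. The rejection rate, the HTTP transport, and all the names are my own illustration, certainly not how Goldman Sachs implements it.

package main

import (
  "fmt"
  "math/rand"
  "net/http"
  "time"
)

// rejectFraction is the share of requests we fail on purpose so that every
// client's retry logic gets exercised continuously. The value is made up.
const rejectFraction = 0.01

func chaosMiddleware(next http.Handler) http.Handler {
  return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    if rand.Float64() < rejectFraction {
      // Tell well-behaved clients to back off and retry.
      http.Error(w, "rejected by fault injection", http.StatusServiceUnavailable)
      return
    }
    next.ServeHTTP(w, r)
  })
}

func main() {
  rand.Seed(time.Now().UnixNano())
  orders := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintln(w, "order accepted")
  })
  http.ListenAndServe(":8080", chaosMiddleware(orders))
}

Clients that cannot retry, or retry badly, surface very quickly when something like this runs all the time.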

Visualise

Use tracing! Vanessa presented a couple of traces that nicely showed how visualisation can help spot problems much better than any log file. Can confirm. We recently added tracing to our most important artifacts and gained a ton of new insights from that alone. That way, we found bugs before they hit a significant number of users. I absolutely agree: Use tracing. Start now.
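
If you want a feeling for how small the first step is, here is a minimal Go sketch using the OpenTracing API. The library choice and the span and tag names are my assumptions for illustration, not what we or Goldman Sachs actually run. Without a real tracer registered, the global tracer is a no-op, so you can instrument first and wire up a backend (Jaeger, Zipkin, you name it) later.

package main

import (
  "context"
  "fmt"

  "github.com/opentracing/opentracing-go"
)

func storeImage(ctx context.Context, name string) {
  // Start a span as a child of whatever span is already in the context.
  span, ctx := opentracing.StartSpanFromContext(ctx, "store-image")
  defer span.Finish()

  span.SetTag("image.name", name)
  fmt.Println("storing", name) // the actual work would happen here
  _ = ctx                      // pass ctx on to downstream calls
}

func main() {
  // opentracing.SetGlobalTracer(...) would register a real tracer here.
  storeImage(context.Background(), "example")
}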

Culture

  • Don’t play whack-a-mole. Don’t jump to patching every single edge case. The system will change and edge cases may become obsolete. It might be a bad investment to throw engineering time at edge cases.
  • Say No to dark matter developers: No one should be developing in isolation or keeping knowledge only in their heads.
  • Do not ignore technical debt. At least track it.
  • Increase runtime transparency. Transparency is good for you. (It has electrolytes.)
  • Practice blameless post mortems (also obvious). Emphasis on blameless, here.
  • Share knowledge through pair programming and peer code reviews.
  • Run SRE hackathons for refactoring, or dedicated sprints. Reserve exclusive time to fix the things on the wishlist of SRE.

Update Sep 1st, 2018: Apparently, I got a couple of things wrong in my notes. Vanessa helped me out. With her permission, here’s a copy of a message I received:

Hi Dan! Thanks for attending and writing up a summary of my talk :) I really appreciate you taking the time.

I just wanted to help clarify a few points.

For the chaos engineering part, we don’t do this for business facing/trading systems. The example I referred to is for an internal facing infrastructure provisioning system. We reject a fixed percentage of provisioning orders in the production flow at random there.

For business facing/ trading systems we typically do fault injection and stress tests only. We’ve been doing this for a long time to ensure we have confidence in both system capacity and our business logic/controls (e.g. if market suddenly swings and generate 3x volume, or if someone fat fingered a trade and put in a completely wrong price or quantity.)

With regards to SecDb, the bot does actually remove the code too :) The human just has to approve the code review that the bot raise, more as a control/audit than anything else. If the code reviewer says yes, then the bot removes the code and push update automatically.

For JRM, I probably didn’t explain myself well enough on this one… the key point wasn’t so much the monitoring or what gets monitored, but the fact that the actual application monitoring is decoupled from application logic, and the monitoring config is also decoupled from the monitoring agent. ie. each of those three things can have different release cycles of each other. I will have a think on how to reword and express that better!

I really enjoyed the whole experience and I hope to see you at future SREcons again! :D

Best wishes, Vanessa

I apologize for the misunderstandings and hope they did not cause any trouble! Thank you very much Vanessa for taking the time to clarify things!

This shows how important it is to go and watch the video if one is interested in the whole story. Don’t ever trust my notes :) Even I don’t trust them fully.

Day Two: Know Your Kubernetes Deploys

As a retired infosec person, I do enjoy hearing about the progress the field is making, especially in the Kubernetes realm. We all know that Kubernetes is the new computing stack, right? Whatever your opinion on that, you might like Felix Glaser’s excellent talk about Shopify’s production security engineering efforts in deploying trusted images. Production Security Engineering at Shopify takes care of everything that happens below the application level. So we are talking Docker and containers here.

Felix argued that FROM foo:latest is the new curl | sudo sh. I could not agree more.

How to fix it, then?

Shopify has a gate service that decides which images are OK to pull and run. This service is called Kritis. Kritis is basically an admission controller for gating deployments to only use signed images. Rogue admins cannot deploy unsigned images anymore. Shopify wrote an attestor (that’s a term from the binauthz realm) called Voucher. Voucher runs a couple of checks on images before they are admitted into production.

This raises the question: what to do in emergency situations and incidents, when we quickly need to deploy and cannot wait for all the checks and reviews? It turns out one can still deploy if there is a special “break glass” annotation in the Kubernetes deployment. However, that immediately triggers a page to Shopify’s cloud security team. A security engineer then jumps in to help with root cause analysis. Or to defend against the attacker.
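
To make the mechanics a bit more concrete, here is a tiny Go sketch of the decision such a gate makes. The annotation name and the digest allowlist are hypothetical placeholders; Kritis and Voucher obviously do much more (actual signature verification, attestations, and so on).

package main

import (
  "fmt"
  "strings"
)

// Hypothetical annotation name, for illustration only.
const breakGlassAnnotation = "deploy.example.com/break-glass"

// admit decides whether an image may run: either the deployment carries the
// break-glass annotation (which would page security), or the image must be
// pinned by digest and have a known attestation.
func admit(image string, annotations map[string]string, attested map[string]bool) (bool, string) {
  if annotations[breakGlassAnnotation] == "true" {
    return true, "break-glass deploy, paging the security on-call"
  }
  if !strings.Contains(image, "@sha256:") {
    // Tags like :latest are exactly the problem described above.
    return false, "image is not pinned by digest"
  }
  if !attested[image] {
    return false, "no attestation found for this image"
  }
  return true, "image is signed and admitted"
}

func main() {
  attested := map[string]bool{"registry.example.com/app@sha256:abc123": true}
  ok, reason := admit("registry.example.com/app:latest", nil, attested)
  fmt.Println(ok, reason)
}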

Cool thing!

Day Two: How We Un-Scattered Our DNS Setup and Unlocked New Automation Options

This was the talk I was most excited about. Not sure why. However, I was so busy during the talk that I could not take any notes. But here is a long write-up of the talk. 😇

Day Two: Managing Misfortune for Best Results

We all know the Wheel of Misfortune, an exercise that Google runs with its SREs to keep their intuitions sharp. The Managing Misfortune for Best Results talk was about how to design and deliver those scenarios.

A couple of factors make a successful team. These include (in that order!):

  • Psychological safety
  • Dependability
  • Structure and clarity
  • Meaning. Very related to job satisfaction.
  • Impact. Delivering value for Google is interestingly the least important factor. Huh?

This matches what Google published at re:Work. Re:Work is, BTW, a highly recommended read! I love that page!

The goal is to deliver a high value training experience. A carefully calibrated stress load. Carefully, because we want teams to survive the training.

Regarding scenario selection: As a trainer, you have to order the learning path. Select scenarios covering recent study areas of the team. Calibrate difficulty with experience (not every team gets the same training).

For cross-team exercises: Google follows the IMAG protocol (Incident Management at Google). Until 2016, different teams had different incident management protocols. Some went straight to IRC, others to a shared doc. Different teams have different habits and culture in handling incidents. Account for that.

Monitoring bookmarks. If your monitoring system provides bookmarking functionality, make use of it. Instead of the dungeon master saying “red line is going up”, just link to a graph that represents that. Or share a screenshot. The more real it looks, the better.

Maintain playbooks of useful outages. Keep a list of outages and re-use it for each team member. This applies to conceptual outages, e.g. a bad binary or bad data hitting a server.

Tips for the training session: Someone should transcribe the session, basically keeping a log of the exercise. This helps in the debriefing because it provides data on what was done in response to what. The log should be shared after the session.

The talk was over quickly. But then, instead of a long Q&A, a volunteer got onto the stage and the speaker ran a fictional exercise with that person. Kudos to the volunteer, brave move! Have a look at the video once it is out. It was really interesting!

Day Two: Food!?

By the end of day two, the never-ending supply of food made me think: When did they stop serving us food and start feeding us? And why?

Day Three: Roundup

Another day I focussed on workshops. Later I had some other things to take care of and missed some of the talks or did not have time to take notes.

I remember the Delete This: Decommissioning Servers at Scale talk by Anirudh Ra from Facebook being very funny. I could feel the pain of having to drain machines in every single sentence. My colleagues and I had an awesome time listening to this talk. We may have our own story with machines not being drained in time. 😥

Conclusion

This time I had soooo many highly appreciated conversations that I almost forgot to take notes. We also had a production incident that I followed remotely to the extent possible. On top of that, I had some other things to take care of. Nevertheless, I learned a ton of new things and got to know more people from the community.

Thanks y’all and see you again in Brooklyn next year!

Special thanks fly out to Nora, my mentee and most critical spell checker. :)

Touching Production: Review and Change (Part 2)

Two weeks ago I wrote about touching production. I described how I prepared scripts and queries for a migration of image names. The images are stored in Cloud Storage and their object names are referred to in a relational database. I came up with three steps for the migration, all capable of being applied while the site continues to serve the images.

  • Copy old storage objects to new storage objects.
  • Update the table in the relational database to refer to the new name.
  • Remove old storage objects from Cloud Storage.

For the first and the third step I came up with shell scripts. Basically hundreds of thousands of lines calling gsutil, the command line utility for administering Cloud Storage. The second step was a file containing about 150k SQL UPDATE statements.

The Review

It is ~good~ required practice in my team that we review each other’s work. The systems we manage are incredibly complex, and every one of us has a different mental model of how they work. And then there is how the systems actually work. 🙃 Reviews are therefore essential to avoid the biggest disasters and keep things running smoothly.

Pushing a change of roughly a million lines through review needs good communication. It is not enough to just drop the files in a pull request and wait for someone to pick it up. So I explained to the reviewer how I came up with the files, what I believe the system looks like today, and how I would like it to look and behave tomorrow. This may be the most underappreciated part of conducting reviews: having a chance to synchronize mental models inside SRE and across teams. The commit message is often an executive summary of what has been done and what the overall goal of the change is. Pairing up and walking someone through my thought process, however, has not only been an extremely valuable feedback loop for me but has also led to better code in the end.

Back to the migration change: The reviewer came up with some additional test cases and together we developed a plan for applying the migration scripts. We also had an interesting discussion about whether or not the shell scripts are I/O bound.

The Shell Scripts: Trade-offs

The shell scripts each had ~450k lines calling gsutil. As far as I knew, gsutil has no batch mode. That’s why I had only two options:

  • Call gsutil, a thoroughly tested and trusted tool, again and again. This puts a lot of overhead on the kernel for spawning new processes and context switching between them.
  • Write a tool that repeatedly makes calls to the API, thus implementing the missing batch behavior (roughly as sketched below). This tool would need to be tested thoroughly before being ready for showtime in production.
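
For reference, such a tool might have looked roughly like this: read object names from standard input and delete them via the Cloud Storage client library with a small worker pool, replacing one gsutil process per object with a few reused connections. The bucket name and worker count are made up, and this is not code we ran; it just shows what the batch option would have cost in engineering and testing time.

package main

import (
  "bufio"
  "context"
  "log"
  "os"
  "sync"

  "cloud.google.com/go/storage"
)

func main() {
  ctx := context.Background()
  client, err := storage.NewClient(ctx)
  if err != nil {
    log.Fatal("storage client: ", err)
  }
  bucket := client.Bucket("my-bucket") // made-up bucket name

  names := make(chan string)
  var wg sync.WaitGroup
  for i := 0; i < 32; i++ { // 32 workers, an arbitrary choice
    wg.Add(1)
    go func() {
      defer wg.Done()
      for name := range names {
        if err := bucket.Object(name).Delete(ctx); err != nil {
          log.Println("delete", name, ":", err)
        }
      }
    }()
  }

  // Feed one object name per line from standard input to the workers.
  scanner := bufio.NewScanner(os.Stdin)
  for scanner.Scan() {
    names <- scanner.Text()
  }
  close(names)
  wg.Wait()
  if err := scanner.Err(); err != nil {
    log.Fatal("scan: ", err)
  }
}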

Our SRE team is small, which implies that engineering time is almost always more precious than computing time. That’s why I decided to rather spend some compute resources than invest another two or three hours into a custom tool that we would use only once. But how much compute are we talking about here? And what is the bottleneck when we run the scripts? My reviewer suggested it might be I/O bound because gsutil operations often take up to a second. Most of the time is spent waiting for Cloud Storage to return how the operation went. I was under the impression that whenever we were waiting for a call to return, we could schedule another process to do its magic (for example, starting up).

To find out I created an instance with 64 CPU cores and more than enough memory to fit the processes and data.

We’ll have a look at the step2-remove.sh script, but more or less the same applies for the other shell script, too.

The file’s content looked like this:

gsutil rm 'gs://my-bucket/dir1/dir2/hytinj'
gsutil rm 'gs://my-bucket/dir1/dir2/hytinj_b'
gsutil rm 'gs://my-bucket/dir1/dir2/hytinj_m'

In total, the file had 466,401 lines like that. To distribute the workload across all 64 cores I split the file into chunks of 7,288 lines each, that is, 466,401 divided by 64 and rounded up to account for the remainder.

$ split -l 7288 step2-remove.sh step2-remove-sharded.

That gave me 64 files of roughly the same length:

$ ls step2-remove-sharded.*
-rw-r--r--  1 danrl  staff   351K Aug  3 10:35 step2-remove-sharded.aa
-rw-r--r--  1 danrl  staff   351K Aug  3 10:35 step2-remove-sharded.ab
-rw-r--r--  1 danrl  staff   351K Aug  3 10:35 step2-remove-sharded.ac
✂️
-rw-r--r--  1 danrl  staff   351K Aug  3 10:35 step2-remove-sharded.cl

To run them in parallel I looped over them, sending each process to the background:

$ for FNAME in step2-remove-sharded.*; do sh $FNAME & done

Looking at htop and iftop, I had the feeling that the bottleneck really was the CPU here. The poor thing was context switching all the time between processes that were mostly waiting for I/O.

[Screenshot: htop]

As expected, memory and bandwidth usage was rather low. The instance had tens of Gigabytes of memory left unused and could have easily handled 10 GBit/s of network I/O.

[Screenshot: iftop]

In total, the shell scripts ran for three hours, costing us a little less than USD 5. That is orders of magnitude cheaper than any investment in engineering time. Sometimes a trade-off means not building the fancy solution and instead throwing compute or memory at a one-time problem.

The SQL Script: Managing Risk

The more interesting, because more delicate, part of the migration was running the SQL statements on the live production database. Relational databases are a piece of work… Not necessarily a distributed system designer’s dream but that’s another story.

When the reviewer and I deployed the SQL change, we gradually took on more risk as we proceeded. First, we started with a single statement that we knew affected only an image belonging to a well-known account.

After executing this single statement we ran some tests to see if everything worked as expected, including the caches. Since all tests were green, we went for ten statements. Then we tested again. We increased to 100 statements, then 1k statements, and finally settled on a chunk size of 10k statements for the rest of the migration.

This ramp-up of risk (every change carries some risk) is pretty common when we make changes to production. We like to be able to roll back small changes as early as possible so that only a few customers are affected. On the other hand, we like to get the job done eventually. We know that engineering time is precious and that we hate boring, repetitive work. We use this pattern of increasing by orders of magnitude all the time, from traffic management (e.g. 0.1% of users hitting a new release) to migrating storage objects or table rows.
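
To illustrate the pattern, here is a rough Go sketch of how such a ramp-up over a file of SQL statements could look. The DSN, the file name, and the interactive pause between chunks are illustrative assumptions, not our actual tooling.

package main

import (
  "bufio"
  "database/sql"
  "fmt"
  "log"
  "os"
  "strings"

  _ "github.com/go-sql-driver/mysql" // assumption: a MySQL-compatible database
)

func main() {
  db, err := sql.Open("mysql", "user:password@tcp(db.example.com:3306)/mydb") // made-up DSN
  if err != nil {
    log.Fatal("open db: ", err)
  }
  defer db.Close()

  file, err := os.Open("migration.sql")
  if err != nil {
    log.Fatal("open file: ", err)
  }
  defer file.Close()

  // Chunk sizes grow by an order of magnitude; after the last one we keep
  // going in chunks of 10k until the file is exhausted.
  chunkSizes := []int{1, 10, 100, 1000, 10000}
  chunk, applied := 0, 0
  pause := bufio.NewReader(os.Stdin)

  scanner := bufio.NewScanner(file)
  for scanner.Scan() {
    stmt := strings.TrimSuffix(strings.TrimSpace(scanner.Text()), ";")
    if stmt == "" {
      continue
    }
    if _, err := db.Exec(stmt); err != nil {
      log.Fatal("exec: ", err)
    }
    applied++
    limit := chunkSizes[len(chunkSizes)-1]
    if chunk < len(chunkSizes) {
      limit = chunkSizes[chunk]
    }
    if applied >= limit {
      fmt.Printf("applied %d statements, run your checks and press enter to continue\n", applied)
      pause.ReadString('\n')
      chunk++
      applied = 0
    }
  }
  if err := scanner.Err(); err != nil {
    log.Fatal("scan: ", err)
  }
}

The important part is the pause after each chunk: that is where the tests run and where a rollback is still cheap.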

Conclusion

With a hands-on approach and by making reasonable trade-offs, we were able to migrate the legacy image names unnoticed by our users. Once again we touched production without causing a disaster. As we say in my team whenever someone asks us what we do: We touch production, every day, all day long, and sometimes during the night.

Touching Production: What does that mean? (Part 1)

Sometimes people ask me what I do all day as an SRE. I usually reply that I touch production every day. But what does touching production even mean? Let me share the typical SRE task of preparation for touching production at eGym (the company I work for).

The Problem Statement

We have a product that allows users to create and upload images of a certain kind. Those images are stored in a bucket on Cloud Storage. The image name is a long randomized string (similar to a UUID). The image name is then referenced in a relational database table. At some point in the past, we used very short image names of up to six characters. When we began making image names part of the external API, we had to rename those legacy images: longer, better-randomized names are harder to predict and increase security in case an attacker starts guessing image names. Some images were still using the old names. My task was to migrate those legacy images to longer, hard-to-predict names while requests were coming in.

Assessing The Scale And Impact

The second thing I usually do is assess the scale and the expected impact of a task. (The first thing is always making sure I understood the problem correctly by talking to the people who issued the request or developed the systems I am about to touch.) The scale and the expected impact determine which tools I use and which approaches are feasible. Here I had to understand whether we were talking about a month-long migration of data while all systems continue to serve traffic, or whether we could apply some database changes in a single transaction and be done in a minute.

I queried the read replica of the production database to get the number of rows that host old-style image names (those with a length of six characters or less):

SELECT COUNT(id) FROM Image where imageType='MY_TYPE' and length(imageId) <= 6;

The result was something around 150k rows. That’s not much. This was a number that I could easily handle in memory on my local machine. From the problem statement, I knew that all new images had been given much longer, randomized names for a long time. So the dataset we were talking about was stable and not going to change between migration planning and the actual migration. A dynamic approach was therefore not needed.

Preparing Metadata Migration

To start development I wanted to have a handy copy of the dataset. I ran the select query again, but this time fetching every row and exporting into a CSV file:

SELECT id, imageId FROM Image where imageType='MY_TYPE' and length(imageId) <= 6;

I peeked into the resulting file to make sure I got the right thing:

$ head dataset.csv 
id,imageId
844365,hytinj
344614,hyt459
460974,hyt8is
834613,hytlf4
832009,hytmps
334627,hytug5
408177,hyt4c4
692956,hyt8u1
874342,hytb7g

I also wanted to make sure I got all the rows. So another sanity check was to count the lines of the CSV file:

$ wc -l < dataset.csv 
  155468

That looked good! Now I wanted to have a new image name, ideally a UUID, for every image. An easy way to do that is to just pipe the file through a custom program that does exactly that. My favorite language is currently Golang, so guess which language I wrote the tool in?

package main

import (
  "bufio"
  "fmt"
  "log"
  "os"
  "github.com/google/uuid" // assumed UUID package; any library with New().String() works
)
func main() {
  scanner := bufio.NewScanner(os.Stdin)
  for scanner.Scan() {
    fmt.Printf("%v,%v\n", scanner.Text(), uuid.New().String())
  }
  if err := scanner.Err(); err != nil {
    log.Fatal("scan: ", err)
  }
}

This program read from standard input and appended a generated UUID to each input line, something like ,1234-567-890. A line reading foo,bar on standard input becomes foo,bar,1234-567-890 on standard output. This allowed me to create a new CSV file based on the dataset.csv file.

tail -n+2 dataset.csv | go run main.go > dataset-new.csv

Hint: tail -n+2 skips the CSV header line.

Peeking into the output gave me this:

$ head -n 3 dataset-new.csv 
844365,hytinj,cd616cba-52dd-4b81-b358-ed5e5672ae4c
344614,hyt459,88d1debe-4e9e-4482-9c06-b656efadfd62
460974,hyt8is,981d9276-2e93-47b7-962a-4ad35edf995a

The file dataset-new.csv is now the source of truth for how the rows should look in the future. The only thing missing for the database part of this migration is a set of queries that we can apply. Sticking to my preference for small Golang tools, I modified the previously used program to look like this:

func main() {
  scanner := bufio.NewScanner(os.Stdin)
  for scanner.Scan() {
    // csv[0] is the database id, csv[2] the new UUID-based image name.
    // (This version additionally needs the "strings" import.)
    csv := strings.Split(scanner.Text(), ",")
    fmt.Printf("UPDATE `Image` SET `imageId`='%v' WHERE `id`='%v';\n",
      csv[2], csv[0])
  }
  if err := scanner.Err(); err != nil {
    log.Fatal("scan: ", err)
  }
}

This would create SQL queries based on the data in the CSV. I saved the queries in a file for later use:

$ go run main.go < dataset-new.csv > migration.sql

And then I ran the usual sanity checks:

$ wc -l < migration.sql
  155467
$ head -n 3 migration.sql
UPDATE `Image` SET `imageId`='cd616cba-52dd-4b81-b358-ed5e5672ae4c' WHERE `id`='844365';
UPDATE `Image` SET `imageId`='88d1debe-4e9e-4482-9c06-b656efadfd62' WHERE `id`='344614';
UPDATE `Image` SET `imageId`='981d9276-2e93-47b7-962a-4ad35edf995a' WHERE `id`='460974';

That was looking good! The queries for updating the image metadata table in the relational database were done. But the actual files still needed to be renamed for the references to remain valid.

Preparing Storage Object Migration

Preparing the storage object migration turned out to be a bit more complicated. We not only store the image binary data on Cloud Storage, we also store variations of the file. Those variations have an object name that follows a particular pattern. So for an image named foo we store at least three objects in the bucket:

  • foo: The original
  • foo_b: A variation of the original
  • foo_m: Another type of variation

These variations are present for all objects that I would potentially have to touch. From the documentation, I could also see that there might be yet another variation, foo_l. However, it was not clear whether those were still in the bucket or already deprecated. I had to find that out before I could continue.

I got myself the list of all items in the bucket using the gsutil command:

$ gsutil ls gs://my-bucket/dir1/dir2/ > objects.txt

That yielded a very long list of object paths:

$ head -n 3 objects.txt 
gs://my-bucket/dir1/dir2/<random string>
gs://my-bucket/dir1/dir2/<random string>_b
gs://my-bucket/dir1/dir2/<random string>_m

To skip non-variations I used grep matching on the underscore (which we use in variations only). I piped the result to sed to extract the variation types from the object paths:

$ grep '_' < objects.txt | sed -E 's/^(.*)_(.+)$/\2/'
b
m
b
...

I got a long list of variations. Way too many for a human to check by hand. Since I was only interested in the type of variations, not the number of variations, I used the popular dream team sort and uniq to minimize the dataset:

$ grep '_' < objects.txt |  sed -E 's/^(.*)_(.+)$/\2/' | sort | uniq
b
m

This is for sure not a very efficient way, but on a dataset as small as the one I was dealing with, the whole operation only took a couple of seconds. Luckily, the result showed that I only had to care about the b and m variations. These are the only ones in production currently. Cool!

One thing I had to keep in mind was that if I changed the image names in the relational database, I also had to change them at the same time on Cloud Storage. But there is no such thing as “at the same time” in computing. So I needed a migration strategy that ensured consistency at all times. The strategy was rather simple, though:

  • Copy all affected objects to their new names.
  • Run the database transaction.
  • Remove the old objects after a cool-down period (image names may be cached, we may want to roll back the transaction, you name it…).

I had the SQL queries already. The other two things that were missing were the bucket modifications. Since I wasn’t in a hurry, I chose to just generate a shell script that calls gsutil over and over again. Again, this is not a very efficient solution. In SRE, we choose efficiency over simplicity very selectively. As a rule of thumb, you could say: If it fits into memory, consider manipulating it there instead of introducing additional complexity.

Generating the migration scripts was as easy as changing a couple of lines in my little Golang helper program.

func main() {
  scanner := bufio.NewScanner(os.Stdin)
  for scanner.Scan() {
    // csv[1] is the old short image name, csv[2] the new UUID-based name.
    csv := strings.Split(scanner.Text(), ",")
    fmt.Printf("gsutil cp 'gs://my-bucket/dir1/dir2/%v'   "+
      "'gs://my-bucket/dir1/dir2/%v'\n",
      csv[1], csv[2])
    fmt.Printf("gsutil cp 'gs://my-bucket/dir1/dir2/%v_b' "+
      "'gs://my-bucket/dir1/dir2/%v_b'\n",
      csv[1], csv[2])
    fmt.Printf("gsutil cp 'gs://my-bucket/dir1/dir2/%v_m' "+
      "'gs://my-bucket/dir1/dir2/%v_m'\n",
      csv[1], csv[2])
  }
  if err := scanner.Err(); err != nil {
    log.Fatal("scan: ", err)
  }
}

I ran the program to generate a shell script.

$ go run main.go < dataset-new.csv > step1-copy.sh
$ head -n 3 step1-copy.sh
gsutil cp 'gs://my-bucket/dir1/dir2/hytinj'   'gs://my-bucket/dir1/dir2/cd616cba-52dd-4b81-b358-ed5e5672ae4c'
gsutil cp 'gs://my-bucket/dir1/dir2/hytinj_b' 'gs://my-bucket/dir1/dir2/cd616cba-52dd-4b81-b358-ed5e5672ae4c_b'
gsutil cp 'gs://my-bucket/dir1/dir2/hytinj_m' 'gs://my-bucket/dir1/dir2/cd616cba-52dd-4b81-b358-ed5e5672ae4c_m'

This script can be run from the shell of a maintenance host with access to the production data. I needed the same for the deletion step. At this point you can probably predict what the code will be:

func main() {
  scanner := bufio.NewScanner(os.Stdin)
  for scanner.Scan() {
    // csv[1] is the old short image name; only the old objects get removed.
    csv := strings.Split(scanner.Text(), ",")
    fmt.Printf("gsutil rm 'gs://my-bucket/dir1/dir2/%v'\n", csv[1])
    fmt.Printf("gsutil rm 'gs://my-bucket/dir1/dir2/%v_b'\n", csv[1])
    fmt.Printf("gsutil rm 'gs://my-bucket/dir1/dir2/%v_m'\n", csv[1])
  }
  if err := scanner.Err(); err != nil {
    log.Fatal("scan: ", err)
  }
}

I’ll spare you the output. It is a list of shell commands of the form gsutil rm <object>.

Due Diligence

Humans sometimes make mistakes. Humans who work on automation or script migrations sometimes create disasters. To avoid disasters (or at least the obvious ones), every piece of code that changes production has to go through a review process on my team. I submitted the files step1-copy.sh, migration.sql, and step2-remove.sh for review and can’t wait to see what mistakes my fellow engineers will find. They are the best at spotting those. 🧐 Only after scripts and transactions have been reviewed do we actually touch production.

I hope you enjoyed this little peek into how one of the many forms of touching production is prepared.