The Machine That Hires Me

Have you ever read Ibrahim Diallo's famous, scary, and funny blog post The Machine Fired Me? Ibrahim, working as a software developer, accidentally got fired. Thanks to a fully automated business process, his key card, used for physically accessing the facilities, stopped working, various accounts for all kinds of work-related systems were disabled, and he did not receive pay for three weeks. The automation was so powerful that he had to be re-hired to get back into the system. There was no stopping the machinery.

I recently had (or am I still having?) a similar experience. Similar in that I am also caught in some kind of machinery and the process seems unstoppable. Different in that I am not being fired but being hired by a machine.

Earlier this year I was contacted by a recruiter from Facebook on LinkedIn. We started chatting and eventually, I agreed to apply for a Production Engineering role. I had a couple of phone interviews. Then I was invited to London for a day of on-site interviews. I was extended an offer, which I would eventually turn down. All that was a very pleasant experience and I admire Facebook for their professional recruiting process. I genuinely had a lot of fun solving the challenges and interacting with recruiting and engineering. It seems, however, that somewhere in this process the machines took over. While the recruiter and I agreed to end our journey at some point and to keep in touch, the machinery had different plans.

After turning down the offer I still had access to the digital contract signing interface for some days. Furthermore, the onboarding portal suggested I decide on my preferred hardware, including a laptop computer and phone. I received a parcel containing a printed guidebook for new Londoners and a Facebook-branded blanket. That blanket! It is so fluffy!


Then another message arrived reading “Congratulations on your new role with Facebook!” and informing me about my upcoming business travel. For the latter, I was asked to apply for a U.S. visa or the ESTA visa waiver program. Most of this happened within a couple of days. Out of curiosity, I peeked into the emails and websites I got sent, but I did not interact further with them. I informed my recruiter so that they would know, just in case any harm was done. But one does not simply ignore the machinery! A couple of days later the automation poked me again: “We are very excited for you to join the team. It looks like we’re still missing some of your information. Please navigate to the People Portal to complete your outstanding tasks right away.” Let me translate this: “Human! It’s me, the machinery. You are supposed to obey. Do so now.” Even before I could let my recruiter know about the latest developments, they proactively sent me a message apologizing for the repeated interaction. While the machinery at Facebook seems unstoppable, the humans there are great and caring!

At this point, I thought it was over. Essentially, some mail triggers had gone off when they shouldn't have. Not a big deal, right? I was wrong. A month later I received an email from Altair Global, a relocation services provider. There was no reference to Facebook in the mail, so I mistakenly related it to a different opportunity and clicked the link. A few seconds later I had an account with Altair Global asking me to complete a bunch of tasks for my upcoming move to London. Wait, what? I am moving to London? Oh! This must be the machinery that won't stop hiring me. And yes, looking at the dashboard page of my unwanted relocation, I was able to spot the Facebook logo. 🧐

It's the machinery again. I contacted my assigned Altair relocation consultant and asked them to maybe check with their customer Facebook whether this relocation is still something they want to pay for. Time is running out on some of the tasks. I am afraid the machinery will notice and poke me again for being a bad human. Forgive me, oh great automation overlord, for I am just flesh and blood! 🤖😰

Unlike Ibrahim, who got an unsolicited lesson in job security, I came to no harm. Even better, I received gifts and got interesting insights into business process automation.

To be continued…?

SREcon Europe (Report)

What an exhausting conference! So many learnings, valuable conversations, and interesting workshops.

Day One: Roundup

We all started the conference together, in the large ballroom. I learned from Theo Schlossnagle and Heinrich Hartmann that data ingestion at Circonus never stops and that they had to apply impressive engineering to handle the massive load that billions of time series produce.

Afterwards, Corey Quinn and John Looney entered the stage with their Song of Ice and TireFire. I'd rather not spoil this one for you. Suffice it to say, we had many laughs! It is a relaxing, popcorn-style must-see talk.

The third talk of the day, delivered by Simon McGarr, was about GDPR. Significantly less laughter in that one. Oops! The smell of metaphorical dead bodies filled the room. I can recommend the talk, and if even half of it is applicable to your company or product, you won't be laughing for a while. Phew. I am still undecided about what to think of GDPR in general.

After the opening talks, I spent most of day one in workshops. Since I missed it in Santa Clara earlier this year, I joined the How to design a distributed system in 3 hours workshop held by Google folks. The workshop included an exercise in Non-abstract Large-scale System Design (see chapter 12 of The Site Reliability Workbook). This is where my SRE flash cards came in handy. I use them to stay in the game of system design because I have a hard time remembering all the numbers.

Day Two: Dealing with Dark Debt: Lessons Learnt at Goldman Sachs

I met Vanessa Yiu at the speakers' reception on Monday. She was, like me, very excited because it was her first speaking engagement in the SRE community. Her talk was perfectly delivered and the slides were exceptionally good. I felt very happy for her, because she clearly had an awesome debut in the community. The talk itself surprised me, though.

One out of three employees at Goldman Sachs is an engineer, including SRE. Woot? Yes, Goldman Sachs is a very technical company. I did not know that. On Dark Debt: In contrast to tech debt, dark debt is not recognizable at the time of creation. So we probably won't have a tech debt tracking ticket for it on our board. Dark debt is the unforeseen tech debt, if you will. The name is derived from dark matter. Dark matter has effects on its environment, but one cannot see it (because it neither emits nor reflects light). Similarly, dark debt interacts with the hardware and software of a distributed system in unforeseen ways.

Vanessa told us about tactics to manage Dark Debt:

  • Prevention
  • Insight into the environment
  • Detection
  • Diagnose
  • Culture

I took some notes on each.

Prevention

Build sustainable ecosystems. Easier said than done, right? Inhibit the creation of tech debt to begin with. Goldman Sachs has a thingy called SecDb, a central risk management platform. Basically an object-oriented dependency graph with its own IDE and its own securities programming language called Slang, used for risk assessment of financial products. A lot of that sounded awesome, but I happen to know that programmers do not necessarily love Slang. A source who does not want to be named told me that newbies are often given Slang tickets and thereby gain mostly non-transferable skills. The same source told me that SecDb is not loved by everyone and that the change process is painfully slow. Please take this with a grain of salt, as I cannot verify any of that. 🙈🙊🐵

Back to the topic: SecDb was created 25 years ago and is still evolving on a daily basis. Vanessa said SecDb is basically two weeks old. How did they do that? The development process was very transparent and collaborative from the beginning. Every developer can see the entire source code. Everyone can execute the whole thing locally. Everyone can fix bugs and develop shared tools around SecDb. There is a strong focus on re-use. There is also a form of gamification to improve the code base: developers get points for adding or removing code. However, more points are awarded for removing code.

And get this: Bots go through the code and remove parts that have not been used for a while. Accidentally triggering unused code is such a risk in the securities field that they automated removing it. I wondered how that works in practice. 🤨 From the Q&A: the bot flags the code and a human then does the actual work. I got that wrong! See Vanessa's message below: the bot does remove the code! How cool is that?

Insight

With proper insight into the production environment, dark debt is easier to detect. I took a lot of notes here, but when I went over the notes, I figured that this is more or less just a monitoring/instrumentation framework that you can find in every other large organisation. Goldman Sachs is a Java shop, by the way. So think central collection of JRE metrics.

Diagnose

Controlled chaos is good for you. Goldman Sachs does fault injection and chaos engineering. Of course, not everywhere. However, new systems do get chaos engineered from the very beginning. For example, in production (see clarification below) there is always a number of orders (e.g. buy/sell orders) that are rejected at random to make sure the retry logic in all clients is in good shape. Wondering how that works with time-critical stock orders? Microsecond trading, anyone? I guess I have no idea what exactly Goldman Sachs does, to be honest.

Visualise

Use tracing! Vanessa presented a couple of traces that nicely showed how visualisation can help spot problems much better than any log file. Can confirm. We recently added tracing to our most important artifacts and gained a ton of new insights from that alone. We found bugs before they hit a significant number of users. I absolutely agree: Use tracing. Start now.

Culture

  • Don't play whack-a-mole. Don't jump to patching every single edge case. The system will change and edge cases may become obsolete; it might be a bad investment to throw engineering time at them.
  • Say no to dark matter developers: no one should develop in isolation or keep knowledge only in their heads.
  • Do not ignore technical debt. At least track it.
  • Increase runtime transparency. Transparency is good for you. (It has electrolytes.)
  • Practice blameless post mortems (also obvious). Emphasis on blameless, here.
  • Share knowledge through pair programming and peer code reviews.
  • Hold SRE hackathons for refactoring, or dedicated sprints. Reserve exclusive time to fix the things on SRE's wishlist.

Update Sep 1st, 2018: Apparently, I got a couple of things wrong in my notes. Vanessa helped me out. With her permission, here’s a copy of a message I received:

Hi Dan! Thanks for attending and writing up a summary of my talk :) I really appreciate you taking the time.

I just wanted to help clarify a few points.

For the chaos engineering part, we don’t do this for business facing/trading systems. The example I referred to is for an internal facing infrastructure provisioning system. We reject a fixed percentage of provisioning orders in the production flow at random there.

For business facing/ trading systems we typically do fault injection and stress tests only. We’ve been doing this for a long time to ensure we have confidence in both system capacity and our business logic/controls (e.g. if market suddenly swings and generate 3x volume, or if someone fat fingered a trade and put in a completely wrong price or quantity.)

With regards to SecDb, the bot does actually remove the code too :) The human just has to approve the code review that the bot raise, more as a control/audit than anything else. If the code reviewer says yes, then the bot removes the code and push update automatically.

For JRM, I probably didn’t explain myself well enough on this one… the key point wasn’t so much the monitoring or what gets monitored, but the fact that the actual application monitoring is decoupled from application logic, and the monitoring config is also decoupled from the monitoring agent. ie. each of those three things can have different release cycles of each other. I will have a think on how to reword and express that better!

I really enjoyed the whole experience and I hope to see you at future SREcons again! :D

Best wishes, Vanessa

I apologize for the misunderstandings and hope they did not cause any trouble! Thank you very much Vanessa for taking the time to clarify things!

This shows how important it is to go and watch the video if one is interested in the whole story. Don’t ever trust my notes :) Even I don’t trust them fully.

Day Two: Know Your Kubernetes Deploys

As a retired infosec person, I do enjoy hearing about the progress the field is making, especially in the Kubernetes realm. We all know that Kubernetes is the new computing stack, right? Whatever your opinion on that, you might like Felix Glaser's excellent talk about Shopify's production security engineering efforts in deploying trusted images. Production Security Engineering at Shopify takes care of everything that happens below the application level. So we are talking Docker and containers here.

Felix argued that FROM foo:latest is the new curl | sudo sh. I could not agree more.

How to fix it, then?

Shopify has a gating service that decides which images are OK to pull and run. This service is called Kritis. Kritis is basically an admission controller that gates deployments to only use signed images. Rogue admins cannot deploy unsigned images anymore. Shopify also wrote an attestor (that's a term from the binauthz realm) called Voucher. Voucher runs a couple of checks on images before they are admitted into production.

This raises the question: what to do in emergency situations and incidents, when we need to deploy quickly and cannot wait for all the checks and reviews? It turns out one can still deploy if there is a special “break glass” annotation in the Kubernetes deployment. However, that immediately triggers a page to Shopify's cloud security team. A security engineer then jumps in to help with root cause analysis, or to defend against an attacker.
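I did not note down the exact annotation Shopify uses, so the following is only a shape sketch; the annotation key and deployment name are placeholders, not their real configuration:

# Placeholder annotation key; the real one is defined by the Kritis/Voucher setup.
# The deploy goes through, but the cloud security team gets paged immediately.
$ kubectl annotate deployment my-app example.com/break-glass="true"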

Cool thing!
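Coming back to FROM foo:latest for a moment: even without a gate like Kritis, pinning base images by digest takes away much of the sting, because a rebuild can no longer silently pick up a different image. A minimal sketch with stock Docker tooling (foo is the placeholder image from the slide, not a real one):

# Resolve the digest currently behind the tag, then pin the Dockerfile to it.
$ docker pull foo:latest
$ docker inspect --format '{{index .RepoDigests 0}}' foo:latest
foo@sha256:…

In the Dockerfile, FROM foo@sha256:… then replaces FROM foo:latest.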

Day Two: How We Un-Scattered Our DNS Setup and Unlocked New Automation Options

This was the talk I was most excited about. Not sure why. However, I was so busy during the talk that I could not take any notes. But here is a long write-up of the talk. 😇

Day Two: Managing Misfortune for Best Results

We all know the Wheel Of Misfortune, an exercise Google runs with its SREs to keep their intuition sharp. The Managing Misfortune for Best Results talk was about how to design and deliver those scenarios.

A couple of factors make a successful team. These include (in this order!):

  • Psychological safety
  • Dependability
  • Structure and clarity
  • Meaning. Closely related to job satisfaction.
  • Impact. Interestingly, delivering value for Google is the least important factor. Huh?

This matches what Google published at re:Work. Re:Work is, BTW, a highly recommended read! I love that page!

The goal is to deliver a high-value training experience with a carefully calibrated stress load. Carefully calibrated, because we want teams to survive the training.

Regarding scenario selection: As a trainer, you have to order the learning path. Select scenarios covering the team's recent study areas. Calibrate difficulty with experience (not every team gets the same training).

For cross-team exercises, Google follows the IMAG protocol (Incident Management at Google). Until 2016, different teams had different incident management protocols. Some went straight to IRC, others to a shared doc. Different teams have different habits and cultures in handling incidents. Account for that.

Monitoring bookmarks: If your monitoring system provides bookmarking functionality, make use of it. Instead of the dungeon master saying “the red line is going up”, just link to a graph that shows it, or share a screenshot. The more real it looks, the better.

Maintain playbooks of useful outages. Keep a list of outages and re-use them for each team member. This applies to conceptual outages, e.g. a bad binary or bad data hitting a server.

Tips for the training session: Someone should transcribe the session, basically keeping a log of the exercise. This helps in the debriefing because it provides data on what was done in response to what. The log should be shared after the session.

The talk was over quickly. But then, instead of a long Q&A, a volunteer got onto the stage and the speaker ran a fictional exercise with that person. Kudos to that person, brave move! Have a look at the video once it is out. It was really interesting!

Day Two: Food!?

By the end of day two, the never-ending supply of food made me think: When did they stop serving us food and begin feeding us? And why?

Day Three: Roundup

Another day that I focussed on workshops. Later I had some other things to take care of and missed some of the talks or did not have time to take notes.

I remember the Delete This: Decommissioning Servers at Scale talk by Anirudh Ra from Facebook being very funny. I could feel the pain of having to drain machines in every single sentence. My colleagues and I had an awesome time listening to this talk. We may have our own story with machines not being drained in time. 😥

Conclusion

This time I had soooo many highly appreciated conversations that I almost forgot to take notes. We also had a production incident that I followed remotely to the extent possible. On top of that, I had some other things to take care of. Nevertheless, I learned a ton of new things and got to know more people from the community.

Thanks y’all and see you again in Brooklyn next year!

Special thanks fly out to Nora, my mentee and most critical spell checker. :)

Touching Production: Review and Change (Part 2)

Two weeks ago I wrote about touching production. I described how I prepared scripts and queries for a migration of image names. The images are stored in Cloud Storage and their object names are referred to in a relational database. I came up with three steps for the migration, all capable of being applied while the site continues to serve the images.

  • Copy old storage objects to new storage objects.
  • Update the table in the relational database to refer to the new name.
  • Remove old storage objects from Cloud Storage.

For the copy and removal steps, I came up with shell scripts. Basically hundreds of thousands of lines calling gsutil, the command line utility to administer Cloud Storage. The database update step was a file containing about 150k SQL UPDATE statements.
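For illustration only: a copy script like that boils down to lines of gsutil cp with a source and a destination each, mirroring the removal script shown further below. The new object names here are placeholders, not the real ones:

gsutil cp 'gs://my-bucket/dir1/dir2/hytinj' 'gs://my-bucket/dir1/dir2/new-name'
gsutil cp 'gs://my-bucket/dir1/dir2/hytinj_b' 'gs://my-bucket/dir1/dir2/new-name_b'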

The Review

It is ~good~ required practice in my team that we review each other's work. The systems we manage are incredibly complex and every one of us has a different mental model of how our systems work. And then there is how the systems actually work. 🙃 Reviews are therefore essential to avoid the biggest disasters and keep things running smoothly.

Pushing a change of roughly a million lines through review needs good communication. It is not enough to just drop the files in a pull request and wait for someone to pick it up. So I explained to the reviewer how I came up with the files: what I believe the system looks like today and how I would like it to look and behave tomorrow. This may be the most underappreciated part of conducting reviews: having a chance to synchronize mental models inside SRE and across teams. The commit message is often an executive summary of what has been done and what the overall goal of the change is. Pairing up and walking someone through my thought process, however, has not only been an extremely valuable feedback loop for me but has also led to better code in the end.

Back to the migration change: The reviewer came up with some additional test cases and together we developed a plan for applying the migration scripts. We also had an interesting discussion about whether or not the shell scripts are I/O bound.

The Shell Scripts: Trade-offs

The shell scripts each had ~450k lines calling gsutil. As far as I knew, gsutil had no batch mode. That's why I had only two options:

  • Call gsutil, a thoroughly tested and trusted tool, again and again. This puts a lot of overhead on the kernel for spawning new processes and context switching between them.
  • Write a tool that repeatedly makes calls to the API, thus implementing the missing batch behavior. This tool would need to be tested thoroughly before being ready for showtime in production.

Our SRE team is small, which implies that engineering time is almost always more precious than computing time. That's why I decided to spend some compute resources rather than invest another two or three hours in a custom tool that we would use only once. But how much compute are we talking about here? And what is the bottleneck when we run the scripts? My reviewer suggested the scripts might be I/O bound because gsutil operations often take up to a second. Most of the time is spent waiting for Cloud Storage to return how the operation went. I was under the impression that whenever we would wait for a call to return, we could schedule another process to do its magic (for example, starting up).

To find out, I created an instance with 64 CPU cores and more than enough memory to fit the processes and data.

We'll have a look at the step2-remove.sh script, but more or less the same applies to the other shell script, too.

The file’s content looked like this:

gsutil rm 'gs://my-bucket/dir1/dir2/hytinj'
gsutil rm 'gs://my-bucket/dir1/dir2/hytinj_b'
gsutil rm 'gs://my-bucket/dir1/dir2/hytinj_m'

In total, the file had 466,401 lines like that. To distribute the workload across all 64 cores, I split the file into chunks of 7,288 lines each, that is, 466,401 divided by 64, rounded down, plus one to make up for the remainder.
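For the record, that shard size is just a ceiling division, which the shell can compute directly:

$ echo $(( (466401 + 63) / 64 ))
7288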

$ split -l 7288 step2-remove.sh step2-remove-sharded.

That gave me 64 files of roughly the same length:

$ ls -l step2-remove-sharded.*
-rw-r--r--  1 danrl  staff   351K Aug  3 10:35 step2-remove-sharded.aa
-rw-r--r--  1 danrl  staff   351K Aug  3 10:35 step2-remove-sharded.ab
-rw-r--r--  1 danrl  staff   351K Aug  3 10:35 step2-remove-sharded.ac
✂️
-rw-r--r--  1 danrl  staff   351K Aug  3 10:35 step2-remove-sharded.cl

To run them in parallel, I looped over the shards and sent the processes to the background:

$ for FNAME in step2-remove-sharded.*; do sh $FNAME & done
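Purely as a sketch of an alternative, the same fan-out can also be done with the file name quoted and an explicit wait, or by handing the parallelism to xargs:

$ for FNAME in step2-remove-sharded.*; do sh "$FNAME" & done; wait
$ ls step2-remove-sharded.* | xargs -n 1 -P 64 sh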

Looking at htop and iftop, I had the feeling that the bottleneck really was the CPU here. The poor thing was context switching all the time between processes that were desperately waiting for I/O.

[Screenshot: htop on the 64-core instance]

As expected, memory and bandwidth usage were rather low. The instance had tens of gigabytes of memory left unused and could have easily handled 10 Gbit/s of network I/O.

[Screenshot: iftop on the 64-core instance]

In total, the shell scripts ran for three hours, costing us a little less than USD 5. That is orders of magnitude cheaper than any investment in engineering time. Sometimes a trade-off means that we don't build the fancy solution but rather throw compute or memory at a one-time problem.

The SQL Script: Managing Risk

The more interesting, because more delicate, part of the migration was running the SQL statements on the live production database. Relational databases are a piece of work… not necessarily a distributed system designer's dream, but that's another story.

When the reviewer and I deployed the SQL change, we gradually took on more risk as we proceeded. First, we started with a single statement that we knew affected only an image belonging to a well-known account.

After executing this single statement, we ran some tests to see if everything worked as expected, including the caches. Since all tests were green, we went for ten statements. Then we tested again. We increased to 100 statements, then 1k, and finally settled on a chunk size of 10k statements for the rest of the migration.
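Mechanically, a ramp-up like this can be as simple as feeding growing slices of the statement file to the database client between test runs. A hedged sketch, assuming one UPDATE per line; the client invocation, database name, and file name are placeholders, not our actual setup:

$ sed -n '1p' update.sql | mysql imagedb        # 1 statement, the well-known account
$ sed -n '2,11p' update.sql | mysql imagedb     # next 10 statements, then test again
$ sed -n '12,111p' update.sql | mysql imagedb   # next 100, then 1k, then 10k chunks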

This ramp-up of risk (every change carries some risk) is pretty common when we make changes to production. We like to be able to roll back small changes as early as possible so that they affect only a few customers. On the other hand, we like to get the job done eventually. We know that engineering time is precious and that we hate boring, repetitive work. We use this pattern of increasing by orders of magnitude all the time, from traffic management (e.g. 0.1% of users hitting a new release) to migrating storage objects or table rows.

Conclusion

With a hands-on approach and by making reasonable trade-offs, we were able to migrate the legacy image names unnoticed by our users. Once again we touched production without causing a disaster. As we say in my team whenever someone asks us what we do: We touch production, every day, all day long, and sometimes during the night.