SREcon Europe (Report)

What an exhausting conference! So many learnings, valuable conversations, and interesting workshops.

Day One: Roundup

We all started the conference together, in the large ballroom. I learned from Theo Schlossnagle and Heinrich Hartmann that data ingestion at Circonus never stops and that they had to apply impressive engineering to handle the massive load that billions of time series produce.

Afterwards, Corey Quinn and John Looney entered the stage with their Song of Ice and TireFire. I’d rather not spoil this one for you. Suffice it to say, we had many laughs! It is a relaxing, popcorn-type must-see talk.

The third talk of the day, delivered by Simon McGarr, was about GDPR. Significantly less laughter in that one. Oops! The smell of metaphorical dead bodies filled the room. I can recommend the talk, and if even half of it applies to your company or product, you won’t be laughing for a while to come. Phew. I am still undecided about what to think of GDPR in general.

After the opening talks, I spent most of day one in workshops. Since I missed it in Santa Clara earlier this year, I joined the How to design a distributed system in 3 hours workshop held by Google folks. The workshop included an exercise in Non-Abstract Large-Scale System Design (see chapter 12 of The Site Reliability Workbook). This is where my SRE flash cards came in handy; I use them to stay in the game of system design because I have a hard time remembering all the numbers.
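
To give you a feel for the napkin math involved in that kind of exercise, here is a tiny Python sketch. All workload numbers are made up for illustration; it is the style of estimate that matters, not the figures.

    # Back-of-envelope estimate in the NALSD spirit: how much storage and how many
    # machines would a hypothetical photo-upload service need? All workload numbers
    # below are made up for illustration.

    DAILY_UPLOADS = 10_000_000            # hypothetical: 10 million photos per day
    AVG_PHOTO_BYTES = 2 * 1024**2         # assume 2 MiB per photo
    RETENTION_DAYS = 365
    REPLICATION_FACTOR = 3
    DISK_PER_MACHINE_BYTES = 8 * 1024**4  # 8 TiB of disk per machine

    raw_bytes = DAILY_UPLOADS * AVG_PHOTO_BYTES * RETENTION_DAYS
    total_bytes = raw_bytes * REPLICATION_FACTOR
    machines_for_disk = total_bytes / DISK_PER_MACHINE_BYTES
    writes_per_second = DAILY_UPLOADS / 86_400

    print(f"raw data:         {raw_bytes / 1024**5:.1f} PiB")
    print(f"with replication: {total_bytes / 1024**5:.1f} PiB")
    print(f"machines (disk):  {machines_for_disk:.0f}")
    print(f"average writes/s: {writes_per_second:.0f}")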

Day Two: Dealing with Dark Debt: Lessons Learnt at Goldman Sachs

I met Vanessa Yiu at the speakers’ reception on Monday. She was, like me, very excited because it was her first speaking engagement in the SRE community. Her talk was perfectly delivered and the slides were exceptionally good. I felt very happy for her, because she clearly had an awesome debut in the community. The talk itself surprised me, though.

One out of three employees at Goldman Sachs is an engineer, SREs included. Woot? Yes, Goldman Sachs is a very technical company. I did not know that. On dark debt: in contrast to tech debt, dark debt is not recognizable at the time of creation. So we probably won’t have a tech debt tracking ticket for it on our board. Dark debt is unforeseen tech debt, if you will. The name is derived from dark matter: dark matter affects its environment, but we cannot see it because it neither emits nor reflects light. Similarly, dark debt interacts with the hardware and software of a distributed system in unforeseen ways.

Vanessa told us about tactics to manage Dark Debt:

  • Prevention
  • Insight into the environment
  • Detection
  • Diagnose
  • Culture

I took some notes on each.

Prevention

Build sustainable ecosystems. Easier said than done, right? Inhibit the creation of tech debt to begin with. Goldman Sachs has a thingy called SecDb, a central risk management platform: basically an object-oriented dependency graph with its own IDE and its own securities programming language called Slang, used for risk assessment of financial products. A lot of that sounded awesome, but I happen to know that programmers do not necessarily love Slang. A source that does not want to be named told me that newbies are often given Slang tickets and mostly gain non-transferable skills from them. The same source told me that SecDb is not loved by everyone and that the change process is painfully slow. Please take this with a grain of salt, as I cannot verify any of it. 🙈🙊🐵

Back to topic: SecDb was created 25 years ago and is still evolving on a daily basis. Vanessa said SecDb is basically two weeks old. How did they do that? The development process has been transparent and collaborative from the beginning. Every developer can see the entire source code. Everyone can execute the whole thing locally. Everyone can fix bugs and develop shared tools around SecDb. There is a strong focus on re-use. There is also a form of gamification to improve the code base: developers get points for adding or removing code, but more points are awarded for removing code.

And get this: bots go through the code and remove parts that have not been used for a while. Triggering unused code is such a risk in the securities field that they automated removing it. Wondering how that works in practice. 🤨 From Q&A: the bot flags the code and a human then does the actual work. I got that wrong! See Vanessa’s message below. The bot does remove the code! How cool is that?
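
I have no idea what the real SecDb bot looks like, but conceptually I imagine the flow roughly like this toy Python sketch (every name, threshold, and API in it is my own invention):

    # Toy sketch of a "dead code reaper" bot, loosely inspired by the SecDb story.
    # Every name, threshold, and API here is hypothetical.
    import datetime

    STALE_AFTER = datetime.timedelta(days=180)

    def find_stale_functions(call_log):
        """call_log maps function name -> datetime of the last observed call."""
        now = datetime.datetime.now()
        return [fn for fn, last_called in call_log.items()
                if now - last_called > STALE_AFTER]

    def reap(call_log, open_review, delete_code):
        """Flag stale code, open a review, and remove it only after human approval."""
        for fn in find_stale_functions(call_log):
            review = open_review(f"Remove unused function {fn}")
            if review.approved:        # the human acts as a control/audit step
                delete_code(fn)        # the bot itself removes the code

    # Tiny demo with fake collaborators:
    class FakeReview:
        approved = True

    call_log = {"price_legacy_model": datetime.datetime(2017, 1, 1),
                "price_current_model": datetime.datetime.now()}
    reap(call_log,
         open_review=lambda title: FakeReview(),
         delete_code=lambda fn: print(f"removed {fn}"))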

Insight

With proper insight into the production environment, dark debt is easier to detect. I took a lot of notes here, but when I went over them, I figured that this is more or less a monitoring/instrumentation framework like the ones you can find in every other large organisation (see Vanessa’s clarification below, though). Goldman Sachs is a Java shop, by the way. So think central collection of JRE metrics.

Diagnose

Controlled chaos is good for you. Goldman Sachs does fault injection and chaos engineering. Of course, not everywhere. However, new systems do get chaos engineered from the very beginning. For example, in production (see clarification below) a certain number of orders (e.g. buy/sell orders) is always rejected at random to make sure the retry logic in all clients is in good shape. Wondering how that works with time-critical stock orders? Microsecond trading, anyone? I guess I have no idea what exactly Goldman Sachs does, to be honest.
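
To make the idea concrete, here is a minimal Python sketch of random rejection plus client-side retry. The rate, the names, and the backoff policy are my own assumptions, not anything Goldman Sachs confirmed:

    # Minimal sketch of "reject a fixed percentage of orders at random" to keep
    # client retry logic honest. The rate, names, and retry policy are my own
    # assumptions, not how Goldman Sachs actually does it.
    import random
    import time

    CHAOS_REJECT_RATE = 0.01  # reject 1% of requests at random

    class RandomlyRejected(Exception):
        pass

    def handle_order(order):
        """Server side: occasionally reject an otherwise fine order."""
        if random.random() < CHAOS_REJECT_RATE:
            raise RandomlyRejected("injected failure, please retry")
        return f"accepted {order}"

    def submit_with_retry(order, attempts=3, backoff_s=0.1):
        """Client side: the retry loop that the injected failures exercise."""
        for attempt in range(attempts):
            try:
                return handle_order(order)
            except RandomlyRejected:
                time.sleep(backoff_s * 2 ** attempt)  # exponential backoff
        raise RuntimeError(f"giving up on {order} after {attempts} attempts")

    print(submit_with_retry("order-42"))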

Visualise

Use tracing! Vanessa presented a couple of traces that nicely showed how visualisation can help spot problems much better than any log file. Can confirm. We recently added tracing to our most important artifacts and gained a ton of new insights from that alone. Among other things, we found bugs before they hit a significant number of users. I absolutely agree: use tracing. Start now.
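
If you want a starting point, a minimal sketch with the OpenTelemetry Python SDK could look like this (requires the opentelemetry-sdk package; the span and attribute names are made up and nothing here is specific to Vanessa’s setup):

    # Minimal tracing sketch using the OpenTelemetry Python SDK
    # (pip install opentelemetry-sdk). Span and attribute names are made up.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

    # Export spans to stdout so the example is self-contained.
    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(
        SimpleSpanProcessor(ConsoleSpanExporter())
    )
    tracer = trace.get_tracer(__name__)

    def handle_request(user_id: str) -> None:
        with tracer.start_as_current_span("handle_request") as span:
            span.set_attribute("user.id", user_id)
            with tracer.start_as_current_span("load_profile"):
                pass  # talk to the database here
            with tracer.start_as_current_span("render_response"):
                pass  # build the response here

    handle_request("u-123")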

Culture

  • Don’t play whack-a-mole. Don’t jump to patching every single edge case; the system will change and edge cases may become obsolete, so throwing engineering time at them can be a bad investment.
  • Say no to dark matter developers: no one should develop in isolation or keep knowledge only in their heads.
  • Do not ignore technical debt. At least track it.
  • Increase runtime transparency. Transparency is good for you. (It has electrolytes.)
  • Practice blameless post mortems. Emphasis on blameless, here.
  • Share knowledge through pair programming and peer code reviews.
  • Run SRE hackathons for refactoring, or dedicated sprints. Reserve exclusive time to fix the things on SRE’s wishlist.

Update Sep 1st, 2018: Apparently, I got a couple of things wrong in my notes. Vanessa helped me out. With her permission, here’s a copy of a message I received:

Hi Dan! Thanks for attending and writing up a summary of my talk :) I really appreciate you taking the time.

I just wanted to help clarify a few points.

For the chaos engineering part, we don’t do this for business facing/trading systems. The example I referred to is for an internal facing infrastructure provisioning system. We reject a fixed percentage of provisioning orders in the production flow at random there.

For business facing/trading systems we typically do fault injection and stress tests only. We’ve been doing this for a long time to ensure we have confidence in both system capacity and our business logic/controls (e.g. if the market suddenly swings and generates 3x volume, or if someone fat-fingers a trade and puts in a completely wrong price or quantity).

With regards to SecDb, the bot does actually remove the code too :) The human just has to approve the code review that the bot raises, more as a control/audit than anything else. If the code reviewer says yes, then the bot removes the code and pushes the update automatically.

For JRM, I probably didn’t explain myself well enough on this one… the key point wasn’t so much the monitoring or what gets monitored, but the fact that the actual application monitoring is decoupled from application logic, and the monitoring config is also decoupled from the monitoring agent, i.e. each of those three things can have a different release cycle from the others. I will have a think on how to reword and express that better!

I really enjoyed the whole experience and I hope to see you at future SREcons again! :D

Best wishes, Vanessa

I apologize for the misunderstandings and hope they did not cause any trouble! Thank you very much Vanessa for taking the time to clarify things!

This shows how important it is to go and watch the video if one is interested in the whole story. Don’t ever trust my notes :) Even I don’t trust them fully.

Day Two: Know Your Kubernetes Deploys

As a retired infosec person, I do enjoy hearing about the progress the field is making, especially in the Kubernetes realm. We all know that Kubernetes is the new computing stack, right? Whatever your opinion on that, you might like Felix Glaser’s excellent talk about Shopify’s production security engineering efforts around deploying trusted images. Production Security Engineering at Shopify takes care of everything that happens below the application level. So we are talking Docker and containers here.

Felix argued that FROM foo:latest is the new curl | sudo sh. I could not agree more.

How to fix it, then?

Shopify has a gate service that decides which images are OK to pull and run. This service is called Kritis. Kritis is basically an admission controller that gates deployments to only use signed images. Rogue admins cannot deploy unsigned images anymore. Shopify wrote an attestor (that’s a term from the binauthz realm) called Voucher. Voucher runs a couple of checks on images before they are admitted into production.
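
Conceptually, the gate boils down to something like this toy Python sketch; the attestation store and the check names are placeholders of mine, not Kritis’ or Voucher’s actual API:

    # Toy sketch of an image admission gate in the spirit of Kritis/Voucher.
    # The attestation store and the check names are hypothetical, not Shopify's API.

    TRUSTED_ATTESTATIONS = {
        # image digest -> checks that passed and were signed off by the attestor
        "registry.example.com/app@sha256:abc123": {"signed", "no_critical_cves"},
    }

    REQUIRED_CHECKS = {"signed", "no_critical_cves"}

    def admit(image_ref: str) -> bool:
        """Allow a deployment only for digest-pinned images with all required attestations."""
        if "@sha256:" not in image_ref:
            return False  # mutable tags like :latest are rejected outright
        return REQUIRED_CHECKS <= TRUSTED_ATTESTATIONS.get(image_ref, set())

    assert not admit("registry.example.com/app:latest")     # FROM foo:latest, no thanks
    assert admit("registry.example.com/app@sha256:abc123")  # pinned and attested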

This raises the question: what to do in emergency situations and incidents, when we need to deploy quickly and cannot wait for all the checks and reviews? Turns out, one can still deploy if there is a special “break glass” annotation in the Kubernetes deployment. However, that immediately triggers a page to Shopify’s cloud security team. Then a security engineer jumps in to help with root cause analysis. Or to defend against the attacker.
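
The break-glass path might look roughly like this (the annotation key and the paging helper are made-up placeholders, reusing admit() from the sketch above):

    # Toy sketch of the "break glass" escape hatch: a deploy carrying a special
    # annotation is admitted even if unsigned, but security gets paged immediately.
    # The annotation key and page_security_team() are made-up placeholders;
    # admit() is the function from the sketch above.

    BREAK_GLASS_ANNOTATION = "deploy.example.com/break-glass"

    def page_security_team(reason: str) -> None:
        print(f"PAGE cloud-security: {reason}")  # stand-in for a real pager call

    def admission_decision(image_ref: str, annotations: dict) -> bool:
        if admit(image_ref):
            return True
        if annotations.get(BREAK_GLASS_ANNOTATION) == "true":
            page_security_team(f"break-glass deploy of {image_ref}")
            return True  # let it through, but loudly
        return False

    print(admission_decision("registry.example.com/hotfix:latest",
                             {BREAK_GLASS_ANNOTATION: "true"}))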

Cool thing!

Day Two: How We Un-Scattered Our DNS Setup and Unlocked New Automation Options

This was the talk I was most excited about. Not sure why. However, I was so busy during the talk that I could not take any notes. But here is a long write-up of the talk. 😇

Day Two: Managing Misfortune for Best Results

We all know the Wheel of Misfortune, an exercise that Google runs with its SREs to keep their intuitions sharp. The Managing Misfortune for Best Results talk was about how to design and deliver those scenarios.

A handful of factors make for a successful team. These are (in this order!):

  • Psychological safety
  • Dependability
  • Structure and clarity
  • Meaning. Very related to job satisfaction.
  • Impact. Interestingly, delivering value for Google is the least important factor. Huh?

This matches what Google published at re:Work. Re:Work is, BTW, a highly recommended read! I love that page!

The goal is to deliver a high-value training experience: a carefully calibrated stress load. Carefully calibrated, because we want teams to survive the training.

Regarding scenario selection: as a trainer, you have to order the learning path. Select scenarios covering the team’s recent study areas. Calibrate difficulty with experience (not every team gets the same training).

For cross-team exercises: Google follows the IMAG protocol (Incident Management at Google). Until 2016, different teams had different incident management protocols: some went straight to IRC, others to a shared doc. Different teams have different habits and cultures in handling incidents. Account for that.

Monitoring bookmarks. If your monitoring system provides bookmarking functionality, make use of it. Instead of the dungeon master saying “the red line is going up”, just link to a graph that shows exactly that. Or share a screenshot. The more real it looks, the better.

Maintain playbooks of useful outages. Keep a list of outages and re-use them for each team member. This works for conceptual outages, e.g. a bad binary or bad data hitting a server.
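
Even something as simple as this little Python sketch would do for such a playbook; the fields and the example scenario are entirely my own invention:

    # Minimal sketch of a Wheel of Misfortune scenario entry (Python 3.9+).
    # The fields and the example scenario are entirely my own invention.
    from dataclasses import dataclass, field

    @dataclass
    class Scenario:
        title: str
        trigger: str                 # what the dungeon master announces first
        expected_path: list[str]     # the diagnosis steps we hope to see
        difficulty: int              # 1 (new team) .. 5 (veterans)
        artifacts: list[str] = field(default_factory=list)  # dashboard bookmarks, screenshots

    playbook = [
        Scenario(
            title="Bad binary rollout",
            trigger="Error-rate SLO alert fires for the frontend",
            expected_path=["check recent releases",
                           "compare error rate per release",
                           "roll back the suspect release"],
            difficulty=2,
            artifacts=["bookmark of the error-rate graph"],
        ),
    ]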

Tips for the training session: someone should transcribe the session, basically keeping a log of the exercise. This helps in the debriefing, because it provides data on what was done in response to what. The log should be shared after the session.

The talk was over quickly. But then, instead of a long Q&A, a volunteer got onto the stage and the speaker ran a fictional exercise with that person. Kudos to them, brave move! Have a look at the video once it is out. It was really interesting!

Day Two: Food!?

By the end of day two, the never-ending supply of food made me think: when did they stop serving us food and begin feeding us? And why?

Day Three: Roundup

Another day I spent mostly in workshops. Later I had some other things to take care of and missed some of the talks or did not have time to take notes.

I remember the Delete This: Decommissioning Servers at Scale talk by Anirudh Ra from Facebook being very funny. I could feel the pain of having to drain machines in every single sentence. My colleagues and I had an awesome time listening to this talk. We may have our own story with machines not being drained in time. 😥

Conclusion

This time I had soooo many highly appreciated conversations that I almost forgot to take notes. We also had a production incident that I followed remotely to the extent possible. On top of that, I had some other things to take care of. Nevertheless, I learned a ton of new things and got to know more people from the community.

Thanks y’all and see you again in Brooklyn next year!

Special thanks fly out to Nora, my mentee and most critical spell checker. :)