SREcon Asia/Australia Day 3 (Report)

Last day of SREcon Asia 2018. Did I mention how important coffee is to get me started on a conference day? No? Coffee is much-needed. And the coffee here is great. ☕️😜

A word of warning: Today’s talks were packed with information. The report got very long. I apologize in advance!

Interviewing for Systems Design Skills

Sebastian Kirsch from Google Switzerland has conducted over 200 interviews, was a member of the hiring committee, and trained engineers in interviewing. We can check off credibility at this point, I think. When they started hiring SREs in the early days of that role, there was no job description because the job had not existed before.

  • Design a system for us in the interview
  • Be aware of performance needs, resource needs, and bottlenecks
  • Design from scratch for problems that have not been solved or existed before
  • They need to know why each selected component is a good fit

Other system aspects with regard to manageability and lifecycle:

  • How to maintain the components
  • How to scale the system
  • How to upgrade the systems (versioning)
  • How to migrate data or the system

In an interview, time is limited, so they can’t go into too much detail or complexity. But there is a trick: Build a simple system and scale it up. Even simple systems become complex within minutes just by scaling them up. That can be used to talk about complexity even in time-bounded interviews.

Example question: Design a system to copy a file to some machines. Easy, right? Just do a for loop and rsync the file to the target system. Now, scale! Ludicrous requirements are allowed:

  • File is 100GB in size
  • Once a day
  • 100k target machines
  • Machines are placed on the moon (uh, latency!)
  • Source machine has a 100Mbit NIC (and there goes sequential file transferring down the drain. Multicast, anyone?)

Sebastian didn’t solve the puzzle in his talk, as this was only a demonstration of how easy it is to scale a simple question into a complex engineering challenge.
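To get a feel for why the naive loop falls apart, here is a back-of-the-envelope sketch. It is mine, not from the talk, and it assumes decimal units (1 GB = 10^9 bytes, 100 Mbit/s = 10^8 bit/s), a fully saturated link, and no protocol overhead or latency at all (so forget the moon for a second).

```python
# Back-of-the-envelope estimate for the naive "for loop + rsync" approach.
# Assumptions (not from the talk): decimal units, a fully saturated NIC,
# no protocol overhead, no latency.

FILE_SIZE_BYTES = 100 * 10**9      # 100 GB
NIC_BITS_PER_SECOND = 100 * 10**6  # 100 Mbit/s
TARGET_MACHINES = 100_000
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_seconds(size_bytes: int, bits_per_second: int) -> float:
    """Time to push size_bytes through a link of the given speed."""
    return size_bytes * 8 / bits_per_second


one_copy = transfer_seconds(FILE_SIZE_BYTES, NIC_BITS_PER_SECOND)
all_copies_days = one_copy * TARGET_MACHINES / SECONDS_PER_DAY

print(f"one copy: {one_copy / 3600:.1f} hours")
print(f"all copies, sequentially: {all_copies_days / 365:.1f} years")
```

Under these assumptions a single copy takes about 2.2 hours, and the full sequential run takes about 25 years, for a job that is supposed to finish once a day. That is the kind of number that forces a candidate to rethink the design.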

How would one design such a question? Start with a very simple problem. Copy something. Process some data. Pass a message. It should be something from an area you are familiar with. And then just scale up. Add a crazy requirement so that the problem needs at least a hundred thousand machines to be solved. The next step is to find the bottlenecks in the system. Typical bottlenecks are disk IO, network bandwidth, and disk seek times. A good question has a couple of bottlenecks. Ideally, it should not be clear from the beginning what the bottleneck dimension is.

Fun fact: Designing these questions becomes harder over time as hardware contains more magic nowadays. Compared to the past, current computers are supercomputers, with tons of cores and unreasonable amounts of memory. A lot of problems fit into the memory of a machine with 96 cores and 1.5TB of RAM. And that is something you can just click together on Google Cloud Platform. Oh, and disk seeks? You can buy guaranteed IOPS on SSD in the cloud. Snap! Hardware takes all the fun out of engineering problems. Sebastian suggested just making the problem bigger (oh, did I mention that we want to serve all humans and extraterrestrials? Design for that!). The other option the interviewer has is to limit the candidate’s choice of machines. I like the first option better. We have IPv6, so multi-planet distributed systems should be at least addressable by now.

Another challenge for the interviewer is that at some point maths (e.g. total bandwidth calculation) is going to happen. That has to be checked, while the interviewer has to take notes and think about the next steps already. Solution: Have a cheat sheet with pre-calculated values for the anticipated solutions. Also on the cheat sheet: Possible follow-up questions.

Practice: Never go into an interview with a question that you have not tested. Testing on an external candidate is neither fair nor does it give you the required amount of signal. The first iteration is probably way too simple or way too complex. The problem description could be misleading. Practice can iron out all of these problems and calibrate your expectations. Test the questions with your coworkers. Highly relatable! I remember how surprised I was at how long it takes to solve one of our coding questions when you have never seen it before. I once spent an hour with a candidate sorting an array and still got a ton of signal out of that.

The hiring bar: The problem is not really whether or not a candidate is below or above the hiring bar. The problem is more that we do not know what the candidate’s skill level really is. So the hiring bar problem is more an uncertainty problem. There is a huge error bar or grey area. A lot of candidates will touch the hiring bar at some point. The interviewer’s job is to decrease uncertainty and become certain whether a candidate’s skill level is below or above the hiring bar. It is essentially a classification problem with a lot of uncertainty.

Interview time must be used to decrease uncertainty. Everything else is just time wasted. To gain as much signal as possible, start with making clear what the expectations are. This includes technical requirements. Furthermore, it includes making clear that you expect reasoning about choices and questions from the candidate to the interviewer. Careful steering enables the interviewer to use the time more effectively.

Hints from the practice:

  • Many candidates forget to provide concrete resource estimates (e.g. number of drives needed). Question: What goes on the purchase order for this system?
  • Sometimes there are no clear boundaries between the systems and how they work together. Question: You have N teams of engineers that are going to implement the system. What entities should each team work on?
  • Candidates may get stuck in precise arithmetic. So a hint could be to offer rounded values: A day consists of N seconds. Use that value to proceed. In the end, it is all about orders of magnitude.
  • The magic bullet: Candidates use a standard technology because they think we ask for knowledge. Question: How does this technology work?
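The rounding hint is easy to illustrate. The sketch below is my own and reuses the numbers from the file-copy question above: it computes the aggregate bandwidth needed to deliver 100 GB to 100k machines within one day, once with the exact 86,400 seconds and once with a day rounded to 10^5 seconds. The rounded value is my choice; the talk only said "N seconds".

```python
import math

# Total payload in bits: 100 GB (decimal) to 100k machines.
TOTAL_BITS = 100 * 10**9 * 8 * 100_000

EXACT_SECONDS_PER_DAY = 86_400
ROUNDED_SECONDS_PER_DAY = 10**5  # assumed rounding, easy to divide by


def required_bits_per_second(total_bits: int, seconds: int) -> float:
    """Aggregate bandwidth needed to move total_bits within the given time."""
    return total_bits / seconds


def order_of_magnitude(value: float) -> int:
    """Exponent of the largest power of ten that is <= value."""
    return math.floor(math.log10(value))


exact = required_bits_per_second(TOTAL_BITS, EXACT_SECONDS_PER_DAY)
rounded = required_bits_per_second(TOTAL_BITS, ROUNDED_SECONDS_PER_DAY)

print(f"exact:   {exact / 1e9:.0f} Gbit/s (10^{order_of_magnitude(exact)})")
print(f"rounded: {rounded / 1e9:.0f} Gbit/s (10^{order_of_magnitude(rounded)})")
```

Both results are several hundred Gbit/s and share the same order of magnitude, so the design conclusion (one 100Mbit NIC will not cut it) is the same either way. The rounded version just gets there without long division.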

How do we determine how well a candidate did?

Go into the interview with expectations. Compare the results with the expectations. Have expectations for each axis or dimension you are looking for. Does the designed system solve the problem? (Often, the system solves a different problem. Woot?) Good candidates trade resource dimensions for each other. E.g. use more CPU time to reduce network bandwidth. Did the candidate think about a possible SLO?

(Note: No photo here. Projector was broken and Sebastian ran the whole thing without slides. Playing it cool. Well done, sir!)

From the Q&A: Shall I read previous interview feedback on that candidate before going into the interview? Better not. It introduces an unfair bias and gives the first interviewers more weight in the process. At Google they do not propagate the actual feedback, but the topics, so that they minimize overlap without biasing the result.

Link to the talk: Interviewing for Systems Design Skills

Scaling Yourself for Managing Distributed Teams Delivering Reliable Services

Paul Greig is an SRE lead at Atlassian and talked about the challenges one faces in managing distributed teams. We are talking about 18 SREs and two Team Leads who manage two services, distributed across three geographic regions.

He started with the question of what a distributed team is. His definition ranges from a single worker somewhere in the world to teams where everyone is in a different place collaborating over the Internet.

Benefits of a distributed team:

  • Better geographic coverage
  • Talent can be hired regardless of where the talent wants to live
  • Colocation: The SREs can be closer to the developers (Not sure I got this one correctly. I’d say it often is the opposite?!)
  • Better On-call, because time zones.

Challenges of a distributed team:

  • Friction and lack of cohesion: The divide leads to absence of communication sometimes.
  • Duplication: When the right hand doesn’t know what the left hand does, work is often duplicated.
  • Costs: Distributed teams are actually expensive. The impact may be a little lower than that of a team sitting in the same room.
  • Imbalance: A day in the life of one engineer can be very different from that of another. Paul visited his team members and found that they work very differently.

Now, how to scale yourself (as a leader) in a distributed environment? Three aspects:

  • Presence
  • Planning
  • Balance.

Presence

Establish Trust. “This is a huge one!” Paul said. By being present and asking questions (which includes active listening), he built trust. Jumping to solving problems without establishing trust first doesn’t work. As leading a distributed team involves a lot of travel, you need to plan ahead to keep work and life in a healthy balance. Also, engineers sometimes like to travel to the HQ to mix and mingle. Schedule one-on-ones at a time that suits both the leader and the team member, who might be in a different time zone. Example: “After I sent the kids to sleep I spent two hours on being present (online) for the team in a different time zone.” Allow teams some time to think about a problem and get to results. There is a time gap, naturally. However, at some point ask to wrap an issue up. It’s mostly about reasonable expectations. Participate rather than rushing the process.

Planning

Local team vs. remote team: When working with the local team the whole day, how can I make sure that I give the same amount of attention to remote team members? Plan for delays in pull requests. The velocity of big projects can slow down if you are waiting for feedback from a different time zone (and that time zone is currently asleep). Is separating projects and assigning them to different time zones a solution? Not really! So in the retro they introduced end-of-quarter demos: regardless of the project state, there had to be a demo. That gave the teams a common goal, brought velocity back into the project, and kept everyone on the same page.

Balance

Maintain balance for yourself. But also for your team. The team will only remain successful if they take care of their individual balance as well. This includes work-life balance, obviously. Balance three aspects 👩🤚❤️: Head (mental), Hands (practical), and Heart (emotional).

Money Quotes:

  • We saw disheartening duplicate solutions to the same problem. Great for redundancy, but expensive.
  • Inspiration: I want to see a thirst from the engineers of the team to arrive at a fantastic outcome.
  • We did not have these “sitting together at lunch time” sort of moments. So we played Keep Talking and Nobody Explodes. (Also an awesome incident management communication training)
  • I am not saying video games are the solution to every problem, but… 😂
  • Listen, Ask, Tell

Paul published his Talk Resources

Link to the talk: Scaling Yourself for Managing Distributed Teams Delivering Reliable Services

Mentoring: A Newcomer’s Perspective

Leoren Tanyag from REA Group grew up with little exposure to technology, which made her want to go into exactly this industry. She went through the graduate/internship program of REA and shared her experiences.

REA acknowledges that school is different and an un-mentored environment is not something they want their people to be exposed to. They communicate expectations on what to learn and where to start. Of course the mentee has a say in driving the direction. Emphasis is on pairing with co-workers.

Concept of the three sixes:

  • On your first 6 days: I don’t know everything
  • On your first 6 weeks: I know what I’m doing. No, you don’t!
  • On your first 6 months: I may know what I need to know, now I have to continually improve.

Interestingly: Even after ten years people often look up to their mentors.

Leoren reports that mentees’ expectations are often not very high. They are so happy that they can learn from someone they respect. So there is no excuse for an experienced engineer not to mentor. Even if you are not very experienced, there is something you can share and a mentee can learn from you. I cannot agree more. I, personally, expect every engineer in my team to share knowledge. Mentoring is one form of that, and it is everyone’s responsibility to make every other team member succeed.

Forms of mentoring (different in structure and frequency):

  • One-on-one mentoring: Highly tailored to a mentee’s needs
  • Group Learning sessions: More efficient if multiple mentees share the same interest (e.g. learning a new technology). Also, some mentees feel less pressure in a group and enjoy having a group to talk to afterwards. Interesting aspect I have not thought about before!
  • Casual Pairing: Supports bonding and transfers hands-on knowledge. Communication heavy, obviously.

Money Quotes:

  • Good Mentors come from those that have been mentored before. Start the trend!
  • Teaching is the best way to learn.
  • Mentoring can help us grow (mentors become role models and may hold themselves to higher standards once they realize that)

From Q&A: How should mentoring relationships be started? Driven by mentee or mentor?

Leoren: We have a queue of people who agree to volunteer time. We use vouchers to get people in touch with each other.

Link to the talk: Mentoring: A Newcomer’s Perspective

(Unrelated side note: Was this Comic Sans in the slides? Looks like Comic Sans but then it also looks different. Why do I care about this at all? Why are fonts such an interesting thing?)

Blame. Language. Sharing: Three Tips for Learning from Incidents in Your Organization

Lindsay Holmwood from Envato started with a story from when he was a teenager and was challenged with a serious illness. From there it went up and down (storywise), including topics like the emergency room, cancer, surgery leftovers remaining in the body, and open heart surgery. A mistake was made in the treatment, and it was made sure that the person who made the mistake would not practice medicine anymore.

Years later, when he was running a cancer-related charity, he accidentally took down the website in the middle of one of the most important campaigns. He did not receive any punishment for that.

Comparing these two stories, he realized that the two organisations responded to failure and errors differently. Some look for someone to blame, some don’t. In the first story, the narrative for safety is the absence of failure (e.g. in process, technology, people). If you think this through, this maintains a culture where any of these three can be a danger. People are a hazard to safety here. Not necessarily a nice environment to work in, right? This all refers to a bi-modal operation mode. It is either safe, or it is not working at all.

In our industry, things are a bit different. Our systems are so large and complex that they are always running in a partially degraded state. But they still work, often still generating revenue, even though they are in a degraded state. That asks for a different culture. We have to embrace different perspectives: even contradictory statements (think: views) about the state of a system can both be true at the same time.

The people doing the frontline work, in our culture, focus on quality and delivery. But they do not focus on covering their mistakes.

Lindsay then identified the three aspects of our culture he thinks are most important:

  • Language
  • Blame
  • Sharing

Language

“Why” is a good word to use for talking about systems. Why did that action take so long? But it is a bad word to use with regard to people. It often carries some blame. Why didn’t you follow the runbooks? Be careful using “Why”. You do not want to question someone’s personality, right? “How” is better, but may limit the scope too much. How did that happen? “What” ties in better with local rationality. The latter is a concept that assumes that people make trade-offs with good intentions, based on the limited information they had at the given time. Think about someone moving forward in a dark tunnel with a small flashlight. You never see the whole picture, only what you point the flashlight at. In hindsight, of course, things are always different. So “What” asks what the tunnel looked like at the time you were inside it.

Blame

Thinking of people as hazards ignores one truth: Sometimes bad things happen and no one is to blame. Things go right more often than they go wrong. Finger pointing is basically a cognitive bias. It’s something we have little control over. Our brain (perception) makes trade-offs between timeliness and accuracy all the time when processing information. That often brings problem solving down to heuristics. Lots of psychology here! He played a classic in confirmation bias: The monkey. Knowing that, think about how well we will do during an incident. Hooray for those who are not biased. Spoiler: No one is immune to bias. Another thing to watch out for: In hindsight we tend to assign a failure to a person, not a system. The more negative the outcome is, the more biased we are towards blaming a person, not a system.

Sharing (This part was so interesting, I forgot to take notes. Sorry!)

My takeaway: As a leader in SRE, it is your responsibility to create a psychologically safe environment. That is quite in line with my personal beliefs and also with what I just recently read in Leaders Eat Last. Is this bias in action? 🤔 Oh, this is becoming meta now! 🤪 [But seriously, read the book and learn how we (ex-)military leaders are wired up internally and how we build teams where people are willing to risk their lives for each other!]

This is quite a long writeup of the talk, but it is only a part of it. There was so much information packed into that hour. If you want to get the whole picture, bookmark the talk and wait for the video to be released. There you can also learn how Lindsay’s teenage story about life and death ended. I won’t spoil it here. (Ok, you obviously figured out that he survived. But the details matter!)

Link to the talk: Blame. Language. Sharing: Three Tips for Learning from Incidents in Your Organization

A Theory and Practice of Alerting with Service Level Objectives

Yesterday evening, at the reception, I asked Jamie Wilkinson (Google) a ton of questions related to a tricky SLO measurement problem we currently face at work. Every now and then he would refer to his upcoming talk. So I went into this one with expectations set high.

He explained the difference between alerting on cause and alerting on symptoms. While it is convenient to know exactly what went wrong (cause), it is much better to alert on symptoms. Why? Because it results in fewer alerts and avoids alert fatigue much better. Every alert disturbs a person’s life when on-call. And we care about people more than we care about a dying disk in the end, right?

Focus on alerting on a very small set of expectations that matter for the users of the service. Do not focus on alerting on all the nitty-gritty details, timeouts, and other metrics that may cause a problem.

How do we distinguish between a symptom and a cause? One rule is to ask yourself: Does this affect the user’s experience? For example, if you have reasonable latency, who cares about the depth of a queue in a backend system? No need to alert for that, we can solve it during office hours. Or maybe decide to not touch it at all, as it might just be fine.

There was also some discussion about where to measure time series and what kind of time series. For example, it is probably easy to increase request counters within the application. However, measuring at the load balancer enables us to also catch those crashing services as they become unavailable. I run a service at work where we use both. Mostly because we agreed to an unfortunate SLO that turned out to be extremely tricky to measure with the given tools. Even Matt suggested that we may be up for an adventure trying to visualize that specific SLO using the current feature set of Stackdriver. Setting SLOs is hard…

Back to the talk: Jamie introduced the concept of SLO burn rate. I like that very much. Alerting on the burn rate is something I definitely will bring back to my team in Munich.
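For readers who have not met the term: burn rate is commonly defined as the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 uses up the error budget exactly at the end of the SLO window. The following is a minimal sketch of that definition, not Jamie’s implementation. The function names, the 99.9% SLO, the 30-day window, and the request numbers are all made up for illustration.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if requests == 0:
        return 0.0  # no traffic, nothing burned
    allowed_error_ratio = 1.0 - slo_target
    return (errors / requests) / allowed_error_ratio


def days_until_budget_is_gone(rate: float, window_days: float) -> float:
    """How long a full error budget lasts at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Hypothetical numbers: 50 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(errors=50, requests=10_000, slo_target=0.999)
days_left = days_until_budget_is_gone(rate, window_days=30)

print(f"burn rate: {rate:.1f}")
print(f"budget lasts: {days_left:.1f} days")
```

Here 0.5% of requests fail against a 99.9% SLO, which gives a burn rate of about 5: a 30-day budget would be gone in roughly 6 days. Alerting on this number, rather than on every raw error, is what ties the page back to what users actually experience.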

Money Quotes:

  • If we don’t use the error budget to our advantage, it will just be lost. Use the error budget!
  • If a microservice crashes in the cloud and nobody notices, does it make a sound? 🤔😂(I’d add to this: Should it make a sound? Why? Why not?)

Link to the talk: A Theory and Practice of Alerting with Service Level Objectives

Production Engineering: Connect the Dots

Espen Roth talked about how graduates are onboarded in Production Engineering at Facebook. And what challenges come with that. He was hired by Facebook right after school, without a previous internship there, so he experienced the process firsthand.

New graduates build things from scratch in school. They ran small self-contained projects that they had to finish in time. They know about what is currently out there. What they lack is experience in maintaining a system in the long term. Teamwork is even more important in the job world and, unlike school, in a job you are there for the long run. And sometimes you have little choice in what you work on. The grades don’t matter anymore, but impact does. Working in Production Engineering is very different from school.

As a company interviewing graduates, you want to look at three aspects:

  • Capacity (skills, learning)
  • Passion
  • Opportunity

Internships are the ultimate interviews. Internships provide a long exposure to feedback and hiring a former intern requires little ramp-up.

Link to the talk: Production Engineering: Connect the Dots

Mental Models for SREs

Mohit Suley (Microsoft) woke up the audience by asking a couple of questions to test our mental models before the actual talk started.

The first mental model he talked about was the survivorship bias. The connection to SRE is that survivorship bias can influence metrics.

Then he went on to introduce the ludic fallacy and explained it using examples. We laughed at a system that was designed with too little login capacity. Another example was Tay, which still seems to be something that people inside Microsoft joke about. No plan survives first contact with the enemy…

Then I learned about decision fatigue. So obviously, the capacity to make decisions decreases over time. A practical piece of advice is, therefore, to not go online shopping when you’re on-call.

It got better! If a thing has been proven over a long period of time, we call this the Lindy Effect. How to connect this to the SRE world? Mohit suggested that even in the age of cloud, client computing is not going down. And in fact, our devices are more powerful than ever. The cloud (remote compute) just grows faster. Another one: Email is still around in 2018. Probably here to stay.

There is much more:

Link to the talk: Mental Models for SREs

My Summary

This is my personal summary which comes without further explanation. Think of this as a note to myself that accidentally went public:

  • Learn more about rubrics and how I can use them to gain better signal from interviews.
  • Managing a distributed team requires thoughtful and patient leadership.
  • Keep mentoring. It means a lot to mentees. Also, being a mentor and/or a mentee is a learning experience.
  • Humans are biased. Providing psychological safety is key to good leadership.
  • Symptom based alerting. SLO burn rate alerting. It’s better for human health. It makes so much sense!
  • Mental models are fun! Easy to get lost in Wikipedia when you start looking them up…

Wow, you made it to the end of this very long article! Thank you so much for reading. As always, feedback is highly appreciated. As a reward, have this picture of a cute squirrel I met earlier this year in New York: