SREcon Asia/Australia Day 1 (Report)

It is SREcon time again! This iteration takes place in beautiful Singapore! 🌴 I flew in yesterday on a half-empty A380. First time on an A380 (an amazing piece of engineering!), first time visiting Asia. It’s also the first time I’m staying in the same hotel where the conference I’m attending takes place. Not the first time I’m investing my savings into getting up to speed with a new role, but maybe the first time I was this generous to myself. Anyway, here’s my report from day one.

The organizers urged us attendees to show up early to pick up badges and have breakfast. A minor hiccup with the network prevented the conference badges from being downloaded, so we all headed for the much-needed coffee and breakfast first. Some attendees, I learned, flew in at midnight and were up again at 7 am to help organize. Wow. Badge pickup started a few minutes later than planned but, as far as I understand, was still within the agreed-on SLO range. 😉

Opening Remarks

Paul Cowan and Xiao Li welcomed us and I learned this is the second SREcon in Asia. The organizers set up an on-call room for those of us who could not get ourselves out of rotation. Awesome service! A quick reminder of the USENIX Code of Conduct and the SREcon Slack followed. Some stats: 58 speakers and over 300 attendees this time. More companies, more diversity, and 2% more engineers. Hooray for engineers! 👩‍💻👩‍🔧

The Evolution of Site Reliability Engineering

Benjamin Purgason (LinkedIn) shared his experience with running an SRE team. When he joined the team, on-call was in-office and regular site outages were happening whenever the sun rose over California. It was incredibly helpful to learn that the big players had problems like this too; it’s not only us. The founding principles for SRE at LinkedIn are:

  • Site Up (website and backend services)
  • Empower Developer Ownership
  • Operations is an Engineering Problem (They don’t want heroic actions in Ops, but rather build reliable software in the first place.)

I learned about the evolutionary steps of an SRE:

  • The Firefighter: Purely reactive, Incident Management all the time
  • The Gatekeeper: Change control. Protect “our” (SRE) site from “them” (Software Engineers). This is an evolutionary dead end; a team can get stuck there. Don’t do that!
  • The Advocate: Creating a reliability culture. Rebuilding trusted relationships. Still reactive to Software Engineering plans.
  • The Partner: Empowering intelligent risk. Proactive and joint planning with Software Engineering. Collaborating to magnify the impact.
  • The Engineer: Reliability throughout the software lifecycle. Proactive, one plan for SRE and SWE. Everyone has the same job: Help the company win.

Money Quotes:

  • Every day is Monday in Operations.
  • What gets measured gets fixed!
  • If you solve your biggest problem every day, you start with 100 problems and still have 100 problems a year later. But they have a smaller scope by then.
  • Human gatekeeping doesn’t scale.
  • Attack the problem, not the person.
  • There is no such thing as ‘the hole is in your side of the boat.’ (Fred Kofman)
  • How do you want to spend your time? Helping me build a reliable site, or helping me fight fires at 3 am?
  • Do not insulate, share the pain.
  • Contribute where it counts.
  • Unify SWE and SRE planning and priorities.

Link to the talk: The Evolution of Site Reliability Engineering

Safe Client Behavior

Ariel Goh from Google Sydney dug into the problem of handling over two billion Android clients with a significantly lower number of servers. Essentially, safe client behavior means: do not DDoS your own backend. Unsafe requests include periodic retries that are not safeguarded by proper backoffs, and unintentionally synchronized requests. The worst thing that can happen is the backend (the servers) going down. Here’s what Ariel suggested for safe client behavior:

  • Add jitter to client code; do not sync periodically without at least some randomness in the backoff time.
  • A synchronized startup does not seem like a problem, because not everyone starts their app at the same time, right? Well, some apps run background tasks that are bound to a specific time, e.g. synchronizing at 4 am. Adding jitter to the startup can help here.
  • Do not retry by default!
  • Retry with jitter and capped, exponential backoff and you are a much better behaving citizen.
  • Do not retry on out-of-quota or client errors (e.g. HTTP 4xx errors).
  • Do (carefully) retry on network and server errors (e.g. HTTP 5xx errors).
  • Implement the Retry-After header in both client and server.
  • Improve debugging by adding tags to requests, including the client name and version, the feature that triggered the request, and whether the request is the initial attempt or a retry.
  • On the server side: Prioritize interactive requests over background requests.
  • Additional tips for microservices: Have retry budgets and adaptive throttling. (The reasoning here is that microservices in your managed infrastructure probably have more insight into the state of the overall system than some random clients out there in the wild.)

Example code for adding jitter:
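The actual code is in the slides; as a stand-in until they are published, here is a minimal Python sketch of the pattern (function names and constants are my own, not from the talk): capped exponential backoff with full jitter, honoring a server-supplied Retry-After, plus startup jitter for those fixed-time background syncs.

```python
import random

def backoff_delay(attempt, base=0.5, cap=60.0, retry_after=None):
    """Seconds to wait before retry number `attempt` (0-based).

    If the server sent a Retry-After value, honor it ("move control
    to the server"). Otherwise use capped exponential backoff with
    full jitter so clients spread out instead of retrying in lockstep.
    """
    if retry_after is not None:
        return retry_after
    exp = min(cap, base * (2 ** attempt))  # exponential growth, capped
    return random.uniform(0, exp)          # full jitter

def jittered_start(scheduled_time, max_jitter=300.0):
    """Spread a fixed-time background task (e.g. the 4 am sync) over
    a window so clients do not all hit the backend at once."""
    return scheduled_time + random.uniform(0, max_jitter)
```

Full jitter (uniform in [0, cap]) is only one variant; the slides compare several, which is where the eye-opening graphs come in.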

Make sure to get your hands on the slides once they are published. There are a lot of graphs in there showing the effects of different variants of jitter and backoff code. Eye-opening!

Ariel summarized the talk as follows:

  • Jitter everything
  • Don’t retry
  • If you retry, back off
  • Move control to the server
  • Expose info to the server
  • Use retry budgets
  • Use adaptive throttling

Link to the talk: Safe Client Behavior

Service Monitoring Manual - 2018 Edition

Nikola Dipanov from Facebook’s Production Engineering talked about monitoring in production. First, we have to ask the right question: what to monitor? You may want to monitor different things depending on whether you are collecting data for a developer audience or for customers who are more interested in an SLA.

Levels on which data collection happens:

  • Host level
  • Service level
  • Mesh level (referring to the service mesh, the networking layer in a sense)
  • Rack/Cluster/Pod/… level (higher levels, failure domains)

Most of the talk was pretty basic, suggesting the use of a time series database (what else?). However, there were interesting insights into how Facebook deals with monitoring challenges. They have open sourced a couple of their tools, believe in structured logs, and are able to aggregate and query structured data using an internal tool called Scuba. 📈

My highlight of the talk: war stories from Facebook Production Engineering. But I won’t spoil those; watch the recording once it is out. 🤫

Money Quotes:

  • Data hopefully becomes the lingua franca in your engineering organization.
  • Monitoring should be like git: Init on project start and be there for the whole lifecycle.
  • Do not wake up people for noise.

Link to the talk: Service Monitoring Manual - 2018 Edition

Doing Things the Hard Way

The more forgiving right-after-lunch time slot was taken by Chris Sinjakli from GoCardless. He did not need any forgiveness for the talk’s content, which was great. But the AV setup wasn’t forgiving of his USB-C MacBook. I gave him my older MacBook for the presentation and used his shiny new one to take notes. (I want my old keyboard back…)

The dangers of hiring a DevOps engineer when you have an infrastructure problem: it creates a new bottleneck, as everything then goes through DevOps. Instead, make contributions to infrastructure easier. Make it obvious to developers what to change, and how, in order to modify the infrastructure. That enables developers to contribute to the infrastructure code. So when hiring someone for infrastructure, make sure they have a developer background.

Observability pays off in the longer term. It has to permeate everything you do to provide more value. Results include:

  • Faster debugging
  • Shorter outages

Another point I took home: once you change the core of your infrastructure, you may end up with an Everything project. A change that touches everything risks not changing anything at all in the end. So where to start? Stop building with the whole new world in mind; build the smallest version possible.

Money Quotes:

  • In reality, the hard problems are not necessarily the most important problems.
  • Features are not done when shipped, but done when measured.
  • The one leap into the perfect infrastructure is ludicrous.
  • Do not rewrite everything from scratch.
  • You won’t avoid every mistake. It’s perfectly fine to correct…

Link to the talk: Doing Things the Hard Way

Achieving Observability into Your Application with OpenCensus

OpenCensus developer and former Google SRE Emil Mikulic introduced the OpenCensus framework. My team recently started using OpenCensus in new Golang microservices and we love it. The talk was about distributed tracing, explaining traces and spans. For good propagation, you have to generate the trace ID and span IDs as early as possible. This metadata is then propagated using HTTP headers. (I use gRPC often and get this for free there. Highly recommended!) One probably wants to add application-level metrics (e.g. queue lengths) to the data that comes out of OpenCensus.
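To make the propagation idea concrete, here is a minimal Python sketch (my own illustration, not the OpenCensus API; the helper names are hypothetical, and the header follows the W3C traceparent layout): generate the trace ID once at the edge, mint a fresh span ID per hop, and carry both in an HTTP header.

```python
import secrets

def new_trace_context():
    """Generate the trace ID and a span ID as early as possible,
    i.e. when the request first enters the system."""
    return {"trace_id": secrets.token_hex(16),
            "span_id": secrets.token_hex(8)}

def inject(ctx, headers):
    """Propagate the trace metadata downstream in an HTTP header
    (laid out like W3C traceparent: version-trace-span-flags)."""
    headers["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-01"
    return headers

def extract(headers):
    """On the receiving service: keep the caller's trace ID, remember
    the caller's span as parent, and mint a fresh span ID for this hop."""
    _, trace_id, parent_span_id, _ = headers["traceparent"].split("-")
    return {"trace_id": trace_id,
            "parent_span_id": parent_span_id,
            "span_id": secrets.token_hex(8)}
```

With gRPC this bookkeeping rides along in request metadata instead, which is exactly the “for free” part.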

If you are just starting with tracing, look into OpenCensus. I think it is the new standard, and we use it all the time on my team.

There was a cool demo. The code is on GitHub.

Link to the talk: Achieving Observability into Your Application with OpenCensus

Comprehensive Container-Based Service Monitoring with Kubernetes and ISTIO

Being a huge fan of ISTIO, I had to go to Fred Moyer’s talk about Kubernetes and ISTIO. Fred works for Circonus. Fun fact, he wrote the very first ISTIO adapter and got awarded with a ship in a bottle for that. ⛴

After a quick overview of the ISTIO components, Fred demonstrated the Bookinfo example app. If you have, like me, already played a bit with ISTIO, this specific part of the talk will not provide too many new insights. I liked that he put the kubectl output on the slides rather than showing it in a small terminal window. That makes the talk more approachable for people watching the recording later.

Much has been said about the Four Golden Signals. Fred showed how a different set of metrics, called RED (Rate, Errors, Duration), can be gathered with ISTIO:

  • Rate: We have the number of requests and also get the ops per second on the ISTIO standard dashboard. That was easy!
  • Errors: We have the number of requests by HTTP status code. From that, we can derive the errors easily.
  • Duration: The best approximation may be the request duration percentiles. However, there are some dangers to that: they are an aggregated metric and may hide a bad tail.

The way to go for measuring durations may be the histogram. Histograms make effects visible that would be hidden by percentiles. Also use heatmaps, of course. I love heatmaps! I learned that writing custom metrics adapters for ISTIO is not very hard.
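Why histograms beat percentiles for aggregation: per-server percentiles cannot be re-aggregated, since averaging them yields a number that is not a percentile of the combined traffic, while histograms merge losslessly. A tiny Python illustration (my own, not from the talk):

```python
def merge_histograms(histograms):
    """Merge per-server latency histograms (same bin edges) by summing
    counts. Unlike percentiles, histograms aggregate losslessly."""
    merged = {}
    for hist in histograms:
        for edge, count in hist.items():
            merged[edge] = merged.get(edge, 0) + count
    return merged

def percentile(hist, q):
    """Approximate the q-th percentile from a histogram by walking the
    cumulative counts; returns the upper edge of the matching bin."""
    total = sum(hist.values())
    cumulative = 0
    for edge in sorted(hist):
        cumulative += hist[edge]
        if cumulative >= q * total:
            return edge
    return max(hist)

# Server A answers everything in 10 ms; server B is having a bad day.
server_a = {10: 100}    # bin upper edge (ms) -> request count
server_b = {500: 100}

# Averaging the per-server p99s suggests 255 ms, but the merged
# histogram shows the true combined p99 is 500 ms.
avg_p99 = (percentile(server_a, 0.99) + percentile(server_b, 0.99)) / 2
true_p99 = percentile(merge_histograms([server_a, server_b]), 0.99)
```

This is also why “percentiles are an output, not an input”: compute them at the end, from merged histograms, not before.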

Fun story: with the metrics that ISTIO provides, we can measure the number of rage clicks (user-induced retries), an indirect indicator of customer satisfaction. 😂

If you deal with SLIs or SLOs, you want to watch this talk. Highly recommended!

Money Quotes:

  • Percentiles are an output, not an input!
  • If you work with percentiles as SLI, ask yourself: Can you do better?
  • The code is hosted on Microsoft… opens Good one! 🙃
  • Monitor services, not containers!

Link to the talk: Comprehensive Container-Based Service Monitoring with Kubernetes and ISTIO

Randomized Load Balancing, Caching, and Big-O-Math

Julius Plenz from Google began by letting us know that he won’t do the hard math on the slides but rather use visualizations. Very much appreciated! He started with bins of servers receiving requests. With random load balancing, those requests are not uniformly distributed across servers. From that we derive a metric called the peak-to-average ratio. We have to provision for peak load, so the natural thing to do is to reduce the peak-to-average ratio.

We can, with high probability, predict the peak value for a server. One way to reduce the peak-to-average ratio is to scale vertically instead of horizontally. That’s not always possible, though. When you scale horizontally, the peak-to-average ratio becomes statistically worse. Typical peak-to-average ratios range from 1.25 to 1.4. The more you scale out your system, the worse it gets (if you provision for peak load).
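The effect is easy to reproduce with a quick simulation (my own sketch, not from the talk): keep the total traffic fixed, split it over more servers, and watch the peak-to-average ratio climb.

```python
import random

def peak_to_average(total_requests, num_servers, seed=0):
    """Assign requests to servers uniformly at random and return
    the peak-to-average load ratio (max load / mean load)."""
    rng = random.Random(seed)
    load = [0] * num_servers
    for _ in range(total_requests):
        load[rng.randrange(num_servers)] += 1
    return max(load) / (total_requests / num_servers)

# Same total traffic, spread over more (and thus smaller) servers:
# the relative peak grows, so provisioning for peak gets more wasteful.
small_fleet = peak_to_average(100_000, 10)
large_fleet = peak_to_average(100_000, 1_000)
```

Intuitively, each server’s load is roughly Poisson, so its relative fluctuations scale like one over the square root of its mean load; smaller per-server load means relatively taller peaks.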

Money quotes:

  • Math to the rescue!
  • Don’t scale instances with traffic 1:1.
  • Moore’s law makes non-linear scaling more affordable over time.
  • Randomized load balancing is good if you have many things.
  • Randomized load balancing becomes worse if you scale your system in the wrong way.
  • Pay attention to the size of the (frontend) cache.

From the Q&A:

  • Usually, we cannot scale sublinearly. So the question here is not how to scale sublinearly, but how to design the system so it does not scale too far above linear.
  • There are better load balancing strategies than randomization. However, beware of feedback loops! This is, in the end, an engineering question: are you willing to sacrifice another roundtrip to learn about a server’s load before sending a request there?

Link to the talk: Randomized Load Balancing, Caching, and Big-O-Math

Cultural Nuance and Effective Collaboration for Multicultural Teams

Another talk I was super excited about. I spent the better part of my career in the military. While I learned some unique crisis-solving skills there, working in a multicultural team was not a strong focus in that environment. Unless you count the Western-dominated NATO as a multicultural institution.

Ayyappadas Ravindran from LinkedIn presented three stories of intercultural experiences from his career. I am going to spoil one of them and leave the other two for the interested reader to check out by watching the recording.

Ayyappadas had his first one-on-one with his manager via phone. His manager kept asking “What do you mean?” when they talked. That made him feel insulted: was the manager implying he was not capable of understanding what he was talking about? All of this came across as rude to Ayyappadas. When he met his manager in person, however, he learned that his manager was a really nice person who held a very high opinion of him. How come? The key is the cultural difference: his manager, coming from a low-context culture, genuinely wanted to know what Ayyappadas meant when he asked the question. But Ayyappadas, coming from a high-context culture, interpreted the question and understood it in a very different way.

I can highly recommend this talk!

Money quotes:

  • Look for what people mean and not what people say.
  • When in doubt, ask and do not assume.

Link to the talk: Cultural Nuance and Effective Collaboration for Multicultural Teams

My Summary

This is my personal summary which comes without further explanation. Think of this as a note to myself that accidentally went public:

  • Develop SRE to become a partner in crime with the devs, not a police force for their code. Kind of hard when the code is barely production-ready. No one said it would be easy, right?
  • ISTIO and OpenCensus are the way to go. I’m glad we are already on it and gaining experience with those frameworks.
  • Really cool how the community builds these flexible frameworks (Kubernetes, ISTIO, OpenCensus) which are inclusive of all kinds of underlying systems and connected TSDBs and log storage systems.
  • Histograms! We need more histograms! Don’t be afraid of non-uniform bin sizes.
  • Lee Kuan Yew, the founding father of Singapore, once claimed that air conditioning enabled Singapore’s success as much as multicultural tolerance did. But does that mean every room must be chilled down this much? I did not pack warm clothes, and I wish I had. It is freezing cold in the conference rooms. ⛄️