It is SREcon time again! This iteration takes place in beautiful Singapore! 🌴 I flew in yesterday in a half-empty A380. First time on an A380 (amazing piece of engineering!), first time visiting Asia. It’s also the first time I’m staying in the same hotel where the conference I’m attending takes place. Not the first time I’m investing my savings into getting up to speed with a new role, but maybe the first time I was very generous to myself. Anyway, here’s my report from day one.
The organizers urged us, attendees, to show up early to pick up badges and have breakfast. A minor hiccup with the network prevented the downloading of the conference badges, so we all headed to the much-needed coffee and breakfast first. Some, I learned, flew in at midnight and were up again at 7 am to help organize. Wow. Badge pickup started a few minutes later than planned but, as far as I understand, was still within the agreed-on SLO range. 😉
Paul Cowan and Xiao Li welcomed us and I learned this is the second SREcon in Asia. The organizers set up an on-call room for those of us who could not get ourselves out of rotation. Awesome service! A quick reminder of the USENIX Code of Conduct and the SREcon Slack followed. Some stats: 58 speakers and over 300 attendees this time. More companies, more diversity, and 2% more engineers. Hooray for engineers! 👩💻👩🔧
The Evolution of Site Reliability Engineering
Benjamin Purgason (LinkedIn) shared his experience with running an SRE team. When he joined the team, on-call was in-office and regular site outages were happening whenever the sun rose over California. It was incredibly helpful to learn that the big players had problems like this, it’s not only us. The founding principles for SRE at LinkedIn are:
- Site Up (website and backend services)
- Empower Developer Ownership
- Operations is an Engineering Problem (They don’t want heroic actions in Ops, but rather build reliable software in the first place.)
I learned about the evolutionary steps of an SRE:
- The Firefighter: Purely reactive, Incident Management all the time
- The Gatekeeper: Change control. Protect “our” (SRE) site from “them” (Software Engineers). It is an evolutionary dead end, a team can get stuck in there. Don’t do that!
- The Advocate: Creating a reliability culture. Rebuilding trusted relationships. Still reactive to Software Engineering plans.
- The Partner: Empowering intelligent risk. Proactive and joint planning with Software Engineering. Collaborating to magnify the impact.
- The Engineer: Reliability throughout the software lifecycle. Proactive, one plan for SRE and SWE. Everyone has the same job: Help the company win.
- Every day is Monday in Operations.
- What gets measured gets fixed!
- If you solve your biggest problem every day, you start with 100 problems and still have 100 problems a year later. But they have a smaller scope by then.
- Human gatekeeping doesn’t scale.
- Attack the problem, not the person.
- There is no such thing as ‘the hole is in your side of the boat.’ (Fred Kofman)
- How do you want to spend your time? Help me build a reliable site or help me at 3 am fighting the fire?
- Do not insulate, share the pain.
- Contribute where it counts.
- Unify SWE and SRE planning and priorities.
Link to the talk: The Evolution of Site Reliability Engineering
Safe Client Behavior
Ariel Goh from Google Sydney dug into the problem of handling over 2bn Android clients with a significantly lower number of servers. Essentially, safe client behavior means Do Not DDoS. Unsafe requests include periodic retries which are not safeguarded by proper backoffs and unintentional syncing. The worst thing that can happen is the backend (servers) going down. Here’s what Ariel suggested for safe client behavior:
- Add jitter to client code, do not sync periodically without having at least some randomness in the backoff time.
- A synchronized startup does not seem like a problem, because not everyone starts their app at the same time, right? Well, some apps do background tasks that are bound to a specific time, e.g. synchronize at 4 am. Adding jitter to the startup can help here.
- Do not retry by default!
- Retry with jitter and capped exponential backoff and you are a much better-behaved citizen.
- Do not retry on out of quota or client errors (e.g. HTTP 400 errors)
- Do (carefully) retry on networks and server errors (e.g. HTTP 500 errors)
- Honor the Retry-After header in both client and server.
- Improve debugging by adding tags to requests including client name and version, the feature that triggered the request, if the request is the initial one or a retry.
- On the server side: Prioritize interactive requests over background requests.
- Additional tips for microservices: Have retry budgets and adaptive throttling. (The reasoning here is that microservices in your managed infrastructure probably have more insight into the state of the overall system than some random clients out there in the wild.)
Example code for adding jitter:
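The actual slide code didn’t make it into my notes, so here is a minimal sketch of the pattern Ariel described — full jitter on top of capped exponential backoff, retrying only on server errors (function and parameter names are mine):

```python
import random
import time

def backoff_delays(base=0.5, cap=60.0, attempts=5):
    """Capped exponential backoff with full jitter.

    Sleeping a random duration in [0, min(cap, base * 2**n)] spreads
    clients out instead of letting their retries synchronize.
    """
    for n in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** n))

def retry(request, retryable=lambda status: 500 <= status < 600):
    """Retry on server errors (5xx) only; never on client errors (4xx)."""
    for delay in backoff_delays():
        status = request()
        if status < 400:
            return status  # success, no retry needed
        if not retryable(status):
            return status  # client error / out of quota: do not retry
        time.sleep(delay)  # jittered, capped exponential backoff
    return status
```

The same `random.uniform` trick applies to scheduled work: a 4 am sync job should fire at 4 am plus a random offset.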
Make sure to get your hands on the slides once they are published. A lot of graphs in there showing the effects of different variants of jitter and backoff code. Eye opening!
Ariel summarized the talk as follows:
- Jitter everything
- Don’t retry
- If you retry, back off
- Move control to the server
- Expose info to the server
- Use retry budgets
- Use adaptive throttling
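For the last two bullets: the Google SRE book (chapter “Handling Overload”) describes a client-side form of adaptive throttling, where clients preemptively reject requests locally with probability (requests − K · accepts) / (requests + 1). A minimal sketch of that formula (class and parameter names are mine):

```python
import random

class AdaptiveThrottle:
    """Client-side adaptive throttling, after the Google SRE book.

    Track recent requests and backend accepts; as the backend starts
    rejecting, the local reject probability rises, sparing the backend.
    """

    def __init__(self, k=2.0):
        self.k = k          # leniency multiplier (book suggests 2)
        self.requests = 0   # requests attempted by this client
        self.accepts = 0    # requests the backend accepted

    def reject_probability(self):
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def allow(self):
        """Decide locally whether to even send the request."""
        return random.random() >= self.reject_probability()

    def record(self, accepted):
        self.requests += 1
        if accepted:
            self.accepts += 1
```

While everything is healthy, requests ≈ accepts and nothing is rejected locally; once the backend starts failing, the client throttles itself without any extra roundtrips.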
Link to the talk: Safe Client Behavior
Service Monitoring Manual - 2018 Edition
Nikola Dipanov from Facebook’s Production Engineering talked about monitoring in production. First, we have to ask the right question: What to monitor? You may want to monitor different things depending on whether you are collecting data for a developer audience or for customers who are more interested in an SLA.
Levels on which data collection happens:
- Host level
- Service level
- Mesh level (referring to the service mesh, the networking layer in a sense)
- Rack/Cluster/Pod/… level (higher levels, failure domains)
Most of the talk was pretty basic. He suggested using a time series database (what else?). However, there were interesting insights into how Facebook deals with monitoring challenges. They open sourced a couple of their tools, believe in structured logs, and are able to aggregate and query structured data using an internal tool called SCUBA. 📈
My highlight of the talk: War stories from Facebook Production Engineering. But I won’t spoil those, watch the recording once it is out. 🤫
- Data hopefully becomes the lingua franca in your engineering organization.
- Monitoring should be like git: Init on project start and be there for the whole lifecycle.
- Do not wake up people for noise.
Link to the talk: Service Monitoring Manual - 2018 Edition
Doing Things the Hard Way
The more forgiving right-after-lunch time slot was taken by Chris Sinjakli from GoCardless. He did not need any forgiveness for the talk’s content which was great. But the AV wasn’t forgiving of his USB-C MacBook. I gave him my older MacBook for the presentation and used his shiny new one to take notes. (I want my old keyboard back…)
The dangers of hiring a DevOps engineer when you have an infrastructure problem: it creates a new bottleneck, as everything then goes through DevOps. Instead, make contributions to infrastructure easier, and make it obvious to developers what to change, and how, in order to modify the infrastructure. That enables developers to contribute to the infrastructure code. So when hiring someone for infrastructure, make sure they have a developer background.
Observability pays off in the longer term. It has to permeate everything you do to provide more value. Results include:
- Faster debugging
- Shorter outages
Another point I took home was: Once you change the core of your infrastructure, you may end up with an Everything project. A change that touches everything risks not changing anything at all in the end. So where to start? Stop building with the new world in mind. Build the smallest version possible.
- In reality, the hard problems are not necessarily the most important problems.
- Features are not done when shipped, but done when measured.
- The one leap into the perfect infrastructure is ludicrous.
- Do not rewrite everything from scratch.
- You won’t avoid every mistake. It’s perfectly fine to correct…
Link to the talk: Doing Things the Hard Way
Achieving Observability into Your Application with OpenCensus
OpenCensus developer and former Google SRE Emil Mikulic introduced the OpenCensus framework. My team recently started using OpenCensus in new Golang microservices and we love it. The talk was about distributed tracing, explaining traces and spans. For good propagation, you have to generate the Trace ID and Span IDs as early as possible; this metadata is then propagated using HTTP headers. (I use gRPC often and get this for free there. Can highly recommend!) One probably wants to add application-level metrics (e.g. queue lengths) to the data that comes out of OpenCensus.
If you are just starting with tracing, look into OpenCensus. I think it is the new standard, and we use it all the time on my team.
There was a cool demo. The code is on GitHub.
Link to the talk: Achieving Observability into Your Application with OpenCensus
Comprehensive Container-Based Service Monitoring with Kubernetes and ISTIO
Being a huge fan of ISTIO, I had to go to Fred Moyer’s talk about Kubernetes and ISTIO. Fred works for Circonus. Fun fact, he wrote the very first ISTIO adapter and got awarded with a ship in a bottle for that. ⛴
After a quick overview of the ISTIO components, Fred demonstrated the book shop example app. If you have, like me, played a bit with ISTIO already, this specific part of the talk will not provide too many new insights. I liked that he put the kubectl output on the slides rather than showing it in a small terminal window. That makes it more approachable to people watching the recording later.
Much has been said about the Four Golden Signals. Fred showed a different set of metrics, called RED (Rate, Errors, Duration), that can be gathered with ISTIO:
- Rate: We have the number of requests and also get the ops per second on the ISTIO standard dashboard. That was easy!
- Errors: We have the number of requests by HTTP status code. From that, we can derive the errors easily.
- Duration: The best approximation may be the request duration percentiles. However, there are some dangers to that. They are an aggregated metric and may hide some bad tail.
The way to go for measuring durations may be the histogram. Histograms make some effects visible that would be hidden by percentiles. Also use heatmaps, of course. I love heatmaps! I learned that writing custom metrics adapters for ISTIO is not very hard.
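A toy example (made-up numbers, not from the talk) of why aggregated percentiles can hide a bad tail: averaging per-server percentiles loses information that merging the underlying data — which is what histograms let you do — preserves:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    s = sorted(samples)
    idx = max(0, int(round(p / 100 * len(s))) - 1)
    return s[idx]

# Two servers with very different latency distributions (ms, invented):
fast = [10] * 99 + [1000]       # one extreme outlier
slow = [10] * 50 + [500] * 50   # half the requests are slow

# Wrong: averaging per-server percentiles hides the real tail.
avg_of_p95 = (percentile(fast, 95) + percentile(slow, 95)) / 2

# Right: merge the raw data (or histograms), then compute the percentile once.
merged_p95 = percentile(fast + slow, 95)
```

Here the averaged p95 comes out far below the true p95 of the combined traffic — percentiles are an output, not an input.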
Fun story: With the metrics that ISTIO provides, we can measure the number of rage clicks (user-induced retries). An indirect indicator of customer satisfaction. 😂
If you deal with SLIs or SLOs, you want to watch this talk. Highly recommended!
- Percentiles are an output, not an input!
- If you work with percentiles as SLI, ask yourself: Can you do better?
- The code is hosted on Microsoft… opens github.com Good one! 🙃
- Monitor services, not containers!
Link to the talk: Comprehensive Container-Based Service Monitoring with Kubernetes and ISTIO
Randomized Load Balancing, Caching, and Big-O-Math
Julius Plenz from Google started by letting us know that he won’t do the hard math on the slides but rather use visualizations. Very much appreciated! He began with bins of servers receiving requests. With random load balancing, those requests are not uniformly distributed. From that we derive a metric called the peak-to-average ratio. We have to provision for peak load, so the natural thing to do is to reduce the peak-to-average ratio.
We can, with a high probability, predict the peak value for a server. One way to reduce the peak-to-average ratio is to scale vertically instead of horizontally. That’s not always possible, though. When you scale horizontally, the peak-to-average ratio becomes statistically worse. Typical peak-to-average ratios range from 1.25 to 1.4. The more you scale up your systems, the worse it gets (if you provision for peak load).
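A quick simulation (my own toy model, not from the talk) makes the effect visible: randomly balancing the same total traffic over more, smaller servers worsens the peak-to-average ratio:

```python
import random

def peak_to_average(requests, servers, seed=42):
    """Randomly assign requests to servers; return peak load / average load."""
    rng = random.Random(seed)
    load = [0] * servers
    for _ in range(requests):
        load[rng.randrange(servers)] += 1
    return max(load) / (requests / servers)

# Same total traffic, different shard sizes:
few = peak_to_average(100_000, 10)      # a few big servers: ratio near 1
many = peak_to_average(100_000, 1_000)  # many small servers: notably worse
```

Provisioning for peak means paying for that ratio on every server, which is exactly why the talk recommends against scaling instances 1:1 with traffic.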
- Math to the rescue!
- Don’t scale instances with traffic 1:1.
- Moore’s law makes non-linear scaling more affordable over time.
- Randomized load balancing is good if you have many things.
- Randomized load balancing becomes worse if you scale your system in the wrong way.
- Pay attention to the size of the (frontend) cache.
From the Q&A:
- Usually, we cannot scale sublinearly. So the question is not how to scale sublinearly, but how to design the system so that it does not scale too far above linear.
- There are better load balancing strategies than randomization. However, beware of feedback loops! This is, in the end, an engineering question: are you willing to spend another roundtrip to learn about a server’s load before sending a request there?
Link to the talk: Randomized Load Balancing, Caching, and Big-O-Math
Cultural Nuance and Effective Collaboration for Multicultural Teams
Another talk I was super excited about. I spent the better part of my career in the military. While I learned some unique crisis-solving skills there, working in a multicultural team was not a strong focus in that environment. Unless you consider the western-dominated NATO a multicultural institution.
Ayyappadas Ravindran from LinkedIn presented three stories of intercultural experiences from his career. I am going to spoil one of them and leave the other two for the interested reader to check out by watching the recording.
Ayyappadas had his first one-on-one with his manager via phone. His manager always asked “What do you mean?” when they talked. That made him feel insulted. Was the manager thinking he was not capable of understanding what he was talking about? All of that was perceived as rude by Ayyappadas. When he met his manager in person, however, he learned that his manager was a really nice person and held a very high opinion of Ayyappadas. How come? The key is the cultural difference here: His manager, coming from a low context culture, really wanted to know what Ayyappadas meant when he asked his question. But Ayyappadas, coming from a high context culture, interpreted the question and understood it in a very different way.
I can highly recommend this talk!
- Look for what people mean and not what people say.
- When in doubt, ask and do not assume.
Link to the talk: Cultural Nuance and Effective Collaboration for Multicultural Teams
This is my personal summary which comes without further explanation. Think of this as a note to myself that accidentally went public:
- Develop SRE to become a partner in crime with the devs, not police their code. Kind of hard when the code is barely production ready. No one said it will be easy, right?
- ISTIO and OpenCensus are the way to go. I’m glad we are already on it and gaining experience with those frameworks.
- Really cool how the community builds these flexible frameworks (Kubernetes, ISTIO, OpenCensus) which are inclusive of all kinds of underlying systems and connected TSDBs and log storage systems.
- Histograms! We need more histograms! Don’t be afraid of non-uniform bin sizes.
- Lee Kuan Yew, the founding father of Singapore, once claimed that air conditioning enabled Singapore’s success as much as multicultural tolerance. But does that mean every room must be chilled down that much? I did not pack warm clothes, but I wish I had. I think it is freezing cold in the conference rooms. ⛄️
The coffee is still great and still much needed. ☕️ An evil twist by the kitchen crew: They swapped the places of the tea and the coffee dispensers. 😳 A lot of jumping between rooms today, because the most interesting talks always happen in the other room, right? Keeps the blood flowing and, after all, I am in the fitness industry.
Automatic Data Center and Service Deployments Based on Capacity Planning Artifacts
Xiaoxiang Jian from the Alibaba Group reported on the difficulties they faced with deploying data centers, including:
- Capacity planning
- Server planning
- Network cabling
- Bootstrapping of network, operating systems, and services
To give you some scale, we are talking about deploying tens to hundreds of data centers per month. According to Xiaoxiang this later went up to tens to hundreds of data centers per week. Impressive. Where are all these data centers?
Their solution to the problem was defining artifacts and thus building an immutable infrastructure. We see this pattern all over the industry: Definition wins. It enables rollbacks, version controlled updates, and configuration lockdowns. Think of Kubernetes for data centers. Terraform comes to mind. But this goes a little bit deeper down the stack. Alibaba is baking images that end up on bare metal. They treat hardware as a black box whose final state is defined by software. We’ve done similar things in the old days by leveraging PXE boot and pre-baked images, but back then we would have deployed the configuration via Puppet or Chef. Alibaba does the configuration upfront and delivers the final image to the hardware. Kind of nice, isn’t it?
The interesting part now is, how do you define a data center? They are using a two-phase approach:
- Business planning: Categorize the services, plan capacity based on needs, plan the network
- Delivery planning: Generate the network and operating system configurations
The artifacts then are:
- Product: The final delivery for the business
- Service: A software concept deployed on a cluster
- Application: The real thing, runs as a process on a server
In practice, this could be a shop (product) that has a website (service) that runs on Tomcat (application). The same principle can be applied to heavier products, such as Elastic Computing or Block Storage. Obviously, this means defining a lot of dependencies, such as particular node capabilities. You don’t want to run compute on a machine designed for storage. I’m probably wrong here, but I asked myself whether this doesn’t bring dependency hell to the hardware world, and whether that is a good or a bad thing. 🤔
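Here is how I picture such planning artifacts — every name and field below is hypothetical, purely to illustrate the product/service/application hierarchy and node-capability constraints:

```python
from dataclasses import dataclass, field

@dataclass
class Application:
    """The real thing: runs as a process on a server."""
    name: str
    needs: set = field(default_factory=set)  # required node capabilities

@dataclass
class Service:
    """A software concept deployed on a cluster."""
    name: str
    apps: list

@dataclass
class Product:
    """The final delivery for the business."""
    name: str
    services: list

def placeable(app, node_capabilities):
    """Compute must not land on storage nodes: check the capability subset."""
    return app.needs <= node_capabilities

# A shop (product) with a website (service) running on Tomcat (application):
shop = Product("shop", [Service("website", [Application("tomcat", {"compute"})])])
```

Resolving all `needs` against the planned hardware is where the dependency-hell question from above starts to bite.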
The results, however, are quite impressive: Datacenter bootstrapping went down to 24 hours from previously 2 months (for most of the data centers).
Link to the talk: Automatic Data Center and Service Deployments Based on Capacity Planning Artifacts
Ensuring Reliability of High-Performance Applications
Anoop Nayak from LinkedIn started his talk with some interesting data on the status of the Internet in India:
- 79% of users access the Internet through a mobile device
- 85% use Android
- 75% use Google Chrome
The 99th percentile of LinkedIn page load time in India was approximately 24 seconds. They were aiming for 6 seconds. Their approach was to create a LinkedIn Lite website. Actions to get there included:
- Reduce the size of the mobile page, targeting a size under 75 KB
- No client-side frameworks, to speed up page paint
- Avoid client-side redirects, which account for approximately 2 seconds each on a slow network
- Leverage Server-Side Rendering (only send the necessary HTML content to the client)
- Early flushing sends HTTP headers to the client while the server is still rendering the HTTP body
In the end, LinkedIn Lite is an app of about 1 MB in size that wraps a web browser tab. Now, how to monitor that? A lot of monitoring is happening on the server side. Additionally, some metrics, such as client-side load times, can be extracted from the DOM. A few more metrics come from a small, custom library.
Another cool thing that they did is using service workers. Service workers are like background threads, but for the web. This can make a website feel like a native app. A word of caution: a service worker running wild can render the whole app useless. So having a kill switch for service workers is essential. LinkedIn controlled service worker behavior by setting the Cache-Control HTTP header to private, max-age=0. This forces the service worker to throw away the cache, which can otherwise persist for up to 24 hours.
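A server-side sketch of that kill switch (the header value is from the talk; the function itself is my own illustration):

```python
def cache_headers(kill_switch_on):
    """Pick the Cache-Control header for service-worker-cached responses.

    Serving `private, max-age=0` forces the service worker to throw away
    its cache, which could otherwise persist for up to 24 hours.
    """
    if kill_switch_on:
        return {"Cache-Control": "private, max-age=0"}   # kill switch engaged
    return {"Cache-Control": "private, max-age=86400"}   # normal: cache a day
```

Flipping one server-side flag is a lot faster than waiting for every broken client to recover on its own.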
- Status codes 3xx, 4xx, 5xx are all there. Monitor them!
- Page load times must be tracked and monitored.
- Service workers need a kill switch.
- Web views can break. All the phones need testing.
Link to the talk: Ensuring Reliability of High-Performance Applications
Debugging at Scale - Going from Single Box to Production
Kumar Srinivasamurthy from Microsoft (Bing and Cortana Engineering) started off with the history of the computer bug and then quickly went to current tracing tools like Zipkin and BPerf.
And then he showed some cool ideas:
- Use machine learning classifiers to analyze log data to find negative wording. E.g. sentences like “action took too long” indicate a problem.
- Anomaly detection at scale.
- Near real-time metrics to detect problems earlier.
- Strip personal data from log files. Did someone say GDPR?
On a hack day at Microsoft, they created a Hololens SRE tool (all in prototype stage). You need to see this! We had some good laughs in the audience. Yes, I can imagine doing my job with a Hololens one day. Cool thing!
Link to the talk: Debugging at Scale - Going from Single Box to Production
Productionizing Machine-Learning Services: Lessons from Google SRE
Google SREs still fill the rooms, even if it is after lunch. Salim Virji and Carlos Villavieja shared their lessons learned when they wanted to apply machine learning to production.
Machine Learning is good for everything, except when:
- There is no fallback plan
- There is not enough labeled data
- One requires microsecond reaction time
Unsurprisingly, machine learning is used in almost every Google product. One of their most important models is the YouTube video recommendation model, which comes with its own challenges, such as seasonal peaks of topics (Super Bowl), spam videos slipping into the training data, and regional popularity of videos.
Is an ML model just another data pipeline? Can we just run it like any other pipeline? Unfortunately, the answer is no.
Training and data quality: SREs run the models and the training in production because training is part of the production lifecycle of the model. New data comes in all the time and models need to evolve fast. Since data quality is essential, SREs have to filter and impute data to avoid spam and overfitting. Snapshotting the model and warm starting help to deal with varying compute resources. When input data pipelines are not balanced, e.g. due to an outage in a region, the model may develop a cultural bias towards the other regions. Google also leverages parallel training and then decides which output model to put into production. 🧐
Allocation of hardware resources (GPU, TPU): Google produces a new TPU version every year. Nevertheless, the cost of training grows at a higher rate than production resources. Currently, there is a lack of reliable multi-tenancy in parts of the training infrastructure (if I understood that right?). Models are tested with the same binaries as in production, but there are still canaries. Canaries ensure that the new model behaves similarly to the old model; a completely different behavior would indicate a problem. Models then get signed before they end up in production. That’s cool!
Models come with their own set of problems. For example, if a model features new labels, you cannot roll that back. The only way is to re-train the old model with the new labels. Not being able to roll back makes on-call life significantly harder, I assume. 🤨
War story: A particular demographic reacted differently to a new model compared to the old one. Fewer clicks (loss of revenue) was the result. The issue was solved by monitoring models and alerting in cases like this. I wish the speakers had gone into more detail here. Sounds very interesting.
According to the speakers, ethics in machine learning is the big elephant in the room. SREs at Google are able to stop machine learning predictions when a model behaves unethically. Experts also call for independent oversight; the AI Now Institute does that, for example. Running predictions in an ethical and fair way is very important to SRE, which means SREs must always be able to stop any prediction that runs havoc. Essentially, SREs must be root on any model. The fact that models are signed before they go into production hints at how important (or how advanced) ML models are at Google. 🤖
Fun fact: There is a YouTube model that is over 1TB in size. Woah! 🤯
From the Q&A: How to start with SRE for ML models?
- Start with a very small model
- Have the model spend a long time in canary
- Have a data scientist ready when the model goes into production for the first time
- Have a rollback plan that includes entirely removing the model from production
Link to the talk: Productionizing Machine-Learning Services: Lessons from Google SRE
How to Serve and Protect (with Client Isolation)
From the Google Maps SRE team, Frances Johnson reported about client isolation. Maps has a lot of customers, internal and external. The Google Assistant, for example, is an internal customer of Maps, while me using my phone counts as an external request. Unexpectedly, Maps had monoliths and overload situations. Something I am way too familiar with in my current job.
Goals of the client isolation initiative:
- Clients should not be able to hurt others
- Gracefully degrade the service in an overload situation
- If you have to drop queries, be smart about which ones
Strategies they came up with:
- Caching: Cached queries are cheap, adding caching is often easy, but not possible for all queries (billing, strict consistency requirements, …)
- Quotas: Fun fact about quotas: Clients think a quota is a guarantee. Services think it is an upper limit. This can lead to over-subscription. They observed 7x over-subscription at some point in time.
- Load Shedding: Not all traffic is created equal. Background and batch jobs are less critical than a waiting user’s request. Always drop the lower priority requests first.
- Microservices: Not their largest problem, but they split up their monoliths.
- Separate Stacks: Maximum isolation. Everyone gets their own. Doesn’t scale too well and produces quite some toil.
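The load-shedding rule — always drop the lower-priority requests first — can be sketched with a small priority queue (my own illustration, not code from the talk):

```python
import heapq

class Shedder:
    """Admit at most `capacity` in-flight requests; when full, shed the
    lowest-priority one first (higher number = more important)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inflight = []  # min-heap of (priority, request)

    def admit(self, priority, request):
        """Return the request that got shed, or None if nothing was dropped."""
        if len(self.inflight) < self.capacity:
            heapq.heappush(self.inflight, (priority, request))
            return None  # admitted, nothing dropped
        if self.inflight[0][0] < priority:
            # New request outranks the least important in-flight one.
            dropped = heapq.heappushpop(self.inflight, (priority, request))
            return dropped[1]
        return request  # new request is the least important: reject it
```

Background and batch jobs get low numbers, waiting users get high ones — so under overload the interactive traffic survives.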
- If everything is the highest priority, then nothing is.
- We put it there because we did not want to write another service and hand it over to SRE. (Teams on why they put more stuff into a monolith)
- Can you just exempt my super-important client? (Users on client isolation)
- A client big enough to ask for an exemption from client isolation is probably also big enough to damage your service. (So, the answer must always be: No exemption!)
- Understand your queries and prioritize them accordingly!
Link to the talk: How to Serve and Protect (with Client Isolation)
A Tale of One Billion Time Series
Ruiyao Yao from Baidu talked about the monitoring systems in use at Baidu and the challenges they faced with Time Series Databases (TSDB). He started off with an example that I had a hard time following, but the money quote is: When you cannot reach www.baidu.com in China, your home network is broken!
If I understood Ruiyao correctly, Baidu is aiming for always up and that is the reason they invest so much in their monitoring. Monitoring data is furthermore used for capacity planning and troubleshooting. How much data are we talking about?
- Millions of targets
- 600+ metrics per target
- 1B time series in total
- 50TB written per day
- 10M read/write requests per second
- 40M data points in per second
- 60M data points out per second
- 50Gbps write and 100Gbps read
Writing time series is based on a log-structured merge tree. On top of that, they are using tables with a time to live (TTL) to expire data. There are data tables and metadata tables. Tags and indexes end up in the metadata table. The on-disk layout reminds me a bit of SSTables, but maybe I got the speaker wrong here. I had a hard time following the content at some points. Here is a slide with the layout.
I liked it once I understood why query latency is so important to Baidu: They run anomaly detection in real time based on metrics stored in the TSDB. For low-resolution data, such as trends over an hour, they pre-aggregate the data online in the TSDB, a technique called multi-level down-sampling. Down-sampled data is of smaller volume and can be stored longer, another benefit of down-sampling. If that wasn’t enough, here is another optimization: Users can define key metrics and the system also identifies so-called hot metrics. These metrics are then cached in Redis for even faster access.
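Multi-level down-sampling in miniature (my own sketch of the idea, not Baidu’s code): aggregate raw points into coarser buckets that are cheaper to store and query over long ranges:

```python
def downsample(points, factor):
    """Average each run of `factor` consecutive points into one point."""
    return [
        sum(points[i:i + factor]) / len(points[i:i + factor])
        for i in range(0, len(points), factor)
    ]

# One data point per minute, down-sampled to hourly resolution:
per_minute = list(range(120))         # two hours of fake samples
per_hour = downsample(per_minute, 60)
```

Each level feeds the next — minutely into hourly, hourly into daily — so long-range queries never have to touch the raw series.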
The on-disk data format typically causes many disk seeks. That was partly solved by compacting the files. As files grow too large, they are split and distributed to different systems, so reading can happen in parallel on multiple systems. Nevertheless, compactions are still expensive. Thanks to the JVM, additional fun is provided by stop-the-world events. (I have a very difficult relationship with the JVM 😇)
This talk was really interesting, digging into deeper engineering problems as it progressed.
- There are latency-sensitive and latency-insensitive queries to the TSDB. Treat them differently to optimize for each type of query.
- People like to query over all the hosts or the whole year.
- Unfortunately, our TSDB stack uses Java as the main language. 😎
Link to the talk: A Tale of One Billion Time Series
Isolation without Containers
Tyler McMullen from Fastly shared his thoughts on isolation on bare metal. From the broader topic of general isolation, we quickly went via Fault Detection, Isolation (from a control engineering perspective), and Recovery into how processes are managed by the kernel. A process is basically memory (containing data and code) plus metadata in the kernel (think Linux task_struct). Unsurprisingly, containers are just processes with resource isolation applied via namespaces.
And here is why Fastly is interested in this: High performance systems with many small tenants and strict latency requirements may find VMs, containers, and even processes all far too heavyweight.
To achieve isolation without using all these technologies, one just has to make sure the control flow and the data are understood and cannot run havoc. Easy, right? Interestingly, WebAssembly meets these requirements, as it is bounds-checked and cannot load random libraries (contrary to dlopen()). Because we ran out of time, Tyler had to skip over the most interesting part of the talk. But essentially, Fastly found a way to compile multi-tenant WebAssembly code into a process that can run safely on bare metal. Wrap your head around that! I enjoyed this talk very much as it was more low-level than the other talks that day. I love low-level talks.
- Fault isolation is really about reducing the set of possible faults to a knowable, recoverable set.
- Everything in the memory is basically the wild west. 🤠
Link to the talk: Isolation without Containers
This is my personal summary which comes without further explanation. Think of this as a note to myself that accidentally went public:
- Defined state wins over doing something.
- One can build fast websites and apps for slow networks, but trade-offs must be made.
- SRE’ing machine learning models is a whole field of its own. Highly interesting, I want to learn that! Comes with high responsibility, which is something I like.
- Client isolation is important. Luckily there are plenty of techniques and strategies. But implementing only a single one is never enough.
- Scaling a TSDB isn’t trivial but comes with a ton of interesting problems.
- WebAssembly is awesome. And safely abusable. 😅
Also, this happened: I won a selfie drone at a raffle at the LinkedIn booth. Thank you so much, guys! I love it!
Last day of SREcon Asia 2018. Did I mention how important coffee is to get me started on a conference day? No? Coffee is much-needed. And the coffee here is great. ☕️😜
A word of warning: Today’s talks were packed with information. The report got very long. I apologize in advance!
Interviewing for Systems Design Skills
Sebastian Kirsch from Google Switzerland has conducted over 200 interviews, was a member of the hiring committee, and trained engineers in interviewing. We can check off credibility at this point, I think. When they started hiring SREs in the early days of that role, there was no job description because the job did not exist before. What they expect from candidates:
- Design a system for us in the interview
- Be aware of performance needs, resource needs, and bottlenecks
- Design from scratch for problems that have not been solved or existed before
- They need to know why each selected component is a good fit
Other system aspects in regards to manageability and lifecycle:
- How to maintain the components
- How to scale the system
- How to upgrade the systems (versioning)
- How to migrate data or the system
In an interview, your time is limited, so they can't go into too much detail and complexity. But there is a trick: build a simple system and scale it up. Even simple systems become complex within minutes just by scaling them up. That can be used to talk about complexity even in time-bounded interviews.
Example question: Design a system to copy a file to some machines. Easy, right? Just do a `for` loop and `rsync` the file to each target system. Now, scale! Ludicrous requirements are allowed:
- File is 100GB in size
- Once a day
- 100k target machines
- Machines are placed on the moon (uh, latency!)
- Source machine has a 100Mbit NIC (and there goes sequential file transferring down the drain. Multicast, anyone?)
Sebastian didn't solve the puzzle in his talk, as it was only a demonstration of how easy it is to scale a simple question into a complex engineering challenge.
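For fun, here is the back-of-envelope maths for that question in Python (the fan-out strategy and all numbers beyond the stated requirements are my own assumptions):

```python
import math

FILE_BYTES = 100 * 10**9            # 100GB, once a day
MACHINES = 100_000
NIC_BYTES_PER_S = 100 * 10**6 / 8   # 100Mbit NIC ≈ 12.5 MB/s

# Naive "for loop + rsync": the source pushes every copy itself.
sequential_s = MACHINES * FILE_BYTES / NIC_BYTES_PER_S
print(f"sequential: ~{sequential_s / (86_400 * 365):.0f} years")  # ~25 years

# Tree fan-out: every machine that already has the file re-serves it,
# so the number of copies doubles each round (idealized, ignoring the moon).
rounds = math.ceil(math.log2(MACHINES))       # 17 doubling rounds
tree_s = rounds * FILE_BYTES / NIC_BYTES_PER_S
print(f"tree fan-out: ~{tree_s / 3600:.0f} hours")                # ~38 hours
```

Even the idealized tree needs more than a day, so the "once a day" requirement already fails: exactly the kind of bottleneck discussion the question is designed to trigger.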
How would one design such a question? Start with a very simple problem: copy something, process some data, pass a message. It should be something from an area you are familiar with. And then just scale up. Add a crazy requirement so that the problem needs at least a hundred thousand machines to be solved. The next step is to find the bottlenecks in the system. Typical bottlenecks are disk IO, network bandwidth, and disk seek times. A good question has a couple of bottlenecks, and ideally it should not be clear from the beginning which dimension the bottleneck is in. Fun fact: designing these questions becomes harder over time, as hardware contains more magic nowadays. Compared to the past, current computers are supercomputers, with tons of cores and unreasonable amounts of memory. A lot of problems fit into the memory of a 96-core, 1.5TB-RAM machine, and that is something you can just click together on Google Cloud Platform. Oh, and disk seeks? You can buy guaranteed IOPS on SSDs in the cloud. Snap! Hardware takes all the fun out of engineering problems. Sebastian suggested just making the problem bigger (oh, did I mention that we want to serve all humans and extraterrestrials? Design for that!). The other option the interviewer has is to limit the candidate's choice of machines. I like the first option better. We have IPv6, so multi-planet distributed systems should at least be addressable by now.
Another challenge for the interviewer is that at some point maths (e.g. total bandwidth calculations) is going to happen. That has to be checked while the interviewer takes notes and already thinks about the next steps. Solution: have a cheat sheet with pre-calculated values for the anticipated solutions. Also on the cheat sheet: possible follow-up questions.
Practice: Never go into an interview with a question that you have not tested. Testing on an external candidate is neither fair nor does it give the required amount of signal. The first iteration is probably way too simple or way too complex, and the problem description could be misleading. Practice can iron out all of these problems and calibrate your expectations, so test the questions with your coworkers. Highly relatable! I remember how surprised I was at how long it takes to solve one of our coding questions when you have never seen it before. I once spent an hour with a candidate sorting an array and still got a ton of signal out of that.
The hiring bar: The problem is not really whether a candidate is below or above the hiring bar. The problem is that we do not know what the candidate's skill level really is. So the hiring bar problem is more of an uncertainty problem: there is a huge error bar or grey area, and a lot of candidates will touch the hiring bar at some point. The interviewer's job is to decrease uncertainty and become certain whether a candidate's skill levels are below or above the hiring bar. Essentially, it is a classification problem with a lot of uncertainty.
Interview time must be used to decrease uncertainty. Everything else is just time wasted. To gain as much signal as possible, start with making clear what the expectations are. This includes technical requirements. Furthermore, it includes making clear that you expect reasoning about choices and questions from the candidate to the interviewer. Careful steering enables the interviewer to use the time more effectively.
Hints from the practice:
- Many candidates forget to provide concrete resource estimates (e.g. number of drives needed). Question: What goes on the purchase order for this system?
- Sometimes there are no clear boundaries between the systems and how they work together. Question: You have N teams of engineers that are going to implement the system. What entities should each team work on?
- Candidates may get stuck in precise arithmetic. A hint could be to offer rounded values: a day consists of N seconds; use that value to proceed. In the end, it is all about orders of magnitude.
- The magic bullet: Candidates use a standard technology because they think we are asking for knowledge. Question: How does this technology work?
How do we determine how well a candidate did?
Go into the interview with expectations and compare the results against them. Have expectations for each axis or dimension you are looking for. Does the designed system solve the problem? (Often, the system solves a different problem. Woot?) Good candidates trade resource dimensions off against each other, e.g. using more CPU time to reduce network bandwidth. Did the candidate think about a possible SLO?
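To make that CPU-for-bandwidth trade concrete, here is a toy calculation (all numbers are assumptions I picked for illustration):

```python
# Is it worth compressing a payload before sending it over a slow link?
PAYLOAD_MB = 1000        # 1GB of, say, text logs
LINK_MB_S = 12.5         # 100Mbit/s link
COMPRESS_MB_S = 50       # assumed single-core compressor throughput
RATIO = 10               # assumed compression ratio for text logs

raw_s = PAYLOAD_MB / LINK_MB_S
compressed_s = PAYLOAD_MB / COMPRESS_MB_S + (PAYLOAD_MB / RATIO) / LINK_MB_S

print(f"send raw:            {raw_s:.0f}s")         # 80s
print(f"compress, then send: {compressed_s:.0f}s")  # 28s
```

Burning about 20 seconds of CPU saves about 52 seconds of wall clock here; whether the trade wins depends entirely on the assumed numbers, which is exactly the reasoning an interviewer wants to hear out loud.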
(Note: No photo here. Projector was broken and Sebastian ran the whole thing without slides. Playing it cool. Well done, sir!)
From the Q&A: Shall I read previous interview feedback on that candidate before going into the interview? Better not. It introduces an unfair bias and gives the first interviewers more weight in the process. At Google they do not propagate the actual feedback, but the topics, so that they minimize overlap without biasing the result.
Link to the talk: Interviewing for Systems Design Skills
Scaling Yourself for Managing Distributed Teams Delivering Reliable Services
Paul Greig is an SRE lead at Atlassian and talked about the challenges one faces in managing a distributed team. We are talking about 18 SREs and two team leads, managing two services distributed across three geographic regions.
He started with the question of what a distributed team is. His definition ranges from a single worker somewhere in the world to teams where everyone is in a different place, collaborating over the Internet.
Benefits of a distributed team:
- Better geographic coverage
- Talent can be hired regardless of where the talent wants to live
- Colocation: The SREs can be closer to the developers (Not sure I got this one correctly. I’d say it often is the opposite?!)
- Better on-call, because time zones.
Challenges of a distributed team:
- Friction and lack of cohesion: The divide leads to absence of communication sometimes.
- Duplication: When the right hand doesn’t know what the left hand does, work is often duplicated.
- Costs: Distributed teams are actually expensive. The impact may be a little lower than that of a team sitting in the same room.
- Imbalance: A day in the life of one engineer can be very different from that of another. Paul visited his team members and found that they work very differently.
Now, how to scale yourself (as a leader) in a distributed environment? Three aspects:
Establish trust. "This is a huge one!" Paul said. By being present and asking questions (which includes active listening), he built trust. Jumping straight to solving problems without establishing trust first doesn't work. As leading a distributed team involves a lot of travel, you need to plan ahead to keep work and life in a healthy balance. Also, engineers sometimes like to travel to the HQ to mix and mingle. Schedule one-on-ones at a time that suits both the leader and the team member, who might be in a different time zone. Example: "After I sent the kids to sleep, I spent two hours being present (online) for the team in a different time zone." Allow teams some time to think about a problem and get to results; there is a time gap, naturally. However, at some point, ask them to wrap an issue up. It's mostly about reasonable expectations. Participate rather than rushing the process.
Local team vs. remote team: When working with the local team the whole day, how can I make sure that I give the same amount of attention to remote team members? Plan for delays in pull requests: the velocity of big projects can slow down if you are waiting for feedback from a different time zone (and that time zone is currently asleep). Is separating projects and assigning them to different time zones a solution? Not really! So in the retro they introduced end-of-quarter demos: regardless of the project state, there had to be a demo. That gave the teams a common goal, brought velocity back into the project, and kept everyone on the same page.
Maintain balance for yourself, but also for your team. The team will only remain successful if they take care of their individual balance as well. This obviously includes work-life balance. Balance three aspects 👩🤚❤️: Head (mental), Hands (practical), and Heart (emotional).
- We saw disheartening duplicate solutions to the same problem. Great for redundancy, but expensive.
- Inspiration: I want to see a thirst from the engineers of the team to arrive at a fantastic outcome.
- We did not have these "sitting together at lunch time" sort of moments. So we played Keep Talking and Nobody Explodes. (Also an awesome incident-management communication training!)
- I am not saying video games are the solution to every problem, but… 😂
- Listen, Ask, Tell
Paul published his Talk Resources
Link to the talk: Scaling Yourself for Managing Distributed Teams Delivering Reliable Services
Mentoring: A Newcomer’s Perspective
Leoren Tanyag from REA Group grew up with little exposure to technology, which made her want to go into exactly this industry. She went through the graduate/internship program of REA and shared her experiences.
REA acknowledges that school is different and an un-mentored environment is not something they want their people to be exposed to. They communicate expectations on what to learn and where to start. Of course the mentee has a say in driving the direction. Emphasis is on pairing with co-workers.
Concept of the three sixes:
- On your first 6 days: I don’t know everything
- On your first 6 weeks: I know what I’m doing. No, you don’t!
- On your first 6 months: I may know what I need to know, now I have to continually improve.
Interestingly: Even after ten years people often look up to their mentors.
Leoren reports that mentees' expectations are often not very high; they are so happy that they can learn from someone they respect. So there is no excuse for an experienced engineer not to mentor. Even if you are not very experienced, there is something you can share and a mentee can learn from you. I cannot agree more. I personally expect knowledge sharing from every engineer in my team. Mentoring is one form of that, and it is everyone's responsibility to make every other team member succeed.
Forms of mentoring (different in structure and frequency):
- One-on-one mentoring: Highly tailored to a mentee's needs
- Group Learning sessions: More efficient if multiple mentees share the same interest (e.g. learning a new technology). Also, some mentees feel less pressure in a group and enjoy having a group to talk to afterwards. Interesting aspect I have not thought about before!
- Casual pairing: Supports bonding and transfers hands-on knowledge. Communication-heavy, obviously.
- Good Mentors come from those that have been mentored before. Start the trend!
- Teaching is the best way to learn.
- Mentoring can help us grow (mentors become role models and may hold themselves to higher standards once they realize that)
From the Q&A: How should mentoring relationships be started? Driven by the mentee or the mentor?
Leoren: We have a queue of people who agree to volunteer time. We use vouchers to get people in touch with each other.
Link to the talk: Mentoring: A Newcomer’s Perspective
(Unrelated side note: Was this Comic Sans in the slides? Looks like Comic Sans but then it also looks different. Why do I care about this at all? Why are fonts such an interesting thing?)
Blame. Language. Sharing: Three Tips for Learning from Incidents in Your Organization
Lindsay Holmwood from Envato started with a story from when he was a teenager and was challenged with a serious illness. From there, it went up and down (story-wise), covering topics like the emergency room, cancer, surgical material left behind in the body, and open-heart surgery. A mistake was made in the treatment, and it was made sure that the person who made it would not practice medicine anymore.
Years later, when he was running a cancer-related charity, he accidentally took down the website in the middle of one of the most important campaigns. He did not receive any punishment for that.
Comparing these two stories, he realized how differently organisations respond to failure and errors. Some look for someone to blame, some don't. The narrative for safety in the first story is the absence of failure (e.g. in process, technology, people). If you think this through, it maintains a culture where any of these three can be a danger; people are a hazard to safety. Not necessarily a nice environment to work in, right? This all refers to a bi-modal operation mode: it is either safe, or it is not working at all.
In our industry, things are a bit different. Our systems are so large and complex that they are always running in a partially degraded state. But they still work, often still generating revenue, even though they are degraded. That asks for a different culture. We have to embrace different perspectives; even contradictory statements (think: views) about the state of a system can both be true at the same time.
The people doing the frontline work, in our culture, focus on quality and delivery. But they do not focus on covering their mistakes.
Lindsay then identified the three aspects of our culture he thinks are most important:
Why is a good word to use for talking about systems: Why did that action take so long? But it is a bad word to use in regard to people, as it often carries some blame: Why didn't you follow the runbooks? Be careful using why; you do not want to question someone's personality, right? How is better, but may limit the scope too much: How did that happen? What ties in better with local rationality. The latter is a concept that assumes people make trade-offs based on the limited information they have at the given time, and with good intentions. Think about someone moving forward in a dark tunnel with a small flashlight: you never see the whole picture, only what you point the flashlight at. In hindsight, of course, things are always different. So what asks what the tunnel looked like at the time you were inside it.
Thinking of people as hazards ignores one truth: sometimes bad things happen and no one is to blame. Things go right more often than they go wrong. Finger-pointing is basically a cognitive bias, something we have little control over. Our brain (perception) constantly makes trade-offs between timeliness and accuracy when processing information, which often brings problem solving down to heuristics. Lots of psychology here! He played a classic in confirmation bias: the monkey. Knowing that, think about how well we will do during an incident. Hooray for those who are not biased. Spoiler: no one is immune to bias. Another thing to watch out for: in hindsight we tend to assign a failure to a person, not a system. The more negative the outcome, the more biased we are towards blaming a person, not the system.
Sharing (This part was so interesting, I forgot to take notes. Sorry!)
My takeaway: As a leader in SRE, it is your responsibility to create a psychological safe environment. That is quite in line with my personal beliefs and also what I just recently read in Leaders Eat Last. Is this bias in action? 🤔Oh, this is becoming meta now!🤪[But seriously, read the book and learn how we (ex-)military leaders are wired up internally and how we build teams where people are willing to risk their lives for each other!]
This is quite a long writeup of the talk, but, it is only a part of it. There was so much information packed into that hour. If you want to get the whole picture, bookmark the talk and wait for the video to be released. There you can also learn how Lindsay’s teenage story about life and death ended. I won’t spoil it here. (Ok, you obviously figured out that he survived. But the details matter!)
Link to the talk: Blame. Language. Sharing: Three Tips for Learning from Incidents in Your Organization
A Theory and Practice of Alerting with Service Level Objectives
Yesterday evening, at the reception, I asked Jamie Wilkinson (Google) a ton of questions related to a tricky SLO measurement problem we currently face at work. Every now and then he would refer to his upcoming talk, so I went into this one with expectations set high.
He explained the differences between alerting on causes and alerting on symptoms. While it is convenient to know exactly what went wrong (cause), it is much better to alert on symptoms. Why? Because it results in fewer alerts and avoids alert fatigue much better. Every alert disturbs a person's life when they are on-call. And we care about people more than we care about a dying disk in the end, right?
Focus on alerting on a very small set of expectations that matter for the users of the service. Do not focus on alerting on all the nitty-gritty details, timeouts, and other metrics that may cause a problem.
How do we distinguish between a symptom and a cause? One rule is to ask yourself: does this affect the user's experience? For example, if you have reasonable latency, who cares about the depth of a queue in a backend system? No need to alert on that; we can solve it during office hours. Or maybe decide not to touch it at all, as it might just be fine.
There was also some discussion about where to measure time series and what kind of time series. For example, it is probably easy to increase request counters within the application. However, measuring at the load balancer also lets us catch crashing services as they become unavailable. I run a service at work where we use both, mostly because we agreed to an unfortunate SLO that turned out to be extremely tricky to measure with the given tools. Even Matt suggested that we may be up for an adventure trying to visualize that specific SLO using the current feature set of Stackdriver. Setting SLOs is hard…
Back to the talk: Jamie introduced the concept of the SLO burn rate, which I like very much. Alerting on the burn rate is something I will definitely bring back to my team in Munich.
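Here is a minimal sketch of the burn-rate idea as I understood it (the numbers and the 30-day window are my own assumptions, not necessarily Jamie's exact formulation):

```python
# Burn rate = observed error ratio / error budget ratio.
# A burn rate of 1 means we spend the budget exactly as fast as the
# SLO allows; anything persistently above 1 eats the budget early.
SLO = 0.999              # 99.9% availability over a 30-day window
BUDGET_RATIO = 1 - SLO   # 0.1% of requests may fail

def burn_rate(errors: int, total: int) -> float:
    return (errors / total) / BUDGET_RATIO

# 1% of requests failing burns the budget 10x faster than allowed:
print(f"{burn_rate(errors=100, total=10_000):.1f}")   # 10.0
# At that rate, a 30-day budget is gone in about 3 days: page someone.
print(f"{30 / burn_rate(100, 10_000):.1f} days")      # 3.0 days
```

The nice property: a slow, steady trickle of errors stays below the paging threshold and can wait for office hours, while a fast burn pages immediately.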
- If we don’t use the error budget to our advantage, it will just be lost. Use the error budget!
- If a microservice crashes in the cloud and nobody notices, does it make a sound? 🤔😂(I’d add to this: Should it make a sound? Why? Why not?)
Link to the talk: A Theory and Practice of Alerting with Service Level Objectives
Production Engineering: Connect the Dots
Espen Roth talked about how graduates are onboarded in Production Engineering at Facebook, and what challenges come with that. He was hired by Facebook right out of school, without a previous internship there, so he experienced the process firsthand.
New graduates build things from scratch in school. They ran small, self-contained projects that they had to finish on time, and they know what is currently out there. What they lack is experience in maintaining a system over the long term. Teamwork is even more important in the job world, and unlike school, in a job you are there for the long run. Sometimes you have little choice about what you have to work on. Grades don't matter anymore, but impact does. Working in Production Engineering is very different from school.
As a company interviewing graduates, you want to look at three aspects:
- Capacity (skills, learning)
Internships are the ultimate interviews: they provide long exposure to feedback, and hiring a former intern requires little ramp-up.
Link to the talk: Production Engineering: Connect the Dots
Mental Models for SREs
Mohit Suley (Microsoft) woke up the audience by asking a couple of questions to test our mental models before the actual talk started.
The first mental model he talked about was the survivorship bias. The connection to SRE is that survivorship bias can influence metrics.
Then he went on to introduce the ludic fallacy and explained it using examples. We laughed at a system that was designed with too little login capacity. Another example was Tay, which still seems to be something that people inside Microsoft joke about. No plan survives first contact with the enemy…
Then I learned about decision fatigue: the capacity to make decisions decreases over time. A practical piece of advice, therefore: do not go online shopping when you are on-call.
It got better! If a thing has been proven over a long period of time, we call this the Lindy effect. How does this connect to the SRE world? Mohit suggested that even in the age of cloud, client computing is not going away; in fact, our devices are more powerful than ever. The cloud (remote compute) just grows faster. Another one: email is still around in 2018. Probably here to stay.
There is much more:
Link to the talk: Mental Models for SREs
This is my personal summary which comes without further explanation. Think of this as a note to myself that accidentally went public:
- Learn more about rubrics and how I can use them to gain better signal from interviews.
- Managing a distributed team requires thoughtful and patient leadership.
- Keep mentoring. It means a lot to mentees. Also, being a mentor and/or a mentee is a learning experience.
- Humans are biased. Providing psychological safety is key to good leadership.
- Symptom based alerting. SLO burn rate alerting. It’s better for human health. It makes so much sense!
- Mental models are fun! Easy to get lost in Wikipedia when you start looking them up…
Wow, you made it to the end of this very long article! Thank you so much for reading. As always, feedback is highly appreciated. As a reward, have this picture of a cute squirrel I met earlier this year in New York: