Good Morning, Europe! Please enjoy my SREcon Asia/Pacific 2019 day one conference notes, fresh off the press from beautiful Singapore. This year I signed up for volunteering at the Google booth and for serving as room co-captain at a couple of sessions. The reports will probably be a bit shorter than last year’s.
SREcon APAC is growing like crazy! The program co-chairs Frances Johnson (Google) and Avleen Vig (Facebook) opened the event by welcoming attendees and sponsors. They also provided data:
- Over 580 attendees!
- Attendees from 24 countries and 125 companies
- The program committee consisted of 24 members
- 73 speakers were selected to give one or more talks
It’s always interesting how attendees self-identify regarding their role. This time we had 58% engineers and 27% managers, while the rest identified as tech leads or other. Last year I was in the tech lead camp, but I have changed sides and now count toward the growing number of managers that are interested in SRE.
The organizers tried something new this year: Slido table topics. That is, attendees are invited to suggest and vote on topics. The most popular ones are printed on paper and put on some of the lunch tables. These tables are thus designated for people interested in the table topic. This helps kickstart conversations in a birds-of-a-feather kind of way.
This year’s theme is Depth of Knowledge. Our industry is expanding, we deal with increasing complexity over time, and we need to strive harder to understand how our systems work.
A Tale of Two Postmortems - A Human Factors View
Speaker: Tanner Lund (Microsoft, Azure SRE) with help from role-playing actors whose names I did not catch but who should be on the recording.
Details: A Tale of Two Postmortems
Role play: A Post Mortem discussion. It was more like an interrogation around human error and how to prevent it in the future (monitoring, playbooks, slowing down releases, re-thinking if automation was the right choice).
Second role play: A conversation focused around what happened, how the on-caller felt, what information they had at that time, and how that led to the conclusions that were drawn. This surfaced alert fatigue and also that the way the systems are run is exhausting to the on-callers. To me also some of the processes looked sub-optimal. By changing the way people talk about what happened these very interesting facts were discovered. The overall impression of the second role play was that it provided much more psychological safety.
My key takeaways:
- Learn from what happened. It is not about how to prevent this specific type of incident in the future. However, this will probably be a fortunate outcome of whatever action items are agreed upon.
- We assume everyone’s decisions made perfect sense at the time they were made.
- Don’t jump to conclusions or action items right away.
Availability: Thinking beyond 9’s
Speaker: Kumar Srinivasamurthy (Microsoft, Bing)
Details: Availability—Thinking beyond 9s
Any large system has outages. Kumar gave some examples, of which we as an industry are not short.
Bing had an incident where the product did not present the actual search results but only sponsored results (ads). Not just a degraded response: serving only ads has the potential to scare users away.
Question from the speaker: What is a good availability number? Audience: It depends! Smart 😅
First example: Parking meter. What availability is really needed? It’s likely enough if it works on weekdays. It should reliably collect money. Occasional takedowns for maintenance are not hurting much.
Next example: Flying. Given 100k flights a day, what availability do we need? Three nines (99.9%) would result in multiple disasters per day. Even five or six nines would still be too much. Interesting, I never thought about how many nines flying has. Of course, this example is too simplified. But it serves as a good starting point for thinking about nines.
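To make the nines concrete, here is my own quick back-of-the-envelope calculation (not from the talk) of how many daily failures a given availability implies:

```python
def failures_per_day(events_per_day: int, availability: float) -> float:
    """Expected number of failed events per day at a given availability."""
    return events_per_day * (1.0 - availability)

# 100k flights a day at various numbers of nines:
for nines, availability in [(3, 0.999), (5, 0.99999), (6, 0.999999)]:
    print(f"{nines} nines -> "
          f"{failures_per_day(100_000, availability):.1f} failed flights/day")
```

Three nines means roughly 100 failed flights per day, and even six nines still leaves one failure every ten days. Aviation clearly operates on a different scale of nines than web services.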
The talk progressed looking into typical challenges of measuring and monitoring availability and reliability. I found most points rather high level and think they are already covered in the SRE books.
Fun quote: Adding nines since 2009!
Use Interview Skills of Accident Investigators to Learn More about Incidents
Speaker: Thai Wood (Resilience Roundup)
Details: Use Interview Skills of Accident Investigators to Learn More about Incidents
My first thought was: Lamp-in-the-face-style interrogation. Immediately Thai made clear this is not about interrogation! It is about respectfully interviewing people to understand an incident. Research has shown that it makes a significant difference how you ask questions.
- Who should drive the interview? It’s the person who is being interviewed. They were there, they know more than the interviewer. They should be speaking most of the time.
- Expectations: When you ask people something they want to give an answer, which puts some pressure on answering even when the facts are not well remembered. A better technique would be to give people space to tell their story and make clear that there is no specific expectation. This is reported to work surprisingly well.
- Question type: Obviously you want to ask open ended questions.
- How much do you want to know? Ask: Tell me everything even if it is out of order or seems irrelevant! There is a ton of value in the details even if they seem unrelated. You can sort out later what was really relevant.
- Don’t interrupt! Should be obvious but seems to still happen.
- Don’t follow a template. Don’t use a form. It will constrain what people will tell you.
- Adapt the questions to the person being interviewed as the interview progresses.
- Make the setting safe and encourage people to say “I don’t know” if they don’t know. Otherwise they will (most likely unintentionally) make up answers.
Leading without Managing: Becoming an SRE Technical Leader
Speaker: Todd Palino (LinkedIn)
Details: Leading without Managing: Becoming an SRE Technical Leader
Todd shared his story of how he became a technical leader, after having spent about a decade at VeriSign. His words, my interpretation. 😇
There is nothing wrong with banging out code.
At LinkedIn, he reported, they have a clear way of measuring engineers. You might get a promotion on the back of one or two projects. Or because of tenure. They have clearly defined titles and what they mean. The natural progression is chasing titles, but not in a bad way. Interesting.
He pointed out that being a senior staff engineer means something different at other companies. There is no formal definition that spans the industry.
Here’s what LinkedIn has come up with:
- Junior SRE: Is task driven. Completes a task, gets the next task, etc.
- SRE: Has a bit more autonomy, develops automation, shares knowledge.
- Senior SRE: Is a stack expert. Drives design and collaboration. Mentors and probably receives mentorship themselves.
- Staff SRE: Moving into leadership. Has deep knowledge about multiple stacks. Thinks about what comes up next. Supports their manager to focus on people management by being the technical right hand to them.
- Senior Staff SRE: Seasoned Staff SRE with even more oversight and responsibilities.
The three pillars of Engineering at LinkedIn:
- Execution: Getting Things Done
- Craftsmanship: Doing Things Well
- Leadership: Enabling Others to Get Things Right
Then digging into leadership! It’s hard to determine if someone is a good leader. Roles are increasingly less clearly defined as one moves into leadership. Some say: You define your role! How helpful. 🙃
Find your happy path. What do you like? Is it management meetings and moving things forward? Is it cranking out code? Is it thinking about where technology should be heading? What kind of leadership do you want for yourself?
One way to find out could be project work which comes with the following aspects:
- Engagement: Coordinating multiple teams
- Impact: Change your organization
- Inception: Who came up with the idea?
- Process: Not all projects are technical
This talk was so full of content that I had a hard time following and taking notes at the same time. Here are some unsorted items from the talk:
- Mentoring: As a leader you want (I’d even say: need) to be a mentor and you want to be mentored yourself.
- Looking outside the industry: Find a Toastmasters chapter nearby and practice public speaking.
- Meetups are also a place where you can practice talks and sharing knowledge. Also builds up a network.
- Get more exposure and practice: Ignite talks, lightning talks, writing a blog, speak at conferences, write articles for magazines.
- Short-form publishing. That is, writing very short books.
- Engage in diversity, inclusion, and ethics.
- How to identify meaningful work? Whenever you hear the phrase Someone should… you could be that someone!
- Keep a paper trail. Document your projects, comment your code, turn tribal knowledge into artifacts that others can use, keep a list of your accomplishments (also helpful for promotion).
- Stay organized. It’s not about having every minute of your day being planned out. But having a rough idea where you want to spend your time is advised.
Can’t help myself but I have to point out that I found it incredibly hard to read some of the slides. The colors were not the best choice. 🙈 Nevertheless, good talk and Todd is a great speaker!
LinkedIn being at the forefront of SRE and culture is becoming a repeating pattern in these talks. These folks really know and reflect on their craft.
Let’s Build a Distributed File System
Speaker: Sanket Patel (LinkedIn)
Details: Let’s Build a Distributed File System
Sanket first explained how basic file system operations such as listing, reading, and writing are related to inodes. Then he went on to design a simple distributed file system. Distributed meaning, in this context, running one file system in a fault-tolerant way across multiple networked hosts with disks attached. From my experience the crucial point in designing such a system is locking and consistent metadata management. Sanket solved this by introducing a master server that serves all metadata and minion servers that store the data blocks. Clients talk to the master and the minions to perform file system operations.
I’d argue that this is not really solving the problems we usually face when dealing with distributed file systems. The master, for example, poses a risk because it is a single point of failure. The server is threaded but I am missing locking and synchronization. Data consistency is not guaranteed by this design. On the other hand Sanket warned in the talk description that this will be a very basic implementation. After all, this talk appeared in the Core Principles track which is designed for exactly this type of foundational education.
The source code is public and described on Sanket’s blog. Great work, Sanket!
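For readers curious what the master/minion split might look like, here is a toy Python sketch of my own making (all names are made up for illustration; see Sanket’s blog for the real implementation). The master only maps files to block locations; clients then fetch blocks from the minions directly:

```python
class Master:
    """Toy metadata server: knows which minion holds which block."""

    def __init__(self):
        # filename -> ordered list of (block_id, minion_address)
        self.metadata = {}

    def allocate(self, filename, num_blocks, minions):
        """Assign blocks round-robin across minions and record the layout."""
        layout = [(f"{filename}#{i}", minions[i % len(minions)])
                  for i in range(num_blocks)]
        self.metadata[filename] = layout
        return layout

    def lookup(self, filename):
        """A client asks the master where the blocks live."""
        return self.metadata[filename]

master = Master()
master.allocate("/tmp/report.txt", 3, ["minion-a:8000", "minion-b:8000"])
print(master.lookup("/tmp/report.txt"))
```

Even this toy version makes the single-point-of-failure problem visible: lose the `metadata` dict and every file on the minions becomes unreachable.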
Shipping Software with an SRE Mindset
Speaker: Theo Schlossnagle (Circonus)
Details: Shipping Software with an SRE Mindset
Circonus uses C for high performance applications. One of the challenges (like, next to debugging an application written in C) is to expose metrics. Naturally, they had to write their own metrics library that works concurrently and fast.
- When in doubt: Expose telemetry!
- A typical request can produce between 10 (ten) and 2000 (two thousand) spans from distributed tracing. Quite a variance! But the traces are not stored very long at Circonus.
- Traces are used with reduced details due to performance. A full trace can be generated on request and might yield up to 4 GB of highly detailed debugging dump. Someone then has to dig through it. Fun!
- Surprising approach: Tracing data is produced in memory and then pushed into a message queue. I haven’t really understood why this is superior but it does give systems the ability to subscribe to traces. 🤔
- Theo is not a big fan of logs. He thinks logs should be designed for humans unless they are purely machine to machine communication.
The talk included code samples. It was good to see and think some C for a change. Go has taken over basically all my coding and I don’t even read C code anymore. 😢 Good ol’ C.
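Circonus’ library is written in C, but to give a rough flavor of what “works concurrently” means for a metrics counter, here is a minimal thread-safe counter sketch in Python. This is purely my own illustration, not their implementation; real high-performance libraries shard counters or use atomics instead of a single lock:

```python
import threading

class Counter:
    """Minimal thread-safe metrics counter guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def inc(self, n=1):
        with self._lock:
            self._value += n

    def value(self):
        with self._lock:
            return self._value

c = Counter()
# Eight threads each increment 1000 times; the lock keeps the count exact.
threads = [threading.Thread(target=lambda: [c.inc() for _ in range(1000)])
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(c.value())  # 8000
```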
Networking at the Google Booth
An interesting experience for someone who is more of an introvert in face-to-face conversations. I talked to people who approached our booth and learned how many companies are out there doing products that are similar to Google products and also very successful. Getting myself out of Europe every now and then continues to be a great experience.
I also had some non-technical conversations. I had the chance to once again share my theory of the incompleteness of the German language. For example, we have an adjective for being hungry (hungrig) and one for not being hungry anymore after eating something (satt). However, for being thirsty (durstig) there is no such opposite adjective. We tried to make one up twice. The first time we could not decide on a word. The second time the word (sitt) was not really accepted by the wider population.
Anyway, I did let my mind wander… What an exhausting first day. Stay tuned for tomorrow!
Welcome, welcome to another day in Singapore full of interesting talks at SREcon Asia/Pacific 2019. Here’s the gist of what I listened to, talked about, and learned today. I spent more time in technical talks today. I needed a bit of a deep dive to contrast the high-level topics from yesterday. The inner engineer is strong with me. 👷‍♂️ I am glad SREcon offers both! Kudos to the program committee.
The Early Bird
Speaker: Me (Google) about work that predates my joining date and is unrelated to my employer.
Details: Implementing Distributed Consensus
I shall not toot my own horn. I let others decide if that was worth their time. Here’s the source code for your interest: The Skinny Distributed Lock Service
Edge Computing: The Next Frontier for Distributed Systems
Speaker: Martin Barry (Fastly)
Details: Edge Computing: The Next Frontier for Distributed Systems
Martin pointed out that the talk was about personal observations and not related to Fastly.
Martin wants to spark a discussion about Edge Computing within the community and shared his thoughts and definition.
Executing non-trivial functionality as close to the client as is reasonable.
Martin presented a hierarchy of locations a request can be served from, ranging from the Origin (e.g. a non-cached original response) down to the Client itself:
- Continental Regions
- Data centers in major cities (carrier-neutral, Internet Exchanges)
- Internet Service Providers (ISP)
- Last mile
One challenge is that subsequent requests can end up at different levels of that hierarchy or different serving processes at the same level. I assume cache management is challenging for such setups. Martin says scalable solutions should depend on as little state as possible.
Typical applications for Edge Computing are:
- Request normalization
- Authentication or Paywall
- Vary by user agent
- A/B Testing
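Request normalization, the first item on that list, might look something like this. This is my own sketch of the idea, not any provider’s actual API: canonicalize the URL at the edge so that equivalent requests share a cache key and the origin sees fewer misses.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that don't change the response and only fragment the cache.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}

def normalize(url: str) -> str:
    """Canonicalize a URL so equivalent requests map to the same cache key."""
    parts = urlsplit(url)
    # Lowercase scheme and host, drop tracking params, sort what remains.
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k not in TRACKING_PARAMS)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(query), ""))

print(normalize("HTTPS://Example.COM/a?b=2&a=1&utm_source=mail"))
# -> https://example.com/a?a=1&b=2
```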
Interesting: How does one do Edge Computing anyway? I learned that the runtime is often provided by the entity running the edge cache/compute resources. I wasn’t aware of that and somehow thought I would own the full stack. Common practice is using Domain Specific Languages (DSL) or containers, with containers leading to multiple copies of the same data in memory, which is wasteful.
The new hotness are these two fellas:
- WebAssembly (WASM)
- WebAssembly System Interface (WASI)
Operational challenges that came up:
- Continuous integration, deployments, rollbacks, and of course testing on the edge.
- The hardest part seems to be load testing. How to load test something that is far away from the origin but closer to the client?
- How to integrate provider metrics into your own metric processes and infrastructure?
- Did I mention distributed tracing?
- External health checks. They are usually run from well-connected data centers and not from the far edge. Oh my!
It totally makes sense! Why not push WASM to whatever runtime is provided? It could even be the client’s browser if the application allows that. Very informative talk! 👍
Critical Path Analysis - Prioritizing What Matters
Speaker: Althaf Hameez (Grab)
Details: Critical Path Analysis - Prioritizing What Matters
Althaf asks: Can we keep the subset of services running that is critical to our core business? The definition of core business here being:
Any impact on our system that impacts the ability of a passenger to get a car to get safely from Point A to Point B.
It may not have all the bells and whistles of the fully fledged experience but it gets the core business done.
Apparently, cash is still a thing where Grab operates (Althaf: Cash is King). So that means cashless payments are not really in the critical path, which ruled out a number of services as non-critical. If you ever worked with payment processing you might know how much complexity can hide in there.
To validate the critical path they ran it in the testing environment first. Then in production. Scary? Kind of, but Grab operates in south-east Asia only and therefore sees a pretty cyclic usage pattern. They were able to test their hypothesis in production relatively safely during the night.
On an organizational level Althaf sought executive sign-off for the riskier things and involved the product teams early. On a technical level circuit breakers did the job. Caching helped to degrade gracefully, for example geo fences and city details rarely change and can be safely cached for a while.
Sometimes, Althaf reports, it is OK to fail. For example, a ride usually ends with the passenger seeing a dialog to rate the driver. However, when the rating service is failing the dialog won’t go away. With the dialog in the way the passenger can not book a new ride. While rating a driver is an important feature it is not essential enough to block new bookings. Nowadays the Grab app continues to let you book rides even if the rating failed. Shows what a critical path analysis can surface. In hindsight it sounds obvious but we all know how often our systems surprise us with things that should be obvious, right? 😲
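The “OK to fail” pattern from the rating example can be sketched in a few lines. This is my own illustration of the idea, with made-up function names; the point is that the rating call is best-effort and its failure never blocks the critical booking path:

```python
def submit_rating(driver_id, stars):
    """Stand-in for the rating service; imagine it failing or timing out."""
    raise TimeoutError("rating service unavailable")

def end_ride(driver_id, stars):
    """Rating is best-effort: its failure must never block the next booking."""
    rating_recorded = True
    try:
        # Real code would add a timeout and a circuit breaker here.
        submit_rating(driver_id, stars)
    except Exception:
        rating_recorded = False  # log it, maybe retry later, but don't block
    return {"rating_recorded": rating_recorded, "can_book_again": True}

print(end_ride("driver-42", 5))
# -> {'rating_recorded': False, 'can_book_again': True}
```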
Collective Mindfulness for Better Decisions in SRE
Speaker: Kurt Andersen (LinkedIn)
Details: Collective Mindfulness for Better Decisions in SRE, Slides
I only recently started meditating (can recommend) which means I am still at the beginning of the mindfulness journey. Mindfulness has become an interesting and helpful tool for me. So my expectations were quite high on Kurt’s talk. I was not disappointed. 🤩
The concept of collective mindfulness has emerged in the last 20 years in research. Kurt kicked it off by giving examples of mindlessness. He defined it as automatic and reactive behavior. Like staring at the same dashboards again and again and reacting the same way. There is a lot of potential for mindlessness in SRE.
Kurt describes the characteristics of the environment we work in as VUCA: volatile, uncertain, complex, and ambiguous.
Mindfulness, in contrast, is being able to reflect on our behavior and identify improvements in how we react to VUCA.
Collective Mindfulness: Capability to discern discriminatory detail about emerging issues and to act swiftly in response to these details.
One SRE-related aspect of collective mindfulness is the reluctance to accept simple solutions. I can think of “it must have been a network blip” as a common explanation that some teams use instead of properly investigating an issue.
The aspects include:
- A preoccupation with failure: prevent failures by focusing on discovering incipient failures and their precursors.
- A reluctance to simplify interpretations (see above example)
- A sensitivity to operations: recognizes that a solution to one problem may create another, and therefore process-wide measurement is essential.
- A commitment to resilience: We make things better to not get paged again for the same reason.
- A deference to expertise: recognizing the expertise of the people running things, not necessarily the developers or architects
To continue practicing collective mindfulness a team needs to know where to look instead of looking at all the things. I sense this is something we need intuition for?!
Here’s what to do:
- Focus on failure
- Refuse to simplify
- Stay attuned to operations
From the military we learned about the STICC model of communication:
- Situation: Here’s what I think we face
- Task: Here’s what I think we should do
- Intent: Here’s why
- Concern: Here’s what we need to watch
- Calibrate: Now talk to me
Kurt called this form of communication ritual and emphasized that it is a useful tool for relaxing a stressful situation. It fits in well with other rituals we have in SRE, like writing post mortems or following up on action items by setting up project work with a clear goal. From my time serving as an army officer I can confirm that ritualized communication can be a great tool in the right situations. The more it is practiced the better it works when things go so wrong that there is no time for questioning the general approach.
Dangers to collective mindfulness are our IT systems themselves, when they make us perform routine actions or come with hard-to-understand automation.
Linux Memory Management at Scale: Under the Hood
Speaker: Chris Down (Facebook)
Details: Linux Memory Management at Scale: Under the Hood
At the beginning of the talk Chris went over fundamentals on Linux resource management such as cgroups. Resource management is a tricky business. For example, if you memory-limit an application too much you basically transform memory into disk I/O. Not really a win. Chris claims even seasoned SREs often have misconceptions about how memory at scale works.
After ranting about the resident set size as a metric Chris comes to the conclusion that unless we heavily measure and instrument an application we can not really tell how much memory it uses.
Swap, however, deserves a better reputation than it has. It’s not emergency memory, although it is often seen as that. Having swap, he argues, is like running make -j cores+1: putting a little bit of pressure on the memory to really squeeze the last bit of performance out of it. Without swap there would still be disk I/O, e.g. file system cache pages being evicted by writing them to disk.
The OOM killer shall not be trusted. It is always late to the party. It doesn’t really know what to kill, which means it may go to the wrong party. To avoid this the Linux kernel (via kswapd) tries to reclaim the coldest pages first. Some pages may not be reclaimable that way. Swap can help here by moving them to disk temporarily.
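One standard way to steer the kernel OOM killer (this is general Linux practice, not something from the talk) is the per-process knob /proc/&lt;pid&gt;/oom_score_adj: -1000 exempts a process entirely, +1000 makes it the preferred victim. A small helper might look like this:

```python
import os

def set_oom_score_adj(pid: int, adj: int) -> None:
    """Write the OOM score adjustment (-1000..1000) for a process.
    -1000 exempts it from the OOM killer; 1000 makes it the preferred victim.
    Lowering the value below its current setting typically requires root."""
    if not -1000 <= adj <= 1000:
        raise ValueError("adj must be within [-1000, 1000]")
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write(str(adj))

def get_oom_score_adj(pid: int) -> int:
    with open(f"/proc/{pid}/oom_score_adj") as f:
        return int(f.read())

# Demo (Linux only): mark this process as a preferred OOM victim.
# Raising the value is allowed without privileges.
if os.path.exists(f"/proc/{os.getpid()}/oom_score_adj"):
    set_oom_score_adj(os.getpid(), 500)
    print(get_oom_score_adj(os.getpid()))
```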
Quiz for the audience: What metric do you look at to find out if there is a memory resource issue? Audience ideas:
- Disk I/O
Chris thinks memory pressure is a much better metric. I agree. I’m a big fan of memory pressure and most of the time it is the only metric I have to look at to rule out memory issues in a system. 👀
Operating systems have many consumers of memory: user allocations, file caches, network buffers, etc. Memory pressure happens when there is a shortage of memory. It represents the work that Linux (or any other OS) does in order to manage and shuffle memory around to satisfy the system’s many users.
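Since Linux 4.20, pressure stall information (PSI, work that came out of Facebook) exposes memory pressure directly in /proc/pressure/memory. A line there looks like the sample below, and parsing it is straightforward:

```python
def parse_psi_line(line: str) -> dict:
    """Parse one line of /proc/pressure/memory (Linux >= 4.20 PSI format)."""
    kind, *fields = line.split()
    values = dict(field.split("=") for field in fields)
    return {
        "kind": kind,                     # "some" or "full"
        "avg10": float(values["avg10"]),  # % of time stalled, 10s window
        "avg60": float(values["avg60"]),
        "avg300": float(values["avg300"]),
        "total": int(values["total"]),    # cumulative stall time in microseconds
    }

# Sample line in the format the kernel emits (values are made up here):
sample = "some avg10=0.31 avg60=0.12 avg300=0.00 total=123456"
print(parse_psi_line(sample))
```

"some" means at least one task was stalled on memory during that window; "full" means all non-idle tasks were. Sustained non-zero averages are the signal to look for.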
TIL: Facebook has a user-space OOM killer. I am not sure what I should think about that. 🤔 I have the feeling that Facebook runs machines very differently from Google although both companies work together in moving things forward.
Chris continued by showing how Facebook slices server resources, most importantly memory. They use cgroups and he showed some of their initial and improved cgroup hierarchies. I suggest watching the video once it is out because I couldn’t keep up with taking notes.
Cross Continent Infrastructure Scaling at Instagram
Speaker: Sherry Xiao (Facebook)
Details: Cross Continent Infrastructure Scaling at Instagram
I was a bit late to Sherry’s talk which I am sorry for. I missed the first 5 minutes and when I entered Sherry was already deep into sharded Cassandra databases.
Instagram migrates data between regional clusters when a user moves to a different continent. This is really nice! Out-of-region users (e.g. frequent travelers or digital nomads) are a pain to latency-critical databases. Jumping oceans on every request is expensive and frustrating.
Forming a quorum in Cassandra depends on the replication factor. With only so many data centers in a region it may be necessary to include out-of-region nodes in the quorum.
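The quorum arithmetic behind that is simple: a Cassandra QUORUM operation needs a strict majority of the replicas, i.e. floor(RF/2) + 1.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for a Cassandra QUORUM operation: a strict majority."""
    return replication_factor // 2 + 1

for rf in (3, 5, 6):
    print(f"RF={rf}: quorum={quorum(rf)}")
# RF=3 needs 2 replicas, RF=5 needs 3, RF=6 needs 4
```

So with RF=3 and only two in-region replicas healthy you can still form a quorum locally, but lose one more and every quorum request has to cross an ocean.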
The talk was short but the slides were to the point and the delivery was really good. I enjoyed the talk very much although it brought back some unpleasant memories regarding my own encounters with Cassandra. 😂🙈
Software Networking and Interfaces on Linux
Speaker: Matt Turner (Native Wave)
Details: Software Networking and Interfaces on Linux
Spontaneously, I ended up in the Core Principles track once again. Matt started with all the basics on Ethernet, IP, interfaces, DHCP, accept(). I’m not going to write all that down since it has been discussed at length all over the Internet already.
It got interesting to me when Matt discussed how a process emits packets using software interfaces, such as a tun device (layer 3). This is what one needs when building a VPN. Once the device is turned into a tap device it operates on layer 2. This allows for more crazy stuff to be implemented, such as the highly elaborate IPoWAC protocol.
Next thing Matt showed was a br (bridge) device (a virtual switch) and its properties. Then the bridge was compared to the ovs (OpenVSwitch) device. Matt moved on to VMs and quickly discussed the virtio memory page-sharing virtual network device we often see in Linux KVM guests. Since this is 2019 and everything is containers now, the next device we looked at was the veth virtual ethernet pair (veth devices always come in pairs and usually span network namespaces). Finally, Matt wanted to make two containers talk to each other without setting up a full bridge. A macvlan device did the trick because it’s simple. I’d argue it creates kind of a bridge, although we don’t have to deal with STP and other exciting features. If you want to go really crazy there are even more exotic device types. Think twice. 👆️
There was nothing new for me in this talk but it was a good talk going through and un-confusing Linux software networking. The Core Principles track sometimes is like rolling dice. Nevertheless, I think it is important we have this track and that we address all levels. After all, networking is an important, sometimes undervalued area of SRE expertise.
SREcon is hosted at the Suntec convention center. It’s a fancy place with spacious rooms and modern equipment. Its entrance features a huge wall of full HD screens.
I am curious how such a crazy amount of screens is being controlled. It seems I am not the only one who noticed the nearby WiFi network named “LG_signage”. Coincidence? 🤔 I don’t think so!
Anyway, this is SREcon. DEFCON is still two months away. I shall be a good citizen! But it is not easy. Look at this display! It is even bragging about itself!
Phew! Day three comes to an end here in Singapore! It’s Friday evening which means the weekend is about to start. 🎉 Sorry European and American folks, you have a few more hours to go.
We had a generous dinner reception last night with highly attentive staff. That may have led to one or the other SRE exceeding their own SLO for beverages. At least that was my impression when I looked at the tail latency for showing up to the first talks of the day. 🤨 Anyway, let’s sum up what’s been hot today!
Getting More out of Postmortems and Making Them Less Painful to Do
Speaker: Ashar Rizqi (Blameless)
Details: Getting More out of Postmortems and Making Them Less Painful to Do
Blameless Inc. is rumored to be the new star in the SRE tools realm. Therefore I was interested to learn how they think about post mortems.
Ashar asked the audience to agree on the benefits of doing post mortems by show of hands:
- Build more reliable systems
- Get important insights
- Continuously innovate - build new features
- Hire and retain the best talent by having a blameless culture
Then Ashar moved on to case studies from Box, Digital Ocean, and Home Depot. And another few rounds of show of hands. He likes show of hands. ✋✋
Who should own the post mortem?
- Service owner owns post mortem
- Manager/Director/VP owns time allocation
- Track ownership via ticket
- Set up a Post Mortem Guild. That is, a group inside the organization that is passionate about post mortems.
How to ensure on-time completion?
- Block new releases for teams with outstanding post mortems
- Do them on Slack or whatever the internal chat app is
- Allocate time on sprint, escalate to C-level
- Gamify and reward for on-time completion
I have some thoughts on this. Some of the suggested solutions are actually big bombs to set off. Escalating to C-level is something that one can pull only so often. Gamifying is always a risky move. People will play the game.
Who collects all the details in a timely manner?
According to Ashar the Incident Commander is responsible for pulling in all the information needed. Once again, a synchronous process via chat app is suggested.
How to track action items?
- Use one ticketing system. Looks like it is not unusual to have multiple ticketing systems in a company. Sounds chaotic to me.
- Use tags profusely, most systems have them. I’d add: And filter for those tags and actually use them. 😉
- Have SLO impact attached to action item.
- Generate daily outstanding action item report.
How to foster a blameless language?
- Don’t call out individuals or teams. If you have to, use initials not full names.
- Move away from a single root cause
I missed the remaining points because the next slide was up already. Presentation speed up? 🏎
What is still debated or unresolved?
- Asynchronous vs. Synchronous
- When do you declare a post mortem complete?
- Knowledge Extraction: How to get value out of a database of post mortems?
I was thinking of skipping this talk initially. After all, everyone believes Google has figured out and perfected the post mortem process. However, I have learned about different ways and processes to create, own, and follow up on post mortems. Most interestingly, many companies seem to prefer synchronous tools like meetings or chat. I am more used to a primarily asynchronous process. It was also nice to see how post mortem and post mortem process ownership is handled at other companies. Big thanks to the audience for openly sharing their internal processes and structures!
The MTTR Chronicles: Evolution of SRE Self Service Operations Platform
Speakers: Jason Wik, Jayan Kuttagupthan, and Shubham Patil (VMware)
Details: The MTTR Chronicles: Evolution of SRE Self Service Operations Platform
Jason introduced the challenges that the SRE team at VMware is facing. They aim to reduce the MTTR (Mean Time To Recovery) in a landscape of diverse multi-environment infrastructure that looks very different for each customer. Basically our SRE nemesis: Complexity! 🤯
Shubham continued by highlighting how the team approached the challenges. They created a platform called North Star:
An Extensible, Dynamic, and Collaborative platform to reduce MTTR and improve operational efficiency for unique and constantly changing environments.
They integrated different platforms into a single user interface:
- Alerting (pagerduty) 🚨
- Health and status information
- Automation tasks
However, these were basically tabs in a larger tool.
Jayan proceeded to show how they correlated the data from the different tabs (platforms) onto a single pane of glass. That is, correlating the data and providing a 360 degree view of incidents. By having alerts and on-going automation tasks correlated an SRE can triage quicker and thus reduce the MTTR. Having cause, symptom, and action presented in a single place helped to react faster. The UI was reported to be updated in real-time and responsive. Also: Integration with Business Intelligence and ticketing systems.
Unfortunately there were no screenshots or demos of the platform. I would have really loved to see what this wondrous platform looks like in practice. 😳
Building Centralized Caching Infrastructure at Scale
Speaker: James Won (LinkedIn)
Details: Building Centralized Caching Infrastructure at Scale
James is part of the Caching as a Service (CaaS) team at LinkedIn. Pretty cool: He used slido’s live polling feature in his presentation. Very nerd-friendly way of involving the audience.
At LinkedIn teams were frustrated with operating memcached. As a drop-in replacement they decided to use Couchbase. Then the usage exploded to over 2000 hosts in production. These were managed by different teams. That caused some problems:
- Lack of operations interest: Teams just wanted to cache data and were not interested much in running the infrastructure for that.
- Custom deployments: Including maintenance windows during which cache clusters were not available.
- Runaway hardware growth: Waste of money.
LinkedIn decided to create a caching team that manages caching in a centralized way at scale. The team had three main goals:
- Build & manage at scale
- Improve hardware efficiency
- Improve security
The CaaS team now provides:
- 0–1 ms 95th percentile latency for get and set operations
- 10 ms SLO
Depending on the use case the caching may be backed by SSD, HDD, or pure memory. The team also provides dashboards and metrics to the teams. Furthermore, as the caches are fully managed, the CaaS team takes care of OS and software updates. However, they don’t want to own the data and they do not own backups. The teams know best what is acceptable for their data and therefore own the data themselves. The CaaS team runs about 2000 hosts serving over 10 million QPS across multiple clusters.
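The talk showed no code, so here is only a generic sketch of the cache-aside pattern such a managed caching service enables; `InMemoryCache` is a simplified stand-in for illustration, not LinkedIn’s Couchbase-backed client:

```python
import time

# Minimal TTL cache sketch (illustrative only). A CaaS client would talk
# to a managed cluster instead of a local dict, but the get/set surface
# that product teams program against looks roughly like this.
class InMemoryCache:
    def __init__(self, default_ttl=60):
        self.default_ttl = default_ttl
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Expired entries are evicted lazily on read.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl=None):
        ttl = self.default_ttl if ttl is None else ttl
        self._store[key] = (value, time.monotonic() + ttl)
```

In the cache-aside pattern, application code tries `get` first, and on a miss reads from the source of truth and calls `set`, which is why data ownership (and backups) can stay with the product team.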
One of the challenges they encountered was GDPR and configuration management. To deal with the configuration challenge they created a wrapper around Couchbase. The wrapper was built in a way that it could be plugged into the existing deployment process and tools. Another challenge was to run Couchbase as a non-root user which, at that time, was not supported by Couchbase. James reported significant code changes and an overhauled deployment process. They were able to pull it off without customers noticing. Nice!
James closed with lessons learned:
- Treat servers as cattle🐮, not pets😾.
- Start with a core offering and iterate from there.
- Codify checklists✅into automation⚙️ once they are reasonably well tested.
- Build platforms, not tools🛠.
- Trust your automation🤖. If needed, try to understand your automation better.
The next step for the CaaS team is to provide a self service for creating caching buckets. Very good talk that provided interesting insights.
Hybrid XFS - Using SSDs to Supercharge HDDs at Facebook
Speaker: Skanda Shamasunder (Facebook)
Details: Hybrid XFS - Using SSDs to Supercharge HDDs at Facebook
Skanda started strong by claiming: A stupidly simple solution that looks risky on the outside can greatly improve performance.
The IO Wall
Disks keep gaining bytes, but because the number of actuator arms has stalled, the seeks per disk stay the same. Workloads are getting hotter: ML, video streams. This means: today we buy more disks to get more IO, not to get more storage.
Facebook uses XFS. They noticed an interesting amount of metadata writes: about a quarter of the IO was spent on metadata. Then they stumbled upon XFS real-time mode, a little known feature that puts data and metadata on different devices. So they thought: why not put metadata on SSDs, where IOPS are cheap, and data on HDDs, where bytes are cheap? That’s just what they did in an experiment. The experiment went really well. They nearly eliminated random writes. Now they can use bigger disks and fully utilize them.
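For the curious, XFS real-time mode is configured at mkfs/mount time. A rough sketch of the mechanics with placeholder device names (Facebook’s actual setup was not shown in the talk):

```shell
# Main device (SSD) holds all metadata and the log; the realtime
# device (HDD) holds the data of files flagged as realtime.
mkfs.xfs -r rtdev=/dev/hdd1 /dev/ssd1
mount -o rtdev=/dev/hdd1 /dev/ssd1 /mnt

# Set the rtinherit flag on the mount point so newly created files
# automatically place their data on the realtime (HDD) device.
xfs_io -c 'chattr +t' /mnt
```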
At this point in the talk Skanda fooled the audience. No spoilers, go watch the recording once it is out!
SSD failures? With the metadata lost they would also lose the data. And what if the workload changes? Is it worth rolling out such a fundamental change to the fleet?
SSDs die less often than HDDs. But they take the HDDs with them, as the metadata is essential. What if multiple SSDs die at once? The solution is to replicate the metadata. They also ran endurance tests that went well overall. The next item to analyze was whether buying SSDs to utilize the HDDs better was a good business move. It turns out to be a money saver, because they no longer have to buy thousands of new disks just to increase the IO.
How to roll this change out to tens of thousands of hosts? Carefully and with automation!
Skanda’s takeaways:
- Hard problems can have simple solutions.
- Gut feelings can be wrong.
- Data wins arguments.
- Better safe than sorry.
Extending a Scheduler to Better Support Sharded Services
Speaker: Laurie Clark-Michalek (Facebook)
Details: Extending a Scheduler to Better Support Sharded Services
Sharded services are workloads that need access to a particular shard of data. It took me a while to get used to what scheduling means in this context. I am used to a different mode of running services. A scheduler at Facebook seems to be responsible for scheduling and de-scheduling tasks on hardware hosts.
The central piece is a scheduler that knows about machine health, updates, and upcoming maintenance. The scheduler, however, has to ask the service if it can proceed with the planned scheduling operation. The service might decline being re-scheduled.
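The ask-before-acting protocol might look roughly like this; the names and the replica-count policy are my own illustration, not Facebook’s actual API:

```python
from enum import Enum

class Verdict(Enum):
    APPROVE = "approve"
    DENY = "deny"

class ShardedService:
    """Illustrative service-side handler that the scheduler consults."""

    def __init__(self, min_replicas_per_shard):
        self.min_replicas_per_shard = min_replicas_per_shard
        self.replicas = {}  # shard id -> number of healthy replicas

    def can_drain(self, shard_id):
        # Called by the scheduler before de-scheduling a task. The service
        # declines if draining would drop the shard below its minimum
        # healthy replica count.
        healthy = self.replicas.get(shard_id, 0)
        if healthy > self.min_replicas_per_shard:
            return Verdict.APPROVE
        return Verdict.DENY
```

The interesting design question is encoded in that return value: a `DENY` here is the service owner overruling the scheduler.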
Trivia: At lunch Laurie and I talked about engineering culture at Facebook. I asked him which single emoji he would pick to describe Facebook. My guess: 🤠 His actual answer: 🙃
I have the feeling the scheduler’s design is heavily influenced by Facebook’s engineering culture, which gives service owners significant freedom and say, down to the hardware level.
In the end it is about a trade-off of power. Should service owners be allowed to block a task migration? Should they be able to block rack or data center drains? Or do you want a scheduler that always wins?
The overall question may be: Can we make schedulers aware of sharded services and their special needs? Maybe even in Kubernetes? Do we want this at all?
Yes, No, Maybe? Error Handling with gRPC Examples
Speaker: Gráinne Sheerin (Google)
Details: Yes, No, Maybe? Error Handling with gRPC Examples
I was looking forward to this talk because I stumbled upon gRPC error handling in the past.
If a service’s response status doesn’t equal OK, it gets interesting. gRPC has ~16 status codes to indicate errors. Only a few of them can be issued by the auto-generated code and the gRPC library. All status codes can be used by the application developer to communicate the reason for an error.
Gráinne walked us through a couple of interesting error cases. For example, DEADLINE_EXCEEDED can mean a request never reached the server. It can also mean everything worked but the response arrived at the client stub too late. The client stub then overrides the OK status code with DEADLINE_EXCEEDED. You can’t use the response and you wasted resources on the server side. Mind blown. 🙀 I think I never really thought about the stubs. I just assumed they are there and just work.
A better approach is to check the deadline on the server side and cancel the request there if it can’t possibly be served in time. It gets more confusing if the server sets the status code to DEADLINE_EXCEEDED instead of CANCELLED. I feel like debugging this can be painful. If metrics are used for troubleshooting one should be aware which stub the specific metrics were collected at.
Gráinne’s advice for API designers:
- Tell clients which errors are temporary and which are permanent.
- If there is more than one error, return the most specific one.
- Hide implementation details unless you want client decisions to depend on them.
- Don’t blindly propagate errors. They can contain confidential data.
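The server-side deadline check mentioned above can be sketched as a small helper. In a real Python gRPC servicer the remaining time would come from `context.time_remaining()` and the cancel from `context.abort(grpc.StatusCode.CANCELLED, ...)`; the cost threshold below is an assumption for illustration:

```python
# Assumed per-request serving cost in seconds (illustrative only).
ESTIMATED_COST_SECONDS = 0.5

def should_cancel(time_remaining_seconds):
    """Return True if a request cannot possibly be served in time.

    In a gRPC servicer, pass context.time_remaining() here; it returns
    None when the client set no deadline. Cancelling early with
    grpc.StatusCode.CANCELLED avoids wasting server resources on a
    response the client stub would discard as DEADLINE_EXCEEDED anyway.
    """
    if time_remaining_seconds is None:
        return False  # no deadline set: always serve
    return time_remaining_seconds < ESTIMATED_COST_SECONDS
```

Returning CANCELLED (rather than DEADLINE_EXCEEDED) from the server also keeps the metrics unambiguous about where the deadline was enforced.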
Ethics in SRE
Speakers: Laura Nolan (Slack) and Theo Schlossnagle (Circonus)
Details: Ethics in SRE
History time: Civil engineering had its fair share of disasters that killed people. Engineers were the ultimate experts on the things they were building, and as professionals they started caring about doing things right. Laura and Theo argue our profession is at a similar point today. We have to start thinking more about serving society and demanding ethical standards for the work we do. Similar to civil engineers not building bridges that have too weak of a structure to carry the load, not even when their employers demand it.
Theo: There are doctors and lawyers who are not allowed to practice anymore because they violated ethical standards. This is what maintains trust between the profession in general and society.
Laura: A computer system is not like a bridge, ethically speaking. It is even more complex. Rationale: We can’t easily inspect computer systems. Not that bridges are easy to inspect, but they are at least physically accessible for inspection. Complex computer systems are ever-changing and inspection seems infeasible.
Theo: We are, as an industry, in our thirties but we are behaving like 4-year-olds.
They discussed a bunch of examples that I think are best experienced first hand. So instead of writing them down, I allow myself to redirect the reader to the video recording (to be released in a couple of weeks).
It’s an incredibly important discussion we have to have. Otherwise we end up being tightly regulated or lose our creative freedom.
Related: Tanya Reilly’s talk on The History of Fire Escapes at SREcon Americas 2018.
It has been decided!
In the last couple of weeks the community discussed what a good collective noun for a group of SREs would be. Obviously, a rant of SREs is the best option. But since we like democracy we had a vote on it. The winner is a cluster of SREs. I guess that is democracy: taking a reliable second-best option and running with it.
This was another great SREcon! Thanks fly out to the program chairs, the program committee, the speakers, the room captains, the helping hands in the background, and the companies that encouraged their employees to share content and supported with travel and diversity grants. Thank you!
I had interesting conversations and now look forward to some chillaxing before I leave again tomorrow. I’ll miss Singapore!