Phew! Day three comes to an end here in Singapore! It’s Friday evening which means the weekend is about to start. 🎉 Sorry European and American folks, you have a few more hours to go.
We had a generous dinner reception last night with highly attentive staff. That may have led to one or the other SRE exceeding their own SLO for beverages. At least that was my impression when I looked at the tail latency for showing up for the first talks of the day. 🤨 Anyway, let’s sum up what’s been hot today!
Getting More out of Postmortems and Making Them Less Painful to Do
Speaker: Ashar Rizqi (Blameless)
Blameless Inc. is rumored to be the new star in the SRE tools realm. Therefore I was interested to learn how they think about post mortems.
Ashar asked the audience to agree on the benefits of doing post mortems by show of hands:
- Build more reliable systems
- Get important insights
- Continuously innovate - build new features
- Hire and retain the best talent by having a blameless culture
Then Ashar moved on to case studies from Box, DigitalOcean, and Home Depot. And another few rounds of show of hands. He likes show of hands. ✋✋
The typical post mortem process as seen in Blameless’ case studies.
Who should own the post mortem?
- Service owner owns post mortem
- Manager/Director/VP owns time allocation
- Track ownership via ticket
- Set up a Post Mortem Guild. That is, a group inside the organization that is passionate about post mortems.
How to ensure on-time completion?
- Block new releases for teams with outstanding post mortems
- Do them on Slack or whatever the internal chat app is
- Allocate time on sprint, escalate to C-level
- Gamify and reward for on-time completion
I have some thoughts on this. Some of the suggested solutions are actually big bombs to set off. Escalating to C-level is something that one can pull only so often. Gamifying is always a risky move. People will play the game.
Who collects all the details in a timely manner?
According to Ashar the Incident Commander is responsible for pulling in all the information needed. Once again, a synchronous process via chat app is suggested.
How to track action items?
- Use one ticketing system. Looks like it is not unusual to have multiple ticketing systems in a company. Sounds chaotic to me.
- Use tags profusely, most systems have them. I’d add: And filter for those tags and actually use them. 😉
- Have SLO impact attached to action item.
- Generate daily outstanding action item report.
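The tagging and daily report ideas are easy to picture in code. Here is a minimal sketch of my own, not something shown in the talk. `ActionItem`, the `postmortem` tag, and both functions are hypothetical names, and I assume the items have already been exported from the one ticketing system.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    ticket_id: str
    title: str
    owner: str
    due: date
    tags: set[str] = field(default_factory=set)
    slo_impact: str = ""  # free-text note on which SLO the item protects
    done: bool = False


def outstanding(items, tag="postmortem"):
    """Return open items that carry `tag`, earliest due date first."""
    open_items = [i for i in items if not i.done and tag in i.tags]
    return sorted(open_items, key=lambda i: i.due)


def format_report(items, today):
    """Render one line per item, stating how far it is from its due date."""
    lines = []
    for item in items:
        days = (today - item.due).days
        status = f"overdue by {days}d" if days > 0 else f"due in {-days}d"
        lines.append(
            f"{item.ticket_id} [{item.owner}] {item.title} "
            f"({status}; SLO impact: {item.slo_impact or 'n/a'})"
        )
    return "\n".join(lines)
```

Run `format_report(outstanding(items), date.today())` from a daily job and post the result wherever the team will actually read it.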
How to foster a blameless language?
- Don’t call out individuals or teams. If you have to, use initials not full names.
- Move away from a single root cause
I missed the remaining points because the next slide was up already. Presentation speed up? 🏎
What is still debated or unresolved?
- Asynchronous vs. Synchronous
- When do you declare a post mortem complete?
- Knowledge Extraction: How to get value out of a database of post mortems?
I was thinking of skipping this talk initially. After all, everyone believes Google has figured out and perfected the post mortem process. However, I have learned about different ways and processes to create, own, and follow up on post mortems. Most interestingly, many companies seem to prefer synchronous tools like meetings or chat. I am more used to a primarily asynchronous process. It was also nice to see how ownership of post mortems and of the post mortem process is handled at other companies. Big thanks to the audience for openly sharing their internal processes and structures!
The MTTR Chronicles: Evolution of SRE Self Service Operations Platform
Speakers: Jason Wik, Jayan Kuttagupthan, and Shubham Patil (VMware)
Jason introduced the challenges that the SRE team at VMware is facing. They aim to reduce the MTTR (Mean Time To Recovery) in a landscape of diverse multi-environment infrastructure that looks very different for each customer. Basically our SRE nemesis: Complexity! 🤯
A complex environment.
Shubham continued by highlighting how the team approached the challenges. They created a platform called North Star:
An Extensible, Dynamic, and Collaborative platform to reduce MTTR and improve operational efficiency for unique and constantly changing environments.
They integrated different platforms into a single user interface:
- Alerting (PagerDuty) 🚨
- Health and status information
- Automation tasks
However, these were basically tabs in a larger tool.
An engineer is overwhelmed by a multitude of tools and processes.
Jayan proceeded to show how they correlated the data from the different tabs (platforms) into a single pane of glass that provides a 360 degree view of incidents. With alerts and ongoing automation tasks correlated, an SRE can triage more quickly and thus reduce the MTTR. Having cause, symptom, and action presented in a single place helped them react faster. The UI was reported to be responsive and updated in real time. It also integrates with Business Intelligence and ticketing systems.
Unfortunately there were no screenshots or demos of the platform. I would have really loved to see what this wondrous platform looks like in practice. 😳
Building Centralized Caching Infrastructure at Scale
Speaker: James Won (LinkedIn)
James is part of the Caching as a Service (CaaS) team at LinkedIn. Pretty cool: He used slido’s live polling feature in his presentation. Very nerd-friendly way of involving the audience.
At LinkedIn, teams were frustrated with operating memcached. As a drop-in replacement they decided to use Couchbase. Then the usage exploded to over 2000 hosts in production. These were managed by different teams. That caused some problems:
- Lack of operations interest: Teams just wanted to cache data and were not interested much in running the infrastructure for that.
- Custom deployments: Including maintenance windows during which cache clusters were not available.
- Runaway hardware growth: Waste of money.
LinkedIn decided to create a caching team that manages caching in a centralized way at scale. The team had three main goals:
- Build & manage at scale
- Improve hardware efficiency
- Improve security
The CaaS team now provides:
- 0-1 ms 95th percentile latency for get and set operations
- 10ms SLO
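For readers who don't juggle percentiles every day, here is a small sketch of how such a check could look. It is my own illustration, not LinkedIn's tooling. The function names are made up, and I assume the 10 ms SLO is evaluated against the 95th percentile, which the talk did not spell out.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(latencies_ms, slo_ms=10.0, pct=95):
    """True if the pct-th percentile latency stays within the SLO."""
    return percentile(latencies_ms, pct) <= slo_ms
```

With 95 requests at 0.4 ms and 5 slow outliers at 25 ms the p95 is still 0.4 ms, so the check passes. One more slow request in the same window pushes the p95 to 25 ms and the check fails.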
Depending on the use case the caching may be backed up by SSD, HDD, or pure memory. The team also provides dashboards and metrics to the teams. Furthermore, as the caches are fully managed, the CaaS team takes care of OS and software updates. However, they don’t want to own the data and they do not own backups. The teams know best what is acceptable for their data and therefore own the data. The CaaS team runs about 2000 hosts serving over 10 million QPS across multiple clusters.
One of the slido questions used by James to engage the audience.
Among the challenges they encountered were GDPR and configuration management. To deal with the configuration challenge they created a wrapper around Couchbase. The wrapper was built in a way that it could be plugged into the existing deployment process and tools. Another challenge was to run Couchbase as a non-root user which, at that time, was not supported by Couchbase. James reports significant code changes and an overhaul of the deployment process. They were able to pull it off without customers noticing. Nice!
- Treat servers as cattle🐮, not pets😾.
- Start with a core offering and iterate from there.
- Codify checklists✅into automation⚙️ once they are reasonably well tested.
- Build platforms, not tools🛠.
- Trust your automation🤖. If needed, try to understand your automation better.
The next step for the CaaS team is to provide a self-service for creating caching buckets. Very good talk that provided interesting insights.
Hybrid XFS - Using SSDs to Supercharge HDDs at Facebook
Speaker: Skanda Shamasunder (Facebook)
Skanda started strong by claiming: A stupidly simple solution that looks risky on the outside can greatly improve performance.
The IO Wall
More bytes fit on each disk, but because the number of arms per disk has stalled, the number of seeks a disk can do remains the same. Meanwhile workloads are getting hotter: ML, video streams. This means that today we buy more disks to get more IO and not to get more storage.
The IO Wall defined as the point where IO and capacity boundaries cross.
Facebook uses XFS. They noticed that an interesting number of metadata writes were happening. About a quarter of the IO was spent on metadata writes. Then they stumbled upon XFS real-time mode, a little-known feature that puts data and metadata on different devices. So they thought: why not put metadata on SSDs where IOPS are cheap and put data on HDDs where bytes are cheap? That’s just what they did in an experiment. The experiment went really well. They nearly eliminated random writes. Now they can use bigger disks and fully utilize them.
Data goes to spinning disk while metadata ends up on SSD.
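To make the IO wall a bit more tangible, here is a back-of-the-envelope sketch. All numbers are made up by me for illustration and none of them come from the talk. The idea is just that a fleet needs enough disks for the bytes and enough disks for the IO, and whichever count is larger wins.

```python
import math


def disks_needed(data_tb, workload_iops, disk_tb, disk_iops):
    """Return (number of disks, limiting factor) for a workload.

    Capacity per disk (disk_tb) keeps growing over the years, while the
    random IOPS a single spinning disk can serve (disk_iops) stays flat.
    """
    for_capacity = math.ceil(data_tb / disk_tb)
    for_io = math.ceil(workload_iops / disk_iops)
    limit = "io" if for_io > for_capacity else "capacity"
    return max(for_capacity, for_io), limit
```

Assume 1000 TB of data, 20,000 IOPS of demand, and 100 IOPS per disk. With 4 TB disks you need 250 of them and capacity is the limit. With 16 TB disks capacity alone would need only 63, but you still have to buy 200 for the IO. That is the wall.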
At this point in the talk Skanda fooled the audience. No spoilers, go watch the recording once it is out!
SSD failures? With the metadata lost they would also lose the data. And what if the workload changes? Is it worth rolling out such a fundamental change to the fleet?
SSDs die less often than HDDs. But they take the HDDs with them because the data is useless without its metadata. What if multiple SSDs die at once? The solution is to replicate the metadata. They also ran endurance tests that went well overall. The next item to analyze was whether buying SSDs to utilize the HDDs better was a good business move. It turns out to be a money saver because they don’t have to buy thousands of new disks to increase the IO.
How to roll this change out to tens of thousands of hosts? Carefully and with automation!
- Hard problems can have simple solutions.
- Gut feelings can be wrong.
- Data wins arguments
- Better safe than sorry
Extending a Scheduler to Better Support Sharded Services
Speaker: Laurie Clark-Michalek (Facebook)
Sharded services are workloads where each task needs access to a shard of data. It took me a while to get used to what scheduling means in this context. I am used to a different mode of running services. A scheduler at Facebook seems to be responsible for scheduling and de-scheduling tasks on hardware hosts.
The central piece is a scheduler that knows about machine health, updates, and upcoming maintenance. The scheduler, however, has to ask the service if it can proceed with the planned scheduling operation. The service might decline being re-scheduled.
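The handshake between scheduler and service could look roughly like the sketch below. This is my own toy model and not Facebook's scheduler API. `Operation`, `ShardedService`, `Scheduler`, and the `min_available` rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    kind: str  # for example "move" or "drain"
    task: str


class ShardedService:
    """Declines operations that would leave a shard with too few tasks."""

    def __init__(self, shard_tasks, min_available):
        self.shard_tasks = shard_tasks  # shard name -> set of task names
        self.min_available = min_available
        self.unavailable = set()  # tasks already handed to the scheduler

    def approve(self, op):
        for tasks in self.shard_tasks.values():
            if op.task not in tasks:
                continue
            remaining = tasks - self.unavailable - {op.task}
            if len(remaining) < self.min_available:
                return False  # the service says no
        self.unavailable.add(op.task)
        return True


class Scheduler:
    """Asks the service before every operation instead of always winning."""

    def __init__(self, service):
        self.service = service

    def run(self, ops):
        done, deferred = [], []
        for op in ops:
            (done if self.service.approve(op) else deferred).append(op)
        return done, deferred
```

A scheduler that always wins would simply skip the `approve` call. Everything discussed in the talk is about what should happen with the deferred list.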
Trivia: At lunch Laurie and I talked about engineering culture at Facebook. I asked him which single emoji he would pick to describe Facebook. My guess: 🤠 His actual answer: 🙃
I have the feeling that the design of the scheduler is heavily influenced by Facebook’s engineering culture, which gives service owners significant freedom and say down to the hardware level.
In the end it is about a trade-off of power. Should service owners be allowed to block a task migration? Should they be able to block rack or data center drains? Or do you want a scheduler that always wins?
The overall question may be: Can we make schedulers aware of sharded services and their special needs? Maybe even in Kubernetes? Do we want this at all?
Yes, No, Maybe? Error Handling with gRPC Examples
Speaker: Gráinne Sheerin (Google)
I was looking forward to this talk because I stumbled upon gRPC error handling in the past.
If a service’s response doesn’t equal OK it gets interesting.
gRPC has ~16 status codes to indicate errors. Only a few of them can be issued by the auto-generated code and the gRPC library. All status codes can be used by the application developer to communicate the reason for an error.
Gráinne walked us through a couple of interesting error cases. For example, DEADLINE_EXCEEDED can mean a request never reached the server. It can also mean everything worked but the response came in too late at the client stub. Then the client stub overrides the OK status code with DEADLINE_EXCEEDED. You can’t use the response and you wasted resources on the server side. Mind blown. 🙀 I think I never thought about the stubs really. I just assumed they are there and transparent.
A better approach is to check the deadline on the server side and cancel the request there if it can’t possibly be served in time. It gets more confusing if the server sets the status code to DEADLINE_EXCEEDED instead of CANCELLED. I feel like debugging this can be painful. If metrics are used for troubleshooting one should be aware at which stub the specific metrics were collected.
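To get my head around the two places where a deadline can bite, I find a tiny simulation helpful. The sketch below deliberately does not use the gRPC library. `handle`, `client_stub`, and the `estimated_cost_s` parameter are made-up names, and the `Status` enum only mirrors three of the real status codes. In the Python gRPC library the server-side check would be built on `context.time_remaining()`.

```python
import enum
import time


class Status(enum.Enum):
    OK = 0
    CANCELLED = 1
    DEADLINE_EXCEEDED = 4


def handle(deadline, estimated_cost_s, work, now=time.monotonic):
    """Server side: give up early if the work can't finish before the deadline."""
    remaining = deadline - now()
    if remaining < estimated_cost_s:
        return Status.CANCELLED, None  # no server resources wasted
    return Status.OK, work()


def client_stub(deadline, server_status, response, now=time.monotonic):
    """Client stub: a response that arrives late is dropped, even if it was OK."""
    if now() > deadline:
        return Status.DEADLINE_EXCEEDED, None
    return server_status, response
```

Passing a fake clock as `now` makes both paths easy to poke at. The same successful server response ends up as OK or as DEADLINE_EXCEEDED depending only on when the client stub sees it.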
- Tell clients which errors are temporary and which are permanent.
- If there is more than one error return the most specific one.
- Hide implementation unless you want client decisions to depend on it.
- Don’t blindly propagate errors. They can contain confidential data.
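Two of these recommendations fit into a few lines of code. The sketch below is my own and again avoids the gRPC library. The `Code` enum mirrors a handful of real status codes, while `TRANSIENT`, `should_retry`, and `to_client_error` are hypothetical. Which codes count as transient is an assumption here. UNAVAILABLE is the usual candidate, anything beyond that is a per-service decision.

```python
import enum


class Code(enum.Enum):
    INVALID_ARGUMENT = 3
    NOT_FOUND = 5
    PERMISSION_DENIED = 7
    INTERNAL = 13
    UNAVAILABLE = 14


# Assumption of this sketch: only UNAVAILABLE is worth retrying.
TRANSIENT = {Code.UNAVAILABLE}


def should_retry(code):
    """Client side: retry temporary errors, give up on permanent ones."""
    return code in TRANSIENT


def to_client_error(backend_code, backend_details):
    """Server side: translate a backend error before passing it on.

    backend_details is intentionally dropped because it may contain
    confidential data or leak implementation details.
    """
    if backend_code in TRANSIENT:
        return Code.UNAVAILABLE, "temporarily unavailable, retry with backoff"
    return Code.INTERNAL, "internal error"
```

The client never learns that, say, a particular backend table was not found. It only learns whether retrying makes sense.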
Ethics in SRE
Speakers: Laura Nolan (Slack) and Theo Schlossnagle (Circonus)
Details: Ethics in SRE
History time: Civil engineering had its fair share of disasters that killed people. Engineers were the ultimate experts on the things they were building and as professionals they started caring about doing things right. Laura and Theo argue our profession is at a similar point today. We have to start thinking more about serving society and demanding ethical standards for the work we do. Similar to civil engineers who refuse to build bridges with a structure too weak to carry the load, even when their employers demand it.
Theo: There are doctors and lawyers who are not allowed to practice anymore because they violated ethical standards. This is what maintains trust between the profession in general and society.
Laura: A computer system is not like a bridge, ethically speaking. It is even more complex. Rationale: We can’t easily inspect computer systems. Bridges are not easy to inspect either, but they are at least physically accessible for inspection. Complex computer systems are ever-changing and inspection seems infeasible.
Theo: We are, as an industry, in our thirties but we are behaving like 4-year-olds.
They discussed a bunch of examples that I think are best experienced first hand. So instead of writing them down I allow myself to redirect the reader to the video recording (to be released in a couple of weeks).
It’s an incredibly important discussion we have to have. Otherwise we end up being tightly regulated or lose our creative freedom.
Related: Tanya Reilly’s talk on The History of Fire Escapes at SREcon Americas 2018.
It has been decided!
In the last couple of weeks the community discussed what a good collective noun for a group of SREs would be. Obviously, a rant of SREs is the best option. But since we like democracy we had a vote on it. The winner is a cluster of SREs. I guess that is democracy: having a reliable second-best option and running with it.
The rant not being the most favored collective noun for a group of SREs.
This was another great SREcon! Thanks fly out to the program chairs, the program committee, the speakers, the room captains, the helping hands in the background, and the companies that encouraged their employees to share content and supported with travel and diversity grants. Thank you!
I had interesting conversations and now look forward to some chillaxing before I leave again tomorrow. I’ll miss Singapore!
Today it was raining again. Please have a skyline picture from two days ago when we still had sun.