SREcon Asia/Pacific 2019

Day One

Good Morning, Europe! Please enjoy my SREcon Asia/Pacific 2019 day one conference notes, fresh off the press from beautiful Singapore. This year I signed up for volunteering at the Google booth and for serving as room co-captain at a couple of sessions. The reports will probably be a bit shorter than last year’s.

Opening

SREcon APAC is growing like crazy! The program co-chairs Frances Johnson (Google) and Avleen Vig (Facebook) opened the event welcoming attendees and sponsors. They also provided data:

It’s always interesting how attendees self-identify regarding their role. This time we have 58% engineers and 27% managers, while the rest identify as tech lead or other. Last year I was in the role of tech lead, but I have changed sides and now count toward the growing number of managers interested in SRE.

The organizers tried something new this year: Slido table topics. That is, attendees are invited to suggest and vote on topics. The most popular ones are printed on paper and put on some of the lunch tables. These tables are thus designated for people interested in the table topic. This helps kickstart conversations in a birds-of-a-feather kind of way.

This year’s theme is Depth of Knowledge. Our industry is expanding, we deal with increasing complexity over time, and we need to strive harder to understand how our systems work.

The opening session room before it filled up.

A Tale of Two Postmortems - A Human Factors View

Speaker: Tanner Lund (Microsoft, Azure SRE) with help from role-playing actors whose names I did not catch, but they should be on the recording.

Details: A Tale of Two Postmortems

First role play: A post mortem discussion. It was more like an interrogation around human error and how to prevent it in the future (monitoring, playbooks, slowing down releases, re-thinking whether automation was the right choice).

Second role play: A conversation focused on what happened, how the on-caller felt, what information they had at that time, and how that led to the conclusions that were drawn. This surfaced alert fatigue and also that the way the systems are run is exhausting to the on-callers. To me, some of the processes also looked sub-optimal. By changing the way people talk about what happened, these very interesting facts were discovered. The overall impression of the second role play was that it provided much more psychological safety.

My key takeaways:

Availability: Thinking beyond 9’s

Speaker: Kumar Srinivasamurthy (Microsoft, Bing)

Details: Availability—Thinking beyond 9s

Any large system has outages. Kumar gave some examples; as an industry, we are not short of those.

Bing had an incident where the product did not present the actual search results but only sponsored results (ads). Not really a graceful degradation, because serving only ads has the potential to scare users away.

Question from the speaker: What is a good availability number? Audience: It depends! Smart 😅

First example: Parking meter. What availability is really needed? It’s likely enough if it works on weekdays. It should reliably collect money. Occasional downtime for maintenance does not hurt much.

Next example: Flying. Given 100k flights a day, what availability do we need? Three nines (99.9%) would result in multiple disasters per day. Even five or six nines would still be too much. Interesting, I had never thought about how many nines flying needs. Of course, this example is too simplified. But it serves as a good starting point for thinking about nines.
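Here is my own back-of-the-envelope arithmetic for the flight example (my numbers, not the speaker’s): the expected number of failed flights per day is simply the daily volume times the unavailability.

```go
// Back-of-the-envelope numbers for the flight example (my own arithmetic,
// not from the talk): expected failures per day at a given availability.
package main

import "fmt"

func main() {
	flightsPerDay := 100000.0
	for _, availability := range []float64{0.999, 0.99999, 0.999999} {
		fmt.Printf("%.4f%% availability => about %.1f failed flights per day\n",
			availability*100, flightsPerDay*(1-availability))
	}
}
```

Three nines already means roughly 100 failed flights every single day, which makes it obvious why aviation operates on a very different scale of reliability.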

The talk progressed looking into typical challenges of measuring and monitoring availability and reliability. I found most points rather high level and think they are already covered in the SRE books.

Fun quote: Adding nines since 2009!

Use Interview Skills of Accident Investigators to Learn More about Incidents

Speaker: Thai Wood (Resilience Roundup)

Details: Use Interview Skills of Accident Investigators to Learn More about Incidents

My first thought was: lamp-in-the-face-style interrogation. Thai immediately made clear this is not about interrogation! It is about respectfully interviewing people to understand an incident. Research has shown that how you ask questions makes a significant difference.

Leading without Managing: Becoming an SRE Technical Leader

Speaker: Todd Palino (LinkedIn)

Details: Leading without Managing: Becoming an SRE Technical Leader

Todd shared his story of how he became a technical leader despite having spent about a decade at VeriSign. His words, my interpretation. 😇

There is nothing wrong with banging out code.

At LinkedIn, he reported, they have a clear way of measuring engineers. You might get a promotion on the back of one or two projects, or because of tenure. They have clearly defined titles and definitions of what they mean. The natural progression is chasing titles, but not in a bad way. Interesting.

He pointed out that being a senior staff engineer means something different at other companies. There is no formal definition that spans the industry.

Here’s what LinkedIn has come up with:

The three pillars of Engineering at LinkedIn:

Then digging into leadership! It’s hard to determine if someone is a good leader. Roles are increasingly less clearly defined as one moves into leadership. Some say: You define your role! How helpful. 🙃

Find your happy path. What do you like? Is it management meetings and moving things forward? Is it cranking out code? Is it thinking about where technology should be heading? What kind of leadership do you want for yourself?

One way to find out could be project work which comes with the following aspects:

This talk was so full of content that I had a hard time following and taking notes at the same time. Here are some unsorted items from the talk:

Can’t help myself but I have to point out that I found it incredibly hard to read some of the slides. The colors were not the best choice. 🙈 Nevertheless, good talk and Todd is a great speaker!

LinkedIn being at the forefront of SRE and engineering culture is becoming a repeating pattern in these talks. These folks really know about and reflect on their craft.

We were served colorful, tasty food.

Let’s Build a Distributed File System

Speaker: Sanket Patel (LinkedIn)

Details: Let’s Build a Distributed File System

Sanket first explained how basic file system operations such as listing, reading, and writing relate to inodes. Then he went on to design a simple distributed file system. Distributed, in this context, means running one file system in a fault-tolerant way across multiple networked hosts with disks attached. In my experience the crucial points in designing such a system are locking and consistent metadata management. Sanket solved this by introducing a master server that serves all metadata and minion servers that store the data blocks. Clients talk to the master and the minions to perform file system operations.
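To make the division of labor concrete, here is a minimal sketch of the master’s metadata role as I understood it. This is my own illustration, not Sanket’s code, and all the names (BlockLocation, Master, Lookup, Commit) are made up: the master only answers where the blocks live, and clients then read the bytes from the minions directly.

```go
// A minimal sketch of a metadata master for a toy distributed file system.
// Illustrative only: maps a file path to the minions holding its blocks.
package main

import (
	"errors"
	"fmt"
	"sync"
)

// BlockLocation tells a client which minion serves which block of a file.
type BlockLocation struct {
	MinionAddr string
	BlockID    string
}

// Master keeps all file system metadata in memory, guarded by a mutex.
type Master struct {
	mu    sync.RWMutex
	files map[string][]BlockLocation // path -> ordered block locations
}

func NewMaster() *Master {
	return &Master{files: make(map[string][]BlockLocation)}
}

// Lookup returns the block locations for a path; the client fetches the
// actual bytes from the minions afterwards.
func (m *Master) Lookup(path string) ([]BlockLocation, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	locs, ok := m.files[path]
	if !ok {
		return nil, errors.New("no such file")
	}
	return locs, nil
}

// Commit records the block locations of a newly written file.
func (m *Master) Commit(path string, locs []BlockLocation) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.files[path] = locs
}

func main() {
	m := NewMaster()
	m.Commit("/photos/cat.jpg", []BlockLocation{{MinionAddr: "minion-1:9000", BlockID: "blk-0001"}})
	fmt.Println(m.Lookup("/photos/cat.jpg"))
}
```

Keeping all metadata behind a single in-memory map and lock is exactly why the master becomes the bottleneck and single point of failure I mention below.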

I’d argue that this does not really solve the problems we usually face when dealing with distributed file systems. The master, for example, poses a risk because it is a single point of failure. The server is threaded, but I am missing locking and synchronization. Data consistency is not guaranteed by this design. On the other hand, Sanket warned in the talk description that this would be a very basic implementation. After all, this talk appeared in the Core Principles track which is designed for exactly this type of foundational education.

The source code is public and described on Sanket’s blog. Great work, Sanket!

Shipping Software with an SRE Mindset

Speaker: Theo Schlossnagle (Circonus)

Details: Shipping Software with an SRE Mindset

Circonus uses C for high-performance applications. One of the challenges (next to debugging an application written in C, that is) is exposing metrics. Naturally, they had to write their own metrics library that is fast and safe to use concurrently.
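Circonus’ library is in C and I won’t try to reproduce it here, but the core idea behind a fast, concurrent counter metric can be sketched in a few lines of Go (my own illustration, not their code): hot code paths do lock-free atomic adds and never block on instrumentation.

```go
// My own minimal illustration of a concurrency-safe counter metric:
// updates are lock-free atomic adds, reads are atomic loads.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Counter is a monotonically increasing metric safe for concurrent use.
type Counter struct {
	value uint64
}

func (c *Counter) Inc()         { atomic.AddUint64(&c.value, 1) }
func (c *Counter) Load() uint64 { return atomic.LoadUint64(&c.value) }

func main() {
	var requests Counter
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			requests.Inc() // what a request handler would do
		}()
	}
	wg.Wait()
	fmt.Println("requests served:", requests.Load())
}
```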

Random notes:

The talk included code samples. It was good to see and think some C for a change. Go has taken over basically all my coding and I don’t even read C code anymore. 😢 Good ol’ C.

An unrelated herd of emojis trapped in a Singapore mall.

Networking at the Google Booth

An interesting experience for someone who is more of an introvert in face-to-face conversations. I talked to people who approached our booth and learned how many companies are out there doing products that are similar to Google products and also very successful. Getting myself out of Europe every now and then continues to be a great experience.

I also had some non-technical conversations. I had the chance to once again share my theory of the incompleteness of the German language. For example, we have an adjective for being hungry (hungrig) and one for not being hungry anymore after eating something (satt). However, for being thirsty (durstig) there is no such opposite adjective. We tried to make one up twice. The first time we could not decide on a word. The second time the word (sitt) was not really accepted by the wider population.

Anyway, I did let my mind wander… What an exhausting first day. Stay tuned for tomorrow!

Day Two

Welcome, welcome to another day in Singapore full of interesting talks at SREcon Asia/Pacific 2019. Here’s the gist of what I listened to, talked about, and learned today. I spent more time in technical talks today. I needed a bit of a deep dive to contrast the high-level topics from yesterday. The inner engineer is strong with me. 👷‍♂️ I am glad SREcon offers both! Kudos to the program committee.

The Early Bird

Speaker: Me (Google) about work that predates my joining date and is unrelated to my employer.

Details: Implementing Distributed Consensus

I shall not toot my own horn. I’ll let others decide whether it was worth their time. Here’s the source code for your interest: The Skinny Distributed Lock Service

The Skinny Distributed Lock Service in action.

Edge Computing: The Next Frontier for Distributed Systems

Speaker: Martin Barry (Fastly)

Details: Edge Computing: The Next Frontier for Distributed Systems

Martin pointed out that the talk was about personal observations and not related to Fastly.

Martin wants to spark a discussion about Edge Computing within the community and shared his thoughts and definition.

Executing non-trivial functionality as close to the client as is reasonable.

Martin presented a hierarchy of where a request can be served from, ranging from the origin (e.g. the non-cached original response) down to the client itself.

One challenge is that subsequent requests can end up at different levels of that hierarchy or different serving processes at the same level. I assume cache management is challenging for such setups. Martin says scalable solutions should depend on as little state as possible.

Typical applications for Edge Computing are:

Interesting: How does one do Edge Computing anyway? I learned that the runtime is often provided by the entity running the edge cache/compute resources. I wasn’t aware of that and somehow thought I would own the full stack. Common practice is using Domain Specific Languages (DSLs) or containers, with containers leading to multiple copies of the same data in memory, which is wasteful.

The new hotness are these two fellas:

New challenges:

It totally makes sense! Why not push WASM to whatever runtime is provided? It could even be the client’s browser if the application allows that. Very informative talk! 👍
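To make that idea tangible, here is a toy sketch of what pushing logic as WASM could look like using the standard Go toolchain. This is purely my own illustration under the assumption of a browser-like JavaScript host; the function name and the path-rewriting logic are made up.

```go
// A toy edge function compiled to WebAssembly with the standard Go toolchain:
//   GOOS=js GOARCH=wasm go build -o handler.wasm
// Assumes a browser-like JavaScript host; real edge runtimes each define
// their own host interface.
package main

import "syscall/js"

// rewritePath normalizes a request path before it is sent to the origin,
// a typical tiny piece of edge logic.
func rewritePath(this js.Value, args []js.Value) interface{} {
	path := args[0].String()
	if path == "/home" {
		path = "/"
	}
	return path
}

func main() {
	// Expose the function to the embedding JavaScript/WASM runtime.
	js.Global().Set("rewritePath", js.FuncOf(rewritePath))
	select {} // keep the Go runtime alive so callbacks keep working
}
```

In the browser case the embedding page loads the module with Go’s wasm_exec.js glue and can then call rewritePath like any other function.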

Critical Path Analysis - Prioritizing What Matters

Speaker: Althaf Hameez (Grab)

Details: Critical Path Analysis - Prioritizing What Matters

Althaf asks: Is there a subset of services that is critical to keeping our core business running? The definition of core business here being:

Any impact on our system that impacts the ability of a passenger to get a car to get safely from Point A to Point B.

It may not have all the bells and whistles of the fully fledged experience but it gets the core business done.

Apparently, where Grab operates, cash is still a thing (Althaf: Cash is King). That means cashless payments are not really on the critical path, which ruled out a number of services as non-critical. If you have ever worked with payment processing you might know how much complexity can hide in there.

To validate the critical path they ran it in the testing environment first, then in production. Scary? Kind of, but Grab operates in South-East Asia only and therefore sees a pretty cyclic user pattern. They were able to test their hypothesis relatively safely in production during the night.

On an organizational level, Althaf sought executive sign-off for the riskier things and involved the product teams early. On a technical level, circuit breakers did the job (a minimal sketch of the idea follows below). Caching helped to degrade gracefully; for example, geofences and city details rarely change and can be safely cached for a while.
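For readers who have not used one: here is a minimal circuit breaker sketch. It is my own illustration, not Grab’s implementation. After a few consecutive failures the breaker opens and calls fail fast for a cool-down period, so a sick dependency cannot drag the critical path down with it.

```go
// A minimal circuit breaker sketch (illustrative only): after maxFailures
// consecutive errors the breaker opens and rejects calls until the cooldown
// has passed.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit open: failing fast")

type Breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	openUntil   time.Time
	cooldown    time.Duration
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open; failures are counted and
// eventually trip the breaker.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return ErrOpen
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openUntil = time.Now().Add(b.cooldown)
			b.failures = 0
		}
		return err
	}
	b.failures = 0
	return nil
}

func main() {
	b := NewBreaker(3, 5*time.Second)
	err := b.Call(func() error { return errors.New("rating service timed out") })
	fmt.Println(err)
}
```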

Sometimes, Althaf reports, it is OK to fail. For example, a ride usually ends with the passenger seeing a dialog to rate the driver. However, when the rating service is failing, the dialog won’t go away. With the dialog in the way the passenger cannot book a new ride. While rating a driver is an important feature, it is not essential enough to block new bookings. Nowadays the Grab app continues to let you book rides even if the rating failed. This shows what a critical path analysis can surface. In hindsight it sounds obvious, but we all know how often our systems surprise us with things that should be obvious, right? 😲

A karaoke booth trying to lure me away from my critical path to SREcon. I love Asia.

Collective Mindfulness for Better Decisions in SRE

Speaker: Kurt Andersen (LinkedIn)

Details: Collective Mindfulness for Better Decisions in SRE, Slides

I only recently started meditating (can recommend), which means I am still at the beginning of the mindfulness journey. Mindfulness has become an interesting and helpful tool for me, so my expectations for Kurt’s talk were quite high. I was not disappointed. 🤩

The concept of collective mindfulness has emerged in research over the last 20 years. Kurt kicked it off by giving examples of mindlessness, which he defined as automatic and reactive behavior. Like staring at the same dashboards again and again and reacting the same way. There is a lot of potential for mindlessness in SRE.

Kurt describes the characteristics of the environment we work in as VUCA: volatile, uncertain, complex, and ambiguous.

Mindfulness, in contrast, is being able to reflect on our behavior and identify improvements in how we react to VUCA.

Collective Mindfulness: Capability to discern discriminatory detail about emerging issues and to act swiftly in response to these details.

One SRE-related aspect of collective mindfulness is the reluctance to accept simple solutions. I can think of “it must have been a network blip” as a common explanation some teams use instead of properly investigating an issue.

The aspects include:

To continue practicing collective mindfulness a team needs to know where to look instead of looking at all the things. I sense this is something we need intuition for?!

Here’s what to do:

From the military we learned about the STICC model of communication: Situation, Task, Intent, Concerns, Calibrate.

Kurt called this form of communication a ritual and emphasized that it is a useful tool for defusing a stressful situation. It fits in well with other rituals we have in SRE, like writing post mortems or following up on action items by setting up project work with a clear goal. From my time serving as an army officer I can confirm that ritualized communication can be a great tool in the right situations. The more it is practiced, the better it works when things go so wrong that there is no time for questioning the general approach.

Dangers to collective mindfulness are our IT systems themselves, when they force us into routine actions or ship automation that is hard to understand.

Linux Memory Management at Scale: Under the Hood

Speaker: Chris Down (Facebook)

Details: Linux Memory Management at Scale: Under the Hood

At the beginning of the talk Chris went over fundamentals on Linux resource management such as cgroups. Resource management is a tricky business. For example, if you memory-limit an application too much you basically transform memory into disk I/O. Not really a win. Chris claims even seasoned SREs often have misconceptions about how memory at scale works.

After ranting about the resident set size as a metric, Chris came to the conclusion that unless we heavily measure and instrument an application we cannot really tell how much memory it uses.

Swap, however, deserves a better reputation than it has. It’s not emergency memory, although it is often seen as such. Having swap, he argues, is like running make -j cores+1. That is, it puts a little bit of pressure on memory to really squeeze the last bit of performance out of it. Without swap there would still be disk I/O, e.g. file system cache pages being evicted by writing them to disk.

The OOM killer shall not be trusted. It is always late to the party. It doesn’t really know what to kill, which means it may go to the wrong party. To avoid this, the Linux kernel (via kswapd) tries to reclaim the coldest pages. Some pages may not be reclaimable that way. Swap can help here by moving them to disk temporarily.

Quiz for the audience: What metric do you look at to find out if there is a memory resource issue? Audience ideas:

Chris thinks memory pressure is a much better metric. I agree. I’m a big fan of memory pressure and most of the time it is the only metric I have to look at to rule out memory issues in a system. 👀

Operating systems have many consumers of memory: user allocations, file caches, network buffers, etc. Memory pressure happens when there is a shortage of memory. It represents the work that Linux (or any other OS) does in order to manage and shuffle memory around to satisfy the system’s many users.
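On Linux, the pressure stall information (PSI) interface is one way to get at this metric. Here is a tiny sketch of mine (not from the talk) that simply reads it; it assumes a 4.20+ kernel with PSI enabled.

```go
// A small sketch that reads the kernel's pressure stall information for
// memory. Assumes Linux 4.20+ with PSI enabled; the file looks like:
//   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
//   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
package main

import (
	"fmt"
	"os"
)

func main() {
	data, err := os.ReadFile("/proc/pressure/memory")
	if err != nil {
		fmt.Println("PSI not available:", err)
		return
	}
	// avg10/avg60/avg300 are the percentage of wall time that tasks spent
	// stalled on memory over the last 10, 60, and 300 seconds.
	fmt.Print(string(data))
}
```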

A graph showing how Facebook’s own OOM killer engages before the kernel does.

TIL: Facebook has a user-space OOM killer. I am not sure what I should think about that. 🤔 I have the feeling that Facebook runs machines very differently from Google although both companies work together in moving things forward.

Chris continued by showing how Facebook slices server resources, most importantly memory. They use cgroups and he showed some of their initial and improved cgroup hierarchies. I suggest watching the video once it is out because I couldn’t keep up with taking notes.

Cross Continent Infrastructure Scaling at Instagram

Speaker: Sherry Xiao (Facebook)

Details: Cross Continent Infrastructure Scaling at Instagram

I was a bit late to Sherry’s talk which I am sorry for. I missed the first 5 minutes and when I entered Sherry was already deep into sharded Cassandra databases.

Instagram migrates data between regional clusters when a user moves to a different continent. This is really nice! Out-of-region users (e.g. frequent travelers or digital nomads) are a pain for latency-critical databases. Jumping oceans on every request is expensive and frustrating.

Instagram uses counters to decide on cross-regional data migrations.

Forming a quorum in Cassandra depends on the replication factor: a quorum needs ⌊RF/2⌋ + 1 replicas to respond. With only so many data centers in a region it may be necessary to include out-of-region nodes in the quorum.

The EU Cassandra quorum contains a cluster in the U.S.

The talk was short but the slides were to the point and the delivery was really good. I enjoyed the talk very much although it brought back some unpleasant memories regarding my own encounters with Cassandra. 😂🙈

Software Networking and Interfaces on Linux

Speaker: Matt Turner (Native Wave)

Details: Software Networking and Interfaces on Linux

Spontaneously, I ended up in the Core Principles track once again. Matt started with all the basics on Ethernet, IP, interfaces, DHCP, bind() and accept(). I’m not going to write all that down since it has been discussed at length all over the Internet already.

It got interesting to me when Matt discussed how a process emits packets using software interfaces, such as a tun device (Layer 3). This is what one needs when building a VPN. Once the device is turned into a tap device it operates on Layer 2. This allows for more crazy stuff to be implemented, such as the highly elaborate IPoWAC protocol.

Next thing Matt showed was a br (bridge) device (a virtual switch) and its properties. Then the bridge was compared to the ovs (OpenVSwitch) device. Matt moved on to VMs and quickly discussed the virtio memory page-sharing virtual network device we often see in Linux KVM guests. Since this is 2019 and everything is containers now the next device we looked at was the veth virtual ethernet pair (note that veth devices always come in pairs and usually span network namespaces). Finally Matt wanted to make two containers talk to each other without using a bridge. The macvlan device did the trick because it’s simple. I’d argue it creates kind of a bridge although we don’t have to deal with STP and other exciting features. If you want to go really crazy there is always the ipvlan device. Think twice. 👆️
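Since veth pairs are the building block you meet most often in container setups, here is a small sketch of creating one programmatically. This is my own example and assumes the third-party github.com/vishvananda/netlink package; it does the same thing as ip link add veth-demo-a type veth peer name veth-demo-b and needs root on Linux.

```go
// Create a veth pair and bring both ends up (my own sketch, using the
// third-party netlink package as an assumption; requires root on Linux).
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	veth := &netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{Name: "veth-demo-a"},
		PeerName:  "veth-demo-b",
	}
	if err := netlink.LinkAdd(veth); err != nil {
		log.Fatalf("creating veth pair: %v", err)
	}
	// Bring both ends up; one end would typically be moved into a
	// container's network namespace afterwards.
	for _, name := range []string{"veth-demo-a", "veth-demo-b"} {
		link, err := netlink.LinkByName(name)
		if err != nil {
			log.Fatalf("looking up %s: %v", name, err)
		}
		if err := netlink.LinkSetUp(link); err != nil {
			log.Fatalf("bringing up %s: %v", name, err)
		}
	}
	log.Println("veth pair created and up")
}
```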

There was nothing new for me in this talk but it was a good talk going through and un-confusing Linux software networking. The Core Principles track sometimes is like rolling dice. Nevertheless, I think it is important we have this track and that we address all levels. After all, networking is an important, sometimes undervalued area of SRE expertise.

Must Resist…

The SREcon is hosted by the Suntec convention center. It’s a fancy place with spacious rooms and modern equipment. Its entrance features a huge wall of full HD screens.

A wall of hundreds of full HD screens asking me to make an impact.

I am curious how such a crazy amount of screens is being controlled. It seems I am not the only one who noticed the nearby WiFi network named “LG_signage”. Coincidence? 🤔 I don’t think so!

Anyway, this is SREcon. DEFCON is still two months away. I shall be a good citizen! But it is not easy. Look at this display! It is even bragging about itself!

The giant screen of screens asking to be played with…

Day Three

Phew! Day three comes to an end here in Singapore! It’s Friday evening which means the weekend is about to start. 🎉 Sorry European and American folks, you have a few more hours to go.

We had a generous dinner reception last night with highly attentive staff. That may have led to one or the other SRE exceeding their own SLO for beverages. At least that was my impression when I looked at the tail latency for showing up to the first talks of the day. 🤨 Anyway, let’s sum up what’s been hot today!

Getting More out of Postmortems and Making Them Less Painful to Do

Speaker: Ashar Rizqi (Blameless)

Details: Getting More out of Postmortems and Making Them Less Painful to Do

Blameless Inc. is rumored to be the new star in the SRE tools realm. Therefore I was interested to learn how they think about post mortems.

Ashar asked the audience to agree on the benefits of doing post mortems by show of hands:

Then Ashar moved on to case studies from Box, Digital Ocean, and Home Depot. And another few rounds of show of hands. He likes show of hands. ✋✋

The typical post mortem process as seen in Blameless’ case studies.

Who should own the post mortem?

How to ensure on-time completion?

I have some thoughts on this. Some of the suggested solutions are actually big bombs to set off. Escalating to C-level is something that one can pull only so often. Gamifying is always a risky move. People will play the game.

Who collects all the details in a timely manner?

According to Ashar the Incident Commander is responsible for pulling in all the information needed. Once again, a synchronous process via chat app is suggested.

How to track action items?

How to foster a blameless language?

I missed the remaining points because the next slide was up already. Presentation speed up? 🏎

What is still debated or unresolved?

Conclusion

I was thinking of skipping this talk initially. After all, everyone believes Google has figured out and perfected the post mortem process. However, I have learned about different ways and processes to create, own, and follow up on post mortems. Most interestingly, many companies seem to prefer synchronous tools like meetings or chat. I am more used to a primarily asynchronous process. It was also nice to see how post mortem and post mortem process ownership is handled at other companies. Big thanks to the audience for openly sharing their internal processes and structures!

The MTTR Chronicles: Evolution of SRE Self Service Operations Platform

Speakers: Jason Wik, Jayan Kuttagupthan, and Shubham Patil (VMware)

Details: The MTTR Chronicles: Evolution of SRE Self Service Operations Platform

Jason introduced the challenges that the SRE team at VMware is facing. They aim to reduce the MTTR (Mean Time To Recovery) in a landscape of diverse multi-environment infrastructure that looks very different for each customer. Basically our SRE nemesis: Complexity! 🤯

A complex environment.

Shubham continued by highlighting how the team approached the challenges. They created a platform called North Star:

An Extensible, Dynamic, and Collaborative platform to reduce MTTR and improve operational efficiency for unique and constantly changing environments.

They integrated different platforms into a single user interface:

However, these were basically tabs in a larger tool.

An engineer is overwhelmed by a multitude of tools and processes.

Jayan proceeded to show how they correlated the data from the different tabs (platforms) onto a single pane of glass. That is, correlating the data and providing a 360-degree view of incidents. By having alerts and on-going automation tasks correlated, an SRE can triage more quickly and thus reduce the MTTR. Having cause, symptom, and action presented in a single place helped them react faster. The UI was reported to be updated in real time and responsive. Also: integration with Business Intelligence and ticketing systems.

Unfortunately there were no screenshots or demos of the platform. I would have really loved to see what this wondrous platform looks like in practice. 😳

Building Centralized Caching Infrastructure at Scale

Speaker: James Won (LinkedIn)

Details: Building Centralized Caching Infrastructure at Scale

James is part of the Caching as a Service (CaaS) team at LinkedIn. Pretty cool: he used Slido’s live polling feature in his presentation. A very nerd-friendly way of involving the audience.

At LinkedIn, teams were frustrated with operating memcached. As a drop-in replacement they decided to use Couchbase. Then the usage exploded to over 2000 hosts in production, managed by different teams. That caused some problems:

LinkedIn decided to create a caching team that manages caching in a centralized way at scale. The team had three main goals:

The CaaS team now provides:

Depending on the use case, the caching may be backed by SSD, HDD, or pure memory. The team also provides dashboards and metrics to the teams. Furthermore, as the caches are fully managed, the CaaS team takes care of OS and software updates. However, they don’t want to own the data and they do not own backups. The teams know best what is acceptable for their data and therefore own it themselves. The CaaS team runs about 2000 hosts serving over 10 million QPS across multiple clusters.

One of the slido questions used by James to engage the audience.

Among the challenges they encountered were GDPR and configuration management. To deal with the configuration challenge they created a wrapper around Couchbase. The wrapper was built in a way that it could be plugged into the existing deployment process and tools. Another challenge was running Couchbase as a non-root user which, at that time, was not supported by Couchbase. James reports significant code changes and an overhaul of the deployment process. They were able to pull it off without customers noticing. Nice!

Lessons learned:

The next step for the CaaS team is to provide a self service for creating caching buckets. Very good talk that provided interesting insights.

Hybrid XFS - Using SSDs to Supercharge HDDs at Facebook

Speaker: Skanda Shamasunder (Facebook)

Details: Hybrid XFS - Using SSDs to Supercharge HDDs at Facebook

Skanda started strong by claiming: A stupidly simple solution that looks risky on the outside can greatly improve performance.

The IO Wall

Disks keep gaining bytes, but because the number of actuator arms stalls, the number of disk seeks remains the same. Meanwhile workloads are getting hotter (ML, video streams). That means: today we buy more disks to get more IO, not to get more storage.

The IO Wall is defined as the point where the IO and capacity boundaries cross.

Opportunity

Facebook uses XFS. They noticed an interesting number of metadata writes: about a quarter of the IO was spent on metadata writes. Then they stumbled upon XFS’s real-time mode, a little known feature that puts data and metadata on different devices. So they thought: why not put metadata on SSDs, where IOPS are cheap, and data on HDDs, where bytes are cheap? That’s just what they did in an experiment. The experiment went really well. They nearly eliminated random writes. Now they can use bigger disks and fully utilize them.

Data goes to spinning disk while metadata ends up on SSD.

At this point in the talk Skanda fooled the audience. No spoilers, go watch the recording once it is out!

Risks

SSD failures? With the metadata lost they would also lose the data. And what if the workload changes? Is it worth rolling out such a fundamental change to the fleet?

Analysis

SSDs die less often than HDDs. But when they do, they take the HDDs with them because the metadata is essential. What if multiple SSDs die at once? The solution is to replicate the metadata. They also ran endurance tests that overall went well. The next item to analyze was whether buying SSDs to better utilize the HDDs was a good business move. It turns out to be a money saver because they do not have to buy thousands of new disks just to increase IO.

Rollout

How to roll this change out to tens of thousands of hosts? Carefully and with automation!

Lessons learned

Extending a Scheduler to Better Support Sharded Services

Speaker: Laurie Clark-Michalek (Facebook)

Details: Extending a Scheduler to Better Support Sharded Services

Sharded services refers to workloads that need access to a shard of data. It took me a while to get used to what scheduling means in this context. I am used to a different mode of running services. A scheduler at Facebook seems to be responsible for scheduling and de-scheduling tasks on hardware hosts.

The central piece is a scheduler that knows about machine health, updates, and upcoming maintenance. The scheduler, however, has to ask the service if it can proceed with the planned scheduling operation. The service might decline being re-scheduled.
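Here is how I picture that handshake, sketched in Go. This is purely my own interpretation and naming, not Facebook’s scheduler API: the scheduler proposes a move, and the service owner’s code gets to veto it.

```go
// A sketch of a scheduler/service handshake (my own naming and interpretation):
// before the scheduler moves a task, it asks the owning service whether the
// move is acceptable right now.
package main

import "fmt"

// MoveRequest describes a planned scheduling operation on one task.
type MoveRequest struct {
	TaskID string
	Reason string // e.g. "kernel update", "rack drain"
}

// ShardedService is implemented by service owners; the scheduler consults it
// before acting and may defer the operation if the service declines.
type ShardedService interface {
	ApproveMove(req MoveRequest) bool
}

type scheduler struct{ svc ShardedService }

func (s *scheduler) maybeMove(req MoveRequest) {
	if !s.svc.ApproveMove(req) {
		fmt.Println("deferring move of", req.TaskID)
		return
	}
	fmt.Println("moving", req.TaskID, "because of", req.Reason)
}

// alwaysApprove stands in for a service that never blocks the scheduler.
type alwaysApprove struct{}

func (alwaysApprove) ApproveMove(MoveRequest) bool { return true }

func main() {
	s := &scheduler{svc: alwaysApprove{}}
	s.maybeMove(MoveRequest{TaskID: "shard-42", Reason: "rack drain"})
}
```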

Trivia: At lunch Laurie and I talked about engineering culture at Facebook. I asked him if he had a single emoji to describe Facebook what it would be. My guess: 🤠 His actual answer: 🙃

I have the feeling that the scheduler’s design is heavily influenced by Facebook’s engineering culture, which gives service owners significant freedom and say, down to the hardware level.

In the end it is about a trade-off of power. Should service owners be allowed to block a task migration? Should they be able to block rack or data center drains? Or do you want a scheduler that always wins?

The overall question may be: Can we make schedulers aware of sharded services and their special needs? Maybe even in Kubernetes? Do we want this at all?

Yes, No, Maybe? Error Handling with gRPC Examples

Speaker: Gráinne Sheerin (Google)

Details: Yes, No, Maybe? Error Handling with gRPC Examples

I was looking forward to this talk because I have stumbled over gRPC error handling in the past. If a service’s response doesn’t equal OK, it gets interesting. gRPC has ~16 status codes to indicate errors. Only a few of them can be issued by the auto-generated code and the gRPC library. All status codes can be used by the application developer to communicate the reason for an error.

Gráinne walked us through a couple of interesting error cases. For example, DEADLINE_EXCEEDED can mean a request never reached the server. It can also mean everything worked but the response came in too late at the client stub. Then the client stub overrides the OK status code with DEADLINE_EXCEEDED. You can’t use the response and you wasted resources on the server side. Mind blown. 🙀 I think I never thought about the stubs really. I just assumed they are there and transparent.

A better approach is to check the deadline on the server side and cancel the request there if it can’t possibly be served in time. It gets more confusing if the server sets the status code to DEADLINE_EXCEEDED instead of CANCELLED. I feel like debugging this can be painful. If metrics are used for troubleshooting, one should be aware which stub the specific metrics were collected at.
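Here is a minimal sketch of that server-side check in Go. The Lookup RPC, the pb package, and the 50 ms lower bound are all made up for illustration; the pattern is simply: look at how much of the client’s deadline is left and fail fast if real work can no longer finish in time.

```go
// Server-side deadline check, sketched with a hypothetical Lookup RPC and
// generated pb package: fail fast instead of computing a response the
// client stub would throw away anyway.
package server

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	pb "example.com/lookup/proto" // hypothetical generated package
)

const minProcessingTime = 50 * time.Millisecond // assumed lower bound for real work

type server struct {
	pb.UnimplementedLookupServer
}

func (s *server) Lookup(ctx context.Context, req *pb.LookupRequest) (*pb.LookupResponse, error) {
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) < minProcessingTime {
		// Not enough budget left: return early so no server resources are
		// wasted on a response that would arrive too late at the client.
		return nil, status.Error(codes.DeadlineExceeded, "not enough deadline left to serve request")
	}
	// ... do the actual work ...
	return &pb.LookupResponse{}, nil
}
```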

TL;DR:

Ethics in SRE

Speakers: Laura Nolan (Slack) and Theo Schlossnagle (Circonus)

Details: Ethics in SRE

History time: Civil engineering had its fair share of disasters that killed people. Engineers were the ultimate experts on the things they were building, and as professionals they started caring about doing things right. Laura and Theo argue our profession is at a similar point today. We have to start thinking more about serving society and demanding ethical standards for the work we do. Similar to civil engineers not building bridges whose structure is too weak to carry the load, not even when their employers demand it.

Theo: There are doctors and lawyers who are not allowed to practice anymore because they violated ethical standards. This is what maintains trust between the profession in general and society.

Laura: A computer system is not like a bridge, ethically speaking. It is even more complex. Rationale: we can’t easily inspect computer systems. Not that bridges are easy to inspect, but they are at least physically accessible for inspection. Complex computer systems are ever-changing, and inspection seems infeasible.

Theo: We are, as an industry, in our thirties but we are behaving like 4-year-olds.

They discussed a bunch of examples that I think are best experienced first hand. So instead of writing it all down I allow myself to redirect the reader to the video recording (to be released in a couple of weeks).

It’s an incredibly important discussion we have to have. Otherwise we end up being tightly regulated or lose our creative freedom.

Related: Tanya Reilly’s talk on The History of Fire Escapes at SREcon Americas 2018.

It has been decided!

In the last couple of weeks the community discussed what a good collective noun for a group of SREs would be. Obviously, a rant of SREs is the best option. But since we like democracy we had a vote on it. The winner is a cluster of SREs. I guess that is democracy: having a reliable second-best option and running with it.

The rant not being the most favored collective noun for a group of SREs.

Conclusion

This was another great SREcon! Thanks fly out to the program chairs, the program committee, the speakers, the room captains, the helping hands in the background, and the companies that encouraged their employees to share content and supported with travel and diversity grants. Thank you!

I had interesting conversations and now look forward to some chillaxing before I leave again tomorrow. I’ll miss Singapore!

Today it was raining again. Please have a skyline picture from two days ago when we still had sun.