SREcon Asia/Pacific 2019: Day One

Good Morning, Europe! Please enjoy my SREcon Asia/Pacific 2019 day one conference notes, fresh off the press from beautiful Singapore. This year I signed up for volunteering at the Google booth and for serving as room co-captain at a couple of sessions. The reports will probably be a bit shorter than last year’s.

Opening

SREcon APAC is growing like crazy! The program co-chairs Frances Johnson (Google) and Avleen Vig (Facebook) opened the event welcoming attendees and sponsors. They also provided data:

  • Over 580 attendees!
  • Attendees from 24 countries and 125 companies
  • The program committee consisted of 24 members
  • 73 speakers were selected to give one or more talks

It’s always interesting how attendees self-identify regarding their role. This time we have 58% engineers and 27% managers, while the rest identify as tech lead or other. Last year I was in the role of tech lead, but I have changed sides and now count toward the growing number of managers who are interested in SRE.

The organizers tried something new this year: Slido table topics. That is, attendees are invited to suggest and vote on topics. The most popular ones are printed on paper and put on some of the lunch tables. These tables are thus designated for people interested in the table topic. This helps kickstart conversations in a birds-of-a-feather kind of way.

This year’s theme is Depth of Knowledge. Our industry is expanding, we deal with increasing complexity over time, and we need to strive harder to understand how our systems work.

The opening session room before it filled up.

A Tale of Two Postmortems - A Human Factors View

Speaker: Tanner Lund (Microsoft, Azure SRE) with help from role-playing actors whose names I did not catch, but they should be on the recording.

Details: A Tale of Two Postmortems

First role play: A postmortem discussion. It was more like an interrogation around human error and how to prevent it in the future (monitoring, playbooks, slowing down releases, rethinking whether automation was the right choice).

Second role play: A conversation focused on what happened, how the on-caller felt, what information they had at the time, and how that led to the conclusions that were drawn. This surfaced alert fatigue and also that the way the systems are run is exhausting for the on-callers. To me, some of the processes also looked sub-optimal. By changing the way people talk about what happened, these very interesting facts were discovered. The overall impression of the second role play was that it provided much more psychological safety.

My key takeaways:

  • Learn from what happened. It is not primarily about how to prevent this specific type of incident in the future, although that will probably be a fortunate side effect of whatever action items are agreed upon.
  • We assume everyone’s decisions made perfect sense at the time they were made.
  • Don’t jump to conclusions or action items right away.

Availability: Thinking beyond 9s

Speaker: Kumar Srinivasamurthy (Microsoft, Bing)

Details: Availability—Thinking beyond 9s

Any large system has outages. Kumar gave some examples, of which we as an industry have no shortage.

Bing had an incident in which the product did not present the actual search results but only sponsored results (ads). That does not really count as a gracefully degraded response, because serving only ads has the potential to scare users away.

Question from the speaker: What is a good availability number? Audience: It depends! Smart 😅

First example: Parking meter. What availability is really needed? It’s likely enough if it works on weekdays. It should reliably collect money. Occasional downtime for maintenance does not hurt much.

Next example: Flying. Given 100k flights a day, what availability do we need? Three nines (99.9%) would result in multiple disasters per day. Even five or six nines would still mean too many failures. Interesting, I never thought about how many nines flying has. Of course, this example is too simplified. But it serves as a good starting point for thinking about nines.
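To put numbers on the flying example, here is a small back-of-the-envelope sketch. It is my own arithmetic, not something shown in the talk, and it assumes the simplified model from above: 100k flights a day, and every "unavailable" flight counts as a failure. The function name is made up for illustration.

```python
FLIGHTS_PER_DAY = 100_000  # figure quoted in the talk


def expected_failures_per_day(nines: int, flights: int = FLIGHTS_PER_DAY) -> float:
    """Expected number of failed flights per day at `nines` nines of availability.

    Simplified model: an availability of e.g. three nines (99.9%) means
    one flight in 10**3 fails.
    """
    return flights / 10 ** nines


for nines in (3, 5, 6):
    failures = expected_failures_per_day(nines)
    print(f"{nines} nines: {failures:g} failed flights per day")
```

Under these assumptions three nines means 100 failed flights a day, five nines still means one per day, and six nines means one roughly every ten days.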

The talk progressed looking into typical challenges of measuring and monitoring availability and reliability. I found most points rather high level and think they are already covered in the SRE books.

Fun quote: Adding nines since 2009!

Use Interview Skills of Accident Investigators to Learn More about Incidents

Speaker: Thai Wood (Resilience Roundup)

Details: Use Interview Skills of Accident Investigators to Learn More about Incidents

My first thought was: Lamp-in-the-face-style interrogation. Immediately Thai made clear this is not about interrogation! It is about respectfully interviewing people to understand an incident. Research has shown that it makes a significant difference how you ask questions.

  • Who should drive the interview? It’s the person who is being interviewed. They were there, they know more than the interviewer. They should be speaking most of the time.
  • Expectations: When you ask people something they want to give an answer, which puts some pressure on answering even when the facts are not well remembered. A better technique would be to give people space to tell their story and make clear that there is no specific expectation. This is reported to work surprisingly well.
  • Question type: Obviously you want to ask open-ended questions.
  • How much do you want to know? Ask: Tell me everything even if it is out of order or seems irrelevant! There is a ton of value in the details even if they seem unrelated. You can sort out later what was really relevant.
  • Don’t interrupt! Should be obvious but seems to still happen.
  • Don’t follow a template. Don’t use a form. It will constrain what people will tell you.
  • Adapt the questions to the person being interviewed as the interview progresses.
  • Make the setting safe and encourage people to say “I don’t know” if they don’t know. Otherwise they will (most likely unintentionally) make up answers.

Leading without Managing: Becoming an SRE Technical Leader

Speaker: Todd Palino (LinkedIn)

Details: Leading without Managing: Becoming an SRE Technical Leader

Todd shared his story of how he became a technical leader despite having spent about a decade at VeriSign. His words, my interpretation. 😇

There is nothing wrong with banging out code.

At LinkedIn, he reported, they have a clear way of measuring engineers. You might get a promotion on the back of one or two projects. Or because of tenure. They have clearly defined titles and what they mean. The natural progress is chasing titles, but not in a bad way. Interesting.

He pointed out that being a senior staff engineer means something different at other companies. There is no formal definition that spans the industry.

Here’s what LinkedIn has come up with:

  • Junior SRE: Is task driven. Completes a task, gets the next task, etc.
  • SRE: Has a bit more autonomy, develops automation, shares knowledge.
  • Senior SRE: Is a stack expert. Drives design and collaboration. Mentors and probably receives mentorship themselves.
  • Staff SRE: Moving into leadership. Has deep knowledge about multiple stacks. Thinks about what comes up next. Supports their manager to focus on people management by being the technical right hand to them.
  • Senior Staff SRE: Seasoned Staff SRE with even more oversight and responsibilities.

The three pillars of Engineering at LinkedIn:

  • Execution: Getting Things Done
  • Craftsmanship: Doing Things Well
  • Leadership: Enabling Others to Get Things Right

Then digging into leadership! It’s hard to determine if someone is a good leader. Roles are increasingly less clearly defined when one moves into leadership. Some say: You define your role! How helpful. 🙃

Find your happy path. What do you like? Is it management meetings and moving things forward? Is it cranking out code? Is it thinking about where technology should be heading? What kind of leadership do you want for yourself?

One way to find out could be project work which comes with the following aspects:

  • Engagement: Coordinating multiple teams
  • Impact: Change your organization
  • Inception: Who came up with the idea?
  • Process: Not all projects are technical

This talk was so full of content that I had a hard time following and taking notes at the same time. Here are some unsorted items from the talk:

  • Mentoring: As a leader you want (I’d even say: need) to be a mentor and you want to be mentored yourself.
  • Look outside the industry. Find a Toastmasters chapter nearby and practice public speaking.
  • Meetups are also a place where you can practice talks and sharing knowledge. Also builds up a network.
  • Get more exposure and practice: Ignite talks, lightning talks, writing a blog, speak at conferences, write articles for magazines.
  • Short-form publishing. That is, writing very short books.
  • Engage in diversity, inclusion, and ethics.
  • How to identify meaningful work? Whenever you hear the phrase Someone should… you could be that someone!
  • Keep a paper trail. Document your projects, comment your code, turn tribal knowledge into artifacts that others can use, keep a list of your accomplishments (also helpful for promotion).
  • Stay organized. It’s not about having every minute of your day being planned out. But having a rough idea where you want to spend your time is advised.

Can’t help myself but I have to point out that I found it incredibly hard to read some of the slides. The colors were not the best choice. 🙈 Nevertheless, good talk and Todd is a great speaker!

Talks from LinkedIn being at the forefront of SRE and culture are becoming a recurring pattern. These folks really know about and reflect on their craft.

We were served colorful, tasty food.

Let’s Build a Distributed File System

Speaker: Sanket Patel (LinkedIn)

Details: Let’s Build a Distributed File System

Sanket first explained how basic file system operations such as listing, reading, and writing are related to inodes. Then he went on to design a simple distributed file system. Distributed in this context means running one file system in a fault-tolerant way across multiple networked hosts with disks attached. From my experience the crucial point in designing such a system is locking and consistent metadata management. Sanket solved this by introducing a master server that serves all metadata and minion servers that store the data blocks. Clients talk to the master and the minions to perform file system operations.

I’d argue that this is not really solving the problems we usually face when dealing with distributed file systems. The master, for example, poses a risk because it is a single point of failure. The server is threaded but I am missing locking and synchronization. Data consistency is not guaranteed by this design. On the other hand, Sanket warned in the talk description that this would be a very basic implementation. After all, this talk appeared in the Core Principles track, which is designed for exactly this type of foundational education.
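To make the master/minion split a bit more tangible, here is a minimal in-process sketch of the idea. To be clear: this is not Sanket’s code. All class names, the block size, and the round-robin placement are my own assumptions for illustration. There is no network, no replication, and no locking, so it has exactly the weaknesses discussed above.

```python
import uuid

BLOCK_SIZE = 4  # assumption: tiny on purpose, so small files span several blocks


class Minion:
    """Stores raw data blocks by block id. Knows nothing about files."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def put(self, block_id, data):
        self.blocks[block_id] = data

    def get(self, block_id):
        return self.blocks[block_id]


class Master:
    """Holds all metadata: which blocks make up a file and where they live."""

    def __init__(self, minion_names):
        self.minion_names = sorted(minion_names)
        self.files = {}  # path -> list of (block_id, minion_name)

    def allocate(self, path, size):
        num_blocks = -(-size // BLOCK_SIZE)  # ceiling division
        layout = [
            (uuid.uuid4().hex, self.minion_names[i % len(self.minion_names)])
            for i in range(num_blocks)  # round-robin placement (assumption)
        ]
        self.files[path] = layout
        return layout

    def lookup(self, path):
        return self.files[path]


class Client:
    """Asks the master for metadata, then talks to the minions for data."""

    def __init__(self, master, minions):
        self.master = master
        self.minions = minions  # minion_name -> Minion

    def write(self, path, data):
        layout = self.master.allocate(path, len(data))
        for i, (block_id, minion_name) in enumerate(layout):
            chunk = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
            self.minions[minion_name].put(block_id, chunk)

    def read(self, path):
        layout = self.master.lookup(path)
        return b"".join(self.minions[name].get(bid) for bid, name in layout)


minions = {name: Minion(name) for name in ("minion-a", "minion-b")}
client = Client(Master(minions), minions)
client.write("/hello.txt", b"hello, world")
print(client.read("/hello.txt"))
```

Even in this toy version the master is the only place that knows where a file’s blocks live: lose its dictionary and the data on the minions is just anonymous blocks.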

The source code is public and described on Sanket’s blog. Great work, Sanket!

Shipping Software with an SRE Mindset

Speaker: Theo Schlossnagle (Circonus)

Details: Shipping Software with an SRE Mindset

Circonus uses C for high performance applications. One of the challenges (like, next to debugging an application written in C) is to expose metrics. Naturally, they had to write their own metrics library that works concurrently and fast.

Random notes:

  • When in doubt: Expose telemetry!
  • A typical request can produce between 10 (ten) and 2000 (two thousand) spans from distributed tracing. Quite a variance! But the traces are not stored very long at Circonus.
  • Traces are used with reduced details due to performance. A full trace can be generated on request and might yield up to 4 GB of highly detailed debugging dump. Someone then has to dig through it. Fun!
  • Surprising approach: Tracing data is produced in memory and then pushed into a message queue. I haven’t really understood why this is superior but it does give systems the ability to subscribe to traces. 🤔
  • Theo is not a big fan of logs. He thinks logs should be designed for humans unless they are purely machine to machine communication.
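The one benefit of the message queue approach I did take away, that other systems can subscribe to traces, can be sketched in a few lines. This is only my illustration of the general publish/subscribe idea using in-process queues. It is not how Circonus implements it, and the class name, span fields, and drop-when-full policy are all assumptions.

```python
import queue


class SpanBus:
    """Fan out finished spans to any number of subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, maxsize=1000):
        """Register a new consumer and return its private queue."""
        q = queue.Queue(maxsize=maxsize)
        self._subscribers.append(q)
        return q

    def publish(self, span):
        """Hand a span to every subscriber without ever blocking the producer."""
        for q in self._subscribers:
            try:
                q.put_nowait(span)
            except queue.Full:
                # Assumption: dropping a span is better than slowing down
                # the request path that produced it.
                pass


bus = SpanBus()
debugger = bus.subscribe()
bus.publish({"name": "db.query", "duration_ms": 12})
bus.publish({"name": "render", "duration_ms": 3})

while not debugger.empty():
    print(debugger.get_nowait())
```

The producer only ever talks to the bus, so a new consumer (a debugger, a sampler, a long-term store) can be attached later without touching the code that emits spans.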

The talk included code samples. It was good to see and think some C for a change. Go has taken over basically all my coding and I don’t even read C code anymore. 😢 Good ol’ C.

An unrelated herd of emojis trapped in a Singapore mall.

Networking at the Google Booth

An interesting experience for someone who is more of an introvert in face-to-face conversations. I talked to people who approached our booth and learned how many companies are out there building products that are similar to Google products and also very successful. Getting myself out of Europe every now and then continues to be a great experience.

I also had some non-technical conversations. I had the chance to once again share my theory of the incompleteness of the German language. For example, we have an adjective for being hungry (hungrig) and one for not being hungry anymore after eating something (satt). However, for being thirsty (durstig) there is no such opposite adjective. We tried to make one up twice. The first time we could not decide on a word. The second time the word (sitt) was not really accepted by the wider population.

Anyway, I did let my mind wander… What an exhausting first day. Stay tuned for tomorrow!


