Welcome, welcome to another day in Singapore full of interesting talks at SREcon Asia/Pacific 2019. Here’s the gist of what I listened to, talked about, and learned today. I spent more time in technical talks today; I needed a bit of a deep dive to contrast the high-level topics from yesterday. The inner engineer is strong with me. 👷‍♂️ I am glad SREcon offers both! Kudos to the program committee.
The Early Bird
Speaker: Me (Google) about work that predates my joining date and is unrelated to my employer.
Details: Implementing Distributed Consensus
I shall not toot my own horn; I’ll let others decide whether it was worth their time. Here’s the source code for your interest: The Skinny Distributed Lock Service
The Skinny Distributed Lock Service in action.
Edge Computing: The Next Frontier for Distributed Systems
Speaker: Martin Barry (Fastly)
Martin pointed out that the talk was about personal observations and not related to Fastly.
Martin wants to spark a discussion about Edge Computing within the community and shared his thoughts and definition.
Executing non-trivial functionality as close to the client as is reasonable.
Martin presented a hierarchy of where a request can be served from, ranging from the Origin (e.g. a non-cached original response) down to the Client itself:
- Continental Regions
- Data centers in major cities (carrier-neutral, Internet Exchanges)
- Internet Service Providers (ISP)
- Last mile
One challenge is that subsequent requests can end up at different levels of that hierarchy, or at different serving processes on the same level. I assume cache management is challenging in such setups. Martin says scalable solutions should depend on as little state as possible.
Typical applications for Edge Computing are:
- Request normalization
- Authentication or Paywall
- Vary by user agent
- A/B Testing
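Such edge functions tend to be small and stateless. Here is a toy Python sketch of two of them, request normalization and deterministic A/B bucketing; the function names and parameter choices are mine, not from the talk:

```python
import hashlib

def normalize(path: str, query: dict) -> str:
    """Collapse equivalent request variants so they hit one cache entry."""
    # Lower-case the path, drop tracking parameters, sort the rest.
    kept = sorted((k, v) for k, v in query.items() if not k.startswith("utm_"))
    qs = "&".join(f"{k}={v}" for k, v in kept)
    return path.lower().rstrip("/") + ("?" + qs if qs else "")

def ab_bucket(user_id: str, experiment: str, buckets: int = 2) -> int:
    """Deterministic A/B assignment: hashing means no state at the edge."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets

print(normalize("/Article/42/", {"utm_source": "mail", "lang": "en"}))
# → /article/42?lang=en
print(ab_bucket("user-123", "new-checkout"))
```

Because the bucket is a pure function of user and experiment, every edge node assigns the same user to the same variant without coordination, which fits Martin’s “as little state as possible” advice.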
Interesting: How does one do Edge Computing anyway? I learned that the runtime is often provided by the entity running the edge cache/compute resources. I wasn’t aware of that and somehow thought I would own the full stack. Common practice is using Domain Specific Languages (DSLs) or containers, with containers leading to multiple copies of the same data in memory, which is wasteful.
The new hotness is these two fellas:
- WebAssembly (WASM)
- WebAssembly System Interface (WASI)
Open questions remain, though:
- Continuous integration, deployments, rollbacks, and of course testing on the edge. The hardest part seems to be load testing: how do you load test something that is far away from the origin but closer to the client?
- How to integrate provider metrics into your own metric processes and infrastructure?
- Did I mention distributed tracing?
- External health checks. They are usually run from well-connected data centers and not from the far edge. Oh my!
It totally makes sense! Why not push WASM to whatever runtime is provided? It could even be the client’s browser if the application allows that. Very informative talk! 👍
Critical Path Analysis - Prioritizing What Matters
Speaker: Althaf Hameez (Grab)
Althaf asks: do we have a subset of services running that is critical to our core business? The definition of core business here being:
Any impact on our system that impacts the ability of a passenger to get a car to get safely from Point A to Point B.
It may not have all the bells and whistles of the fully fledged experience but it gets the core business done.
Apparently, cash is still a thing where Grab operates (Althaf: “Cash is King”). That means cashless payments are not really in the critical path, which ruled out a number of services as non-critical. If you have ever worked with payment processing, you know how much complexity can hide in there.
To validate the critical path they ran it in the testing environment first, then in production. Scary? Kind of, but Grab operates in Southeast Asia only and therefore sees a pretty cyclic usage pattern. They were able to test their hypothesis in production relatively safely at night.
On an organizational level, Althaf sought executive sign-off for the riskier things and involved the product teams early. On a technical level, circuit breakers did the job. Caching helped to degrade gracefully; for example, geofences and city details rarely change and can be safely cached for a while.
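A circuit breaker is simple enough to sketch. Here is a minimal Python version; this is not Grab’s implementation, and the names and thresholds are made up:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a while."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures   # failures before the circuit opens
        self.reset_after = reset_after     # cool-down in seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While open, short-circuit to the fallback until the cool-down expires.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        return result
```

Pair the fallback with a cached value, e.g. the last known geofences, and you get the graceful degradation described above.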
Sometimes, Althaf reports, it is OK to fail. For example, a ride usually ends with the passenger seeing a dialog to rate the driver. However, when the rating service was failing, the dialog would not go away, and with the dialog in the way the passenger could not book a new ride. While rating a driver is an important feature, it is not essential enough to block new bookings. Nowadays the Grab app continues to let you book rides even if the rating failed. It shows what a critical path analysis can surface. In hindsight it sounds obvious, but we all know how often our systems surprise us with things that should be obvious, right? 😲
A karaoke booth trying to lure me away from my critical path to SREcon. I love Asia.
Collective Mindfulness for Better Decisions in SRE
Speaker: Kurt Andersen (LinkedIn)
I only recently started meditating (can recommend) which means I am still at the beginning of the mindfulness journey. Mindfulness has become an interesting and helpful tool for me. So my expectations were quite high on Kurt’s talk. I was not disappointed. 🤩
The concept of collective mindfulness has emerged in research over the last 20 years. Kurt kicked it off by giving examples of mindlessness, which he defined as automatic and reactive behavior, like staring at the same dashboards again and again and reacting the same way. There is a lot of potential for mindlessness in SRE.
Kurt described the characteristics of the environment we work in as VUCA: volatile, uncertain, complex, and ambiguous.
Mindfulness, in contrast, is being able to reflect on our behavior and identify improvements in how we react to VUCA.
Collective Mindfulness: Capability to discern discriminatory detail about emerging issues and to act swiftly in response to these details.
One SRE-related aspect of collective mindfulness is the reluctance to accept simple solutions. I can think of “it must have been a network blip” as a common explanation that some teams use instead of properly investigating an issue.
The aspects include:
- A preoccupation with failure: preventing failures by focusing on discovering incipient failures and their components.
- A reluctance to simplify interpretations (see above example)
- A sensitivity to operations: recognizing that a solution to one problem may create another, which is why process-wide measurement is essential.
- A commitment to resilience: we make things better so we do not get paged again for the same reason.
- A deference to expertise: recognizing the expertise of the people running things, not necessarily the developers or architects.
To continue practicing collective mindfulness, a team needs to know where to look instead of looking at everything. I sense this is something we need intuition for?!
Here’s what to do:
- Focus on failure
- Refusal to simplify
- Staying attuned to operations
From the military we learned about the STICC model of communication:
- Situation: Here’s what I think we face
- Task: Here’s what I think we should do
- Intent: Here’s why
- Concern: Here’s what we need to watch
- Calibrate: Now talk to me
Kurt called this form of communication a ritual and emphasized that it is a useful tool for de-escalating a stressful situation. It fits in well with other rituals we have in SRE, like writing postmortems or following up on action items by setting up project work with a clear goal. From my time serving as an army officer, I can confirm that ritualized communication can be a great tool in the right situations. The more it is practiced, the better it works when things go so wrong that there is no time to question the general approach.
Dangers to collective mindfulness are our IT systems themselves, when they make us perform routine actions or come with hard-to-understand automation.
Linux Memory Management at Scale: Under the Hood
Speaker: Chris Down (Facebook)
At the beginning of the talk Chris went over fundamentals on Linux resource management such as cgroups. Resource management is a tricky business. For example, if you memory-limit an application too much you basically transform memory into disk I/O. Not really a win. Chris claims even seasoned SREs often have misconceptions about how memory at scale works.
After ranting about the resident set size as a metric, Chris came to the conclusion that unless we heavily measure and instrument an application, we cannot really tell how much memory it uses.
Swap, however, deserves a better reputation than it has. It is not emergency memory, although it is often seen as that. Having swap, he argues, is like running `make -j<cores+1>`: putting a little bit of pressure on memory to really squeeze the last bit of performance out of it. Without swap there would still be disk I/O, e.g. file system cache pages being evicted by writing them to disk.
The OOM killer shall not be trusted. It is always late to the party, and it does not really know what to kill, which means it may go to the wrong party. To avoid this, the Linux kernel (via kswapd) tries to reclaim the coldest pages first. Some pages may not be reclaimable that way; swap can help here by moving them to disk temporarily.
Quiz for the audience: What metric do you look at to find out if there is a memory resource issue? Audience ideas:
- Disk I/O
Chris thinks memory pressure is a much better metric. I agree. I’m a big fan of memory pressure and most of the time it is the only metric I have to look at to rule out memory issues in a system. 👀
Operating systems have many consumers of memory: user allocations, file caches, network buffers, etc. Memory pressure happens when there is a shortage of memory. It represents the work that Linux (or any other OS) does in order to manage and shuffle memory around to satisfy the system’s many users.
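On newer kernels (4.20+), memory pressure is exposed via the PSI files under `/proc/pressure`. Here is a small Python sketch parsing that format; it runs against a sample string so it works anywhere, with the field layout following the documented PSI output:

```python
def parse_psi(text: str) -> dict:
    """Parse /proc/pressure/memory style output into nested dicts."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # "some" or "full", then key=value pairs
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

# Sample in the format of /proc/pressure/memory (values invented).
sample = (
    "some avg10=1.23 avg60=0.87 avg300=0.50 total=123456\n"
    "full avg10=0.10 avg60=0.05 avg300=0.01 total=9876\n"
)
psi = parse_psi(sample)
# "some" = share of time at least one task stalled on memory;
# "full" = share of time all non-idle tasks stalled simultaneously.
print(psi["some"]["avg10"], psi["full"]["avg60"])
```

On a real system you would read `open("/proc/pressure/memory").read()` instead of the sample string and alert on the `avg10`/`avg60` averages.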
A graph showing how Facebook’s own OOM killer engages before the kernel does.
TIL: Facebook has a user-space OOM killer. I am not sure what I should think about that. 🤔 I have the feeling that Facebook runs machines very differently from Google although both companies work together in moving things forward.
Chris continued by showing how Facebook slices server resources, most importantly memory. They use cgroups, and he showed some of their initial and improved cgroup hierarchies. I suggest watching the video once it is out, because I couldn’t keep up with taking notes.
Cross Continent Infrastructure Scaling at Instagram
Speaker: Sherry Xiao (Facebook)
I was a bit late to Sherry’s talk which I am sorry for. I missed the first 5 minutes and when I entered Sherry was already deep into sharded Cassandra databases.
Instagram migrates data between regional clusters when a user moves to a different continent. This is really nice! Out-of-region users (e.g. frequent travelers or digital nomads) are a pain to latency-critical databases. Jumping oceans on every request is expensive and frustrating.
Instagram uses counters to decide on cross-regional data migrations.
Forming a quorum in Cassandra depends on the replication factor. With only so many data centers in a region, it may be necessary to include out-of-region nodes in the quorum.
The EU Cassandra quorum contains a cluster in the U.S.
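The quorum arithmetic behind that is simple majority math: a QUORUM read or write needs floor(RF/2) + 1 replica acknowledgements, where RF is the replication factor. A quick Python illustration:

```python
def quorum(replication_factor: int) -> int:
    """Cassandra QUORUM: a majority of replicas must acknowledge."""
    return replication_factor // 2 + 1

# With RF=3 a quorum is 2 nodes, with RF=5 it is 3. If a region hosts
# fewer replicas than the quorum size, out-of-region nodes must
# participate, which is the situation shown in the figure above.
for rf in (3, 5):
    print(rf, quorum(rf))
```

That is why every cross-ocean replica in the quorum directly adds a round trip to latency-critical requests.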
The talk was short but the slides were to the point and the delivery was really good. I enjoyed the talk very much although it brought back some unpleasant memories regarding my own encounters with Cassandra. 😂🙈
Software Networking and Interfaces on Linux
Speaker: Matt Turner (Native Wave)
Spontaneously, I ended up in the Core Principles track once again. Matt started with all the basics on Ethernet, IP, interfaces, DHCP, and `accept()`. I’m not going to write all that down since it has been discussed at length all over the Internet already.
It got interesting to me when Matt discussed how a process emits packets using software interfaces, such as a `tun` device (layer 3). This is what one needs when building a VPN. Once the device is turned into a `tap` device, it operates on layer 2. This allows for more crazy stuff to be implemented, such as the highly elaborate IPoWAC protocol.
The next thing Matt showed was a `br` (bridge) device, a virtual switch, and its properties.
Then the bridge was compared to the `ovs` (Open vSwitch) device.
Matt moved on to VMs and quickly discussed the `virtio` memory page-sharing virtual network device we often see in Linux KVM guests.
Since this is 2019 and everything is containers now, the next device we looked at was the `veth` virtual Ethernet pair (note that `veth` devices always come in pairs and usually span network namespaces).
Finally, Matt wanted to make two containers talk to each other without using a bridge. The `macvlan` device did the trick because it is simple. I’d argue it creates a kind of bridge, although we don’t have to deal with STP and other exciting features. If you want to go really crazy, there is always the `ipvlan` device. Think twice. 👆️
There was nothing new for me in this talk, but it was a good one, going through and un-confusing Linux software networking. The Core Principles track is sometimes like rolling dice. Nevertheless, I think it is important that we have this track and that we address all levels. After all, networking is an important, sometimes undervalued area of SRE expertise.
The SREcon is hosted by the Suntec convention center. It’s a fancy place with spacious rooms and modern equipment. Its entrance features a huge wall of full HD screens.
A wall of hundreds of full HD screens asking me to make an impact.
I am curious how such a crazy amount of screens is being controlled. It seems I am not the only one who noticed the nearby WiFi network named “LG_signage”. Coincidence? 🤔 I don’t think so!
Anyway, this is SREcon. DEFCON is still two months away. I shall be a good citizen! But it is not easy. Look at this display! It is even bragging about itself!
The giant screen of screens asking to be played with…