Day two. The coffee is still great and still much needed. ☕️ An evil twist by the kitchen crew: They swapped the places of the tea and the coffee dispensers. 😳 A lot of room-jumping today, because the most interesting talks always happen in the other room, right? Keeps the blood flowing, and after all, I am in the fitness industry.
Automatic Data Center and Service Deployments Based on Capacity Planning Artifacts
Xiaoxiang Jian from the Alibaba Group reported on the difficulties they faced with deploying data centers, including:
- Capacity planning
- Server planning
- Network cabling
- Bootstrapping of network, operating systems, and services
To give you some scale, we are talking about deploying tens to hundreds of data centers per month. According to Xiaoxiang this later went up to tens to hundreds of data centers per week. Impressive. Where are all these data centers?
Their solution to the problem was defining artifacts and thus building an immutable infrastructure. We see this pattern all over the industry: Definition wins. It enables rollbacks, version-controlled updates, and configuration lockdowns. Think of Kubernetes for data centers. Terraform comes to mind. But this goes a little deeper down the stack: Alibaba is baking images that end up on bare metal. They treat hardware as a black box whose final state is defined by software. We did similar things in the old days by leveraging PXE boot and pre-baked images, but back then we would have deployed the configuration via Puppet or Chef. Alibaba does the configuration upfront and delivers the final image to the hardware. Kind of nice, isn't it?
The interesting part now is, how do you define a data center? They are using a two-phase approach:
- Business planning: Categorize the services, plan capacity based on needs, plan the network
- Delivery planning: Generate the network and operating system configurations
The artifacts then are:
- Product: The final delivery for the business
- Service: A software concept deployed on a cluster
- Application: The real thing, runs as a process on a server
In practice, this could be a shop (product) that has a website (service) that runs on Tomcat (application). The same principle can be applied to heavier products, such as Elastic Computing or Block Storage. Obviously, this means defining a lot of dependencies, such as particular node capabilities. You don't want to run compute on a machine designed for storage. I'm probably wrong here, but I caught myself wondering whether this brings dependency hell to the hardware world, and whether that is a good or a bad thing. 🤔
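Just to wrap my head around it, the artifact hierarchy with its placement constraints could be modeled roughly like this. This is my own sketch with invented names (Product/Service/Application are from the talk, everything else is made up), not Alibaba's actual tooling:

```python
from dataclasses import dataclass


@dataclass
class Application:
    """The real thing: runs as a process on a server."""
    name: str
    needs: str  # required node capability, e.g. "compute" or "storage"


@dataclass
class Service:
    """A software concept deployed on a cluster."""
    name: str
    apps: list


@dataclass
class Product:
    """The final delivery for the business."""
    name: str
    services: list


def placement_errors(product, nodes):
    """Check every application against the capabilities of its target node."""
    errors = []
    for service in product.services:
        for app in service.apps:
            node = nodes.get(app.name)
            if node is None or app.needs not in node["capabilities"]:
                errors.append(f"{app.name} needs a {app.needs} node")
    return errors


shop = Product("shop", [Service("website", [Application("tomcat", needs="compute")])])
nodes = {"tomcat": {"capabilities": ["storage"]}}  # wrong node type on purpose
errors = placement_errors(shop, nodes)             # flags the mismatch
```

Resolving such constraints upfront is exactly what lets the final image be baked before it ever touches the hardware.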
The results, however, are quite impressive: Data center bootstrapping went down from previously 2 months to 24 hours (for most of the data centers).
Ensuring Reliability of High-Performance Applications
Anoop Nayak from LinkedIn started his talk with some interesting data on the status of the Internet in India:
- 79% of users access the Internet through a mobile device
- 85% use Android
- 75% use Google Chrome
The 99th percentile of the LinkedIn page load in India was approximately 24 seconds. They were aiming for 6 seconds. Their approach was to create a LinkedIn Lite website. Actions to get there included:
- Reduce the size of the mobile page, targeting a size under 75 KB
- No client-side frameworks, to speed up page paint
- Avoid client-side redirects, which account for approximately 2 seconds each on a slow network
- Leverage Server-Side Rendering (only sent necessary HTML content to the client)
- Early flushing sends HTTP headers to the client while the server is still rendering the HTTP body
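Early flushing, for example, can be sketched with a generator-based response: send the head of the page immediately, then stream the body once rendering finishes. A toy illustration with made-up function names, not LinkedIn's actual implementation:

```python
import time


def expensive_render():
    """Stand-in for server-side template rendering."""
    time.sleep(0.05)
    return "<h1>Feed</h1>"


def render_page():
    """Yield chunks as they become available; a real server would flush
    each chunk to the socket immediately."""
    # Flush the <head> right away so the browser can start fetching CSS
    # while the server is still rendering the body.
    yield "<html><head><link rel='stylesheet' href='/lite.css'></head>"
    body = expensive_render()  # server-side rendering happens here
    yield f"<body>{body}</body></html>"


chunks = list(render_page())
```

On a slow network, those saved round trips add up quickly.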
In the end, LinkedIn Lite is an app of about 1 MB in size that wraps a web browser tab. Now, how to monitor that? A lot of monitoring is happening on the server side. Additionally, some metrics, such as client-side load times, can be extracted from the DOM. A few more metrics come from a small, custom library.
Another cool thing that they did is using service workers. Service workers are like background threads, but for the web. They can make a website feel like a native app. A word of caution: A service worker running wild can render the whole app useless, so having a kill switch for service workers is essential. LinkedIn controlled service worker behavior by setting the cache control HTTP header to "private, max-age=0". This forces the service worker to throw away its cache, which can otherwise persist for up to 24 hours.
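The server side of such a kill switch could look roughly like this. A sketch under my own assumptions (handler shape and the no-op worker body are invented, only the header value is from the talk):

```python
def serve_service_worker(kill_switch_enabled: bool):
    """Return (headers, body) for the service worker script response.

    With the kill switch on, 'private, max-age=0' tells the browser to
    revalidate the service worker script on every load instead of
    serving a cached copy for up to 24 hours.
    """
    headers = {"Content-Type": "application/javascript"}
    if kill_switch_enabled:
        headers["Cache-Control"] = "private, max-age=0"
        # Hypothetical replacement worker that removes itself.
        body = "self.registration.unregister();"
    else:
        headers["Cache-Control"] = "max-age=86400"
        body = real_worker_source()
    return headers, body


def real_worker_source():
    return "/* real service worker code */"
```

The important property: flipping the switch takes effect on the next page load, not whenever a day-old cache happens to expire.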
- Status codes 3xx, 4xx, 5xx are all there. Monitor them!
- Page load times must be tracked and monitored.
- Service workers need a kill switch.
- Web views can break. All the phones need testing.
Link to the talk: Ensuring Reliability of High-Performance Applications
Debugging at Scale - Going from Single Box to Production
The speaker then showed some cool ideas:
- Use machine learning classifiers to analyze log data for negative wording. E.g., sentences like "action took too long" indicate a problem.
- Anomaly detection at scale.
- Near real-time metrics to detect problems earlier.
- Strip personal data from log files. Did someone say GDPR?
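The negative-wording idea can be approximated with something as simple as a keyword heuristic before reaching for a real classifier. A toy sketch of the concept, definitely not Microsoft's tooling:

```python
# Hand-picked markers; a production system would use a trained text
# classifier on labeled log data instead of a static word list.
NEGATIVE_MARKERS = ("too long", "failed", "timeout", "cannot", "retry exhausted")


def flag_suspicious(log_lines):
    """Return lines whose wording suggests a problem."""
    return [line for line in log_lines
            if any(marker in line.lower() for marker in NEGATIVE_MARKERS)]


logs = [
    "request handled in 12 ms",
    "action took too long, aborting",
    "connection failed after 3 retries",
]
flagged = flag_suspicious(logs)
```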
On a hack day at Microsoft, they created a HoloLens SRE tool (all in prototype stage). You need to see this! We had some good laughs in the audience. Yes, I can imagine doing my job with a HoloLens one day. Cool thing!
Link to the talk: Debugging at Scale - Going from Single Box to Production
Productionizing Machine-Learning Services: Lessons from Google SRE
Machine Learning is good for everything, except when:
- There is no fallback plan
- There is not enough labeled data
- One requires microsecond reaction time
Unsurprisingly, machine learning is used in almost every Google product. One of their most important models is the YouTube video recommendation model, which comes with its own challenges, such as seasonal topic peaks (Super Bowl), spam videos slipping into the training data, and regional popularity of videos.
Is an ML model just another data pipeline? Can we just run it like any other pipeline? Unfortunately, the answer is no.
Training and data quality: SREs run the models and the training in production because training is part of the production lifecycle of the model. New data comes in all the time, and models need to evolve fast. Since data quality is essential, SREs have to filter and impute data to avoid spam and overfitting. Snapshotting the model and warm-starting help to deal with varying compute resources. When input data pipelines are not balanced, e.g. due to an outage in a region, the model may develop a cultural bias towards the other regions. Google also leverages parallel training and then decides which output model to put into production. 🧐
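Filtering and imputing could look roughly like this. A generic sketch of the idea (all names and the mean-imputation strategy are my own assumptions, not Google's pipeline):

```python
def clean_training_rows(rows, spam_ids, feature_means):
    """Drop rows flagged as spam and fill missing feature values with a
    historical mean, so one broken input pipeline does not skew training."""
    cleaned = []
    for row in rows:
        if row["id"] in spam_ids:
            continue  # spam slipping into training data poisons the model
        features = {k: (v if v is not None else feature_means[k])
                    for k, v in row["features"].items()}
        cleaned.append({"id": row["id"], "features": features})
    return cleaned


rows = [
    {"id": 1, "features": {"watch_time": 120.0, "ctr": None}},  # missing value
    {"id": 2, "features": {"watch_time": 5.0, "ctr": 0.9}},     # known spam
]
result = clean_training_rows(rows, spam_ids={2},
                             feature_means={"watch_time": 60.0, "ctr": 0.1})
```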
Allocation of hardware resources (GPU, TPU): Google produces a new TPU version every year. Nevertheless, the cost of training grows at a higher rate than production resources. Currently, there is a lack of reliable multi-tenancy in parts of the training infrastructure (if I understood that right). Models are tested with the same binaries as in production, but there are still canaries. Canaries ensure that the new model behaves similarly to the old model; a completely different behavior would indicate a problem. Models then get signed before they end up in production. That's cool!
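The canary check (new model must behave similarly to the old one) can be sketched as a simple distribution comparison. The metric and threshold here are entirely my own invention, just to illustrate the principle:

```python
def canary_passes(old_scores, new_scores, max_mean_shift=0.05):
    """Compare average prediction scores of the old and new model on the
    same canary traffic. A large shift means the new model behaves
    differently and should not be promoted to production."""
    old_mean = sum(old_scores) / len(old_scores)
    new_mean = sum(new_scores) / len(new_scores)
    return abs(new_mean - old_mean) <= max_mean_shift
```

A real system would compare full distributions and per-segment behavior, not just a mean, but the promotion gate works the same way.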
Models come with their own set of problems. For example, if a model features new labels, you cannot roll that back. The only way back is to re-train the old model with the new labels. Not being able to roll back makes on-call life significantly harder, I assume. 🤨
War story: A particular demographic reacted differently to a new model compared to the old one. Fewer clicks (loss of revenue) was the result. The issue was solved by monitoring the models and alerting in cases like this. I wish the speakers had gone into more detail here. Sounds very interesting.
According to the speakers, ethics in machine learning is the big elephant in the room. So SREs at Google are able to stop machine learning predictions when a model behaves unethically. Experts also call for independent oversight; the AI Now Institute, for example, does that. Running predictions in an ethical and fair way is very important to SRE. This means SREs must always be able to stop any prediction that goes haywire. Essentially, SREs must be root on any model. The fact that models are signed before they go into production hints at how important (or how advanced) ML models are at Google. 🤖
Fun fact: There is a YouTube model that is over 1TB in size. Woah! 🤯
From the Q&A: How to start with SRE for ML models?
- Start with a very small model
- Have the model spend a long time in canary
- Have a data scientist ready when the model goes into production for the first time
- Have a rollback plan that includes entirely removing the model from production
Link to the talk: Productionizing Machine-Learning Services: Lessons from Google SRE
How to Serve and Protect (with Client Isolation)
From the Google Maps SRE team, Frances Johnson reported on client isolation. Maps has a lot of customers, internal and external. The Google Assistant, for example, is an internal customer of Maps, while me using my phone is considered an external request. Unexpectedly, Maps had monoliths and overload situations. Something I am way too familiar with in my current job.
Goals of the client isolation initiative:
- Clients should not be able to hurt others
- Gracefully degrade the service in an overload situation
- If you have to drop queries, be smart about which ones
Strategies they came up with:
- Caching: Cached queries are cheap, adding caching is often easy, but not possible for all queries (billing, strict consistency requirements, …)
- Quotas: Fun fact about quotas: Clients think a quota is a guarantee. Services think it is an upper limit. This can lead to over-subscription. They observed 7x over-subscription at some point in time.
- Load Shedding: Not all traffic is created equal. Background and batch jobs are less critical than a waiting user’s request. Always drop the lower priority requests first.
- Microservices: Not their largest problem, but they split up their monoliths.
- Separate Stacks: Maximum isolation. Everyone gets their own. Doesn't scale too well and produces quite a bit of toil.
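Priority-based load shedding from the list above can be sketched in a few lines. Client names, priorities, and the capacity number are made up; the point is only that background traffic always gets dropped first:

```python
def shed_load(requests, capacity):
    """Admit requests highest-priority first; drop the rest.
    Background/batch traffic (low priority) is always dropped before
    interactive user requests (high priority)."""
    ranked = sorted(requests, key=lambda r: r["priority"], reverse=True)
    return ranked[:capacity], ranked[capacity:]  # (admitted, dropped)


requests = [
    {"client": "batch-report", "priority": 1},
    {"client": "assistant",    "priority": 3},
    {"client": "mobile-user",  "priority": 3},
    {"client": "crawler",      "priority": 2},
]
admitted, dropped = shed_load(requests, capacity=2)
```

In an overload situation, the two interactive clients survive while the crawler and the batch job are shed.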
- If everything is the highest priority, then nothing is.
- "We put it there because we did not want to write another service and hand it over to SRE." (Teams, on why they put more stuff into a monolith)
- "Can you just exempt my super-important client?" (Users, on client isolation)
- A client big enough to ask for an exemption from client isolation is probably also big enough to damage your service. (So the answer must always be: no exemption!)
- Understand your queries and prioritize them accordingly!
Link to the talk: How to Serve and Protect (with Client Isolation)
A Tale of One Billion Time Series
Ruiyao Yao from Baidu talked about the monitoring systems in use at Baidu and the challenges they faced with Time Series Databases (TSDB). He started off with an example that I had a hard time following, but the money quote is: When you cannot reach www.baidu.com in China, your home network is broken!
If I understood Ruiyao correctly, Baidu is aiming for always up and that is the reason they invest so much in their monitoring. Monitoring data is furthermore used for capacity planning and troubleshooting. How much data are we talking about?
- Millions of targets
- 600+ metrics per target
- 1B time series in total
- 50TB written per day
- 10M read/write requests per second
- 40M data points in per second
- 60M data points out per second
- 50Gbps write and 100Gbps read
Writing time series is based on a log-structured merge tree. On top of that, they are using tables with a time to live (TTL) to expire data. There are data tables and metadata tables. Tags and indexes end up in the metadata table. The on-disk layout reminds me a bit of SSTables, but maybe I got the speaker wrong here. I had a hard time following the content at some points. Here is a slide with the layout.
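My mental model of that write path, an LSM-style memtable that flushes into immutable sorted runs, plus TTL-based expiry, looks roughly like this. A heavy simplification for intuition, not Baidu's code:

```python
import time


class TinyLSM:
    """Toy log-structured merge tree: writes go to an in-memory table,
    which is flushed to an immutable sorted run when full. Points older
    than the TTL are treated as expired on read."""

    def __init__(self, memtable_limit=4, ttl_seconds=3600):
        self.memtable = {}
        self.runs = []  # immutable sorted "files" on disk
        self.memtable_limit = memtable_limit
        self.ttl = ttl_seconds

    def put(self, key, value, ts=None):
        ts = ts if ts is not None else time.time()
        self.memtable[key] = (ts, value)
        if len(self.memtable) >= self.memtable_limit:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}  # start a fresh memtable

    def get(self, key, now=None):
        now = now if now is not None else time.time()
        # Check the memtable first, then runs from newest to oldest.
        for source in [list(self.memtable.items())] + self.runs[::-1]:
            for k, (ts, value) in source:
                if k == key and now - ts <= self.ttl:
                    return value
        return None
```

The linear scan over runs is also a decent illustration of why compaction matters: without it, reads touch ever more files.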
I liked it once I understood why query latency is so important to Baidu: They run anomaly detection in real time based on metrics stored in the TSDB. For low-resolution data, such as trends over an hour, they pre-aggregate the data online in the TSDB, a technique they call multi-level down-sampling. Down-sampled data is smaller in volume and can be stored longer, another benefit of down-sampling. If that wasn't enough, here is another optimization: Users can define key metrics, and the system also identifies so-called hot metrics. These metrics are then cached in Redis for even faster access.
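Multi-level down-sampling, as I understood it, sketched with made-up resolutions and max as the aggregation function:

```python
def downsample(points, bucket_seconds, agg=max):
    """Collapse raw (timestamp, value) points into one aggregated point
    per time bucket. Coarser levels are built from finer ones, so old
    data can be kept cheaply at low resolution."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % bucket_seconds, []).append(value)
    return sorted((ts, agg(vals)) for ts, vals in buckets.items())


raw = [(0, 1.0), (10, 3.0), (70, 2.0), (80, 5.0)]  # 10-second raw points
minute = downsample(raw, 60)      # 1-minute level, built from raw data
hour = downsample(minute, 3600)   # 1-hour level, built from the minute level
```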
The on-disk data format typically causes many disk seeks. That was partly solved by compacting the files. When files grow too large, they are split and distributed across different systems, so reading can happen in parallel on multiple machines. Nevertheless, compactions are still expensive. Thanks to the JVM, additional fun is provided by stop-the-world events. (I have a very difficult relationship with the JVM 😇)
This talk was really interesting and dug deeper into fascinating engineering problems as it progressed.
- There are latency-sensitive and latency-insensitive queries to the TSDB. Treat them differently to optimize for each type of query.
- People like to query over all the hosts or the whole year.
- Unfortunately, our TSDB stack uses Java as the main language. 😎
Link to the talk: A Tale of One Billion Time Series
Isolation without Containers
Tyler McMullen from Fastly shared his thoughts on isolation on bare metal.
From the broader topic of general isolation, we quickly went via fault detection, isolation (from a control-engineering perspective), and recovery into how processes are managed by the kernel.
A process is basically memory (containing data and code) plus metadata in the kernel (think of Linux's task_struct).
Unsurprisingly, containers are just processes with resource isolation applied via namespaces.
And here is why Fastly is interested in this: High performance systems with many small tenants and strict latency requirements may find VMs, containers, and even processes all far too heavyweight.
To achieve isolation without using all these technologies, one just has to make sure that the control flow and the data are understood and cannot run wild. Easy, right?
Interestingly, WebAssembly meets these requirements, as it is bounds-checked and cannot load arbitrary libraries.
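To get an intuition for what bounds checking buys you, here is a toy sandbox. WebAssembly enforces this at the instruction level on its linear memory; this Python class is only an illustration of the property, not how Fastly implements it:

```python
class SandboxMemory:
    """Linear memory where every load/store is bounds-checked, so a
    tenant can never read or write outside its own region. This is the
    property that makes many tenants in one process plausible."""

    def __init__(self, size):
        self._mem = bytearray(size)

    def store(self, addr, value):
        if not 0 <= addr < len(self._mem):
            raise MemoryError(f"trap: store out of bounds at {addr}")
        self._mem[addr] = value & 0xFF  # bytes only, like Wasm's i32.store8

    def load(self, addr):
        if not 0 <= addr < len(self._mem):
            raise MemoryError(f"trap: load out of bounds at {addr}")
        return self._mem[addr]


mem = SandboxMemory(64)
mem.store(7, 42)
```

An out-of-bounds access traps instead of corrupting a neighbor, which is exactly the guarantee processes normally need hardware page tables for.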
Because we ran out of time, Tyler had to skip over the most interesting part of the talk. But essentially, Fastly found a way to compile multi-tenant WebAssembly code into a process that can run safely on bare metal.
Wrap your head around this!
I enjoyed this talk very much, as it was more low-level than the other talks of that day. I love low-level talks.
- Fault isolation is really about reducing the set of possible faults to a knowable, recoverable set.
- Everything in the memory is basically the wild west. 🤠
Link to the talk: Isolation without Containers
This is my personal summary which comes without further explanation. Think of this as a note to myself that accidentally went public:
- Defined state wins over doing something.
- One can build fast websites and apps for slow networks, but trade-offs must be made.
- SRE’ing machine learning models is a whole field of its own. Highly interesting, I want to learn that! Comes with high responsibility, which is something I like.
- Client isolation is important. Luckily there are plenty of techniques and strategies. But implementing only a single one is never enough.
- Scaling a TSDB isn’t trivial but comes with a ton of interesting problems.
- WebAssembly is awesome. And safely abusable. 😅
Also, this happened: I won a selfie drone at a raffle at the LinkedIn booth. Thank you so much, guys! I love it!