How To Write A Tiny Shell In C

I was wondering how complex some shells are. That got me thinking about what a very minimal, but usable, shell would look like. Could I write one in fewer than 100 lines of code? Let’s see!

A shell needs to execute commands. It does this by overlaying the current process image with a new program and executing it (a lot more happens here, actually, but that is for another time). The exec() family of functions does exactly that, so it seems like a nice starting point. In a file named myshell.c I wrote:

#include <unistd.h>

int main(void) {
    execvp("date", (char *[]){"date", NULL});
}

Then I compiled the code:

$ gcc -Wall -pedantic -static myshell.c -o mysh

And executed it:

$ ./mysh
Mon Jun 18 18:51:38 UTC 2018

Cool! I hardcoded date here. But actually, a shell should prompt for the command to execute. So I went on and made the shell ask for a command to run.

#include <unistd.h>
#include <stdio.h>
#include <string.h>

#define PRMTSIZ 255

int main(void) {
    char input[PRMTSIZ + 1] = { 0x0 };
    fgets(input, PRMTSIZ, stdin);
    input[strlen(input) - 1] = '\0'; // remove trailing \n

    execvp(input, (char *[]){input, NULL});
}

And ran it:

./mysh
date
Mon Jun 18 19:27:21 UTC 2018

OK, nice. But that fails if I want to run a command with parameters:

./mysh
ls /

Nothing happens. I need to split the input into an array of char pointers, each pointing to a string containing one argument:

#include <unistd.h>
#include <stdio.h>
#include <string.h>

#define PRMTSIZ 255
#define MAXARGS 63

int main(void) {
    char input[PRMTSIZ + 1] = { 0x0 };
    char *ptr = input;
    char *args[MAXARGS + 1] = { NULL };

    // prompt
    fgets(input, PRMTSIZ, stdin);

    // convert input line to list of arguments
    for (int i = 0; i < MAXARGS && *ptr; ptr++) {
        if (*ptr == ' ') continue;
        if (*ptr == '\n') break;
        for (args[i++] = ptr; *ptr && *ptr != ' ' && *ptr != '\n'; ptr++);
        *ptr = '\0';
    }

    execvp(args[0], args);
}

And running it yields:

./mysh
ls /
bin  boot  dev	etc  home  lib	lib64  ✂️

It worked! I am getting closer. Wouldn’t it be great if the shell did not exit after one command, but asked for the next command every time the current one terminated? I think it is time for my old friend fork() to enter the scene!

#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/wait.h>

#define PRMTSIZ 255
#define MAXARGS 63

int main(void) {
    for (;;) {
        char input[PRMTSIZ + 1] = { 0x0 };
        char *ptr = input;
        char *args[MAXARGS + 1] = { NULL };

        // prompt
        fgets(input, PRMTSIZ, stdin);

        // convert input line to list of arguments
        for (int i = 0; i < MAXARGS && *ptr; ptr++) {
            if (*ptr == ' ') continue;
            if (*ptr == '\n') break;
            for (args[i++] = ptr; *ptr && *ptr != ' ' && *ptr != '\n'; ptr++);
            *ptr = '\0';
        }

        if (fork() == 0) exit(execvp(args[0], args));
        wait(NULL);
    }
}

I now fork a child every time I want to execute a command. The child overlays itself with the new program and terminates with that program’s exit status. If execvp() fails, it returns -1, which exit() truncates to 255, so a failed overlay shows up in the parent as exit status 255. The parent process waits for the child to finish. Everything happens in an infinite loop for(;;) to allow more than just one command.

./mysh
date
Mon Jun 18 19:44:27 UTC 2018
ls /
bin  boot  dev	etc  home  lib	lib64  ✂️

Despite being very limited in functionality, I think this now counts as a shell.

I couldn’t stop myself from adding a few more things:

  • Disable signal SIGINT in the parent: This means I can interrupt (ctrl-c) a child process without killing my shell. Very useful 😅
  • Add a visual prompt: $ for users and # for superusers.
  • Print the exit code of the child, e.g. <1> or <0>
  • Check for empty input: Because segfaulting is not nice. 🙈

Here is the final code:

#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>

#define PRMTSIZ 255
#define MAXARGS 63
#define EXITCMD "exit"

int main(void) {
    for (;;) {
        char input[PRMTSIZ + 1] = { 0x0 };
        char *ptr = input;
        char *args[MAXARGS + 1] = { NULL };
        int wstatus;

        // prompt
        printf("%s ", getuid() == 0 ? "#" : "$");
        fgets(input, PRMTSIZ, stdin);

        // ignore empty input
        if (*ptr == '\n') continue;

        // convert input line to list of arguments
        for (int i = 0; i < MAXARGS && *ptr; ptr++) {
            if (*ptr == ' ') continue;
            if (*ptr == '\n') break;
            for (args[i++] = ptr; *ptr && *ptr != ' ' && *ptr != '\n'; ptr++);
            *ptr = '\0';
        }

        // ignore input that contained no arguments (e.g. only spaces)
        if (args[0] == NULL) continue;

        // built-in: exit
        if (strcmp(EXITCMD, args[0]) == 0) return 0;

        // fork child and execute program
        signal(SIGINT, SIG_DFL);
        if (fork() == 0) exit(execvp(args[0], args));
        signal(SIGINT, SIG_IGN);

        // wait for program to finish and print exit status
        wait(&wstatus);
        if (WIFEXITED(wstatus)) printf("<%d>", WEXITSTATUS(wstatus));
    }
}

Running as root in a container:

./mysh
# ls /
bin  boot  dev	etc  home  lib	lib64  ✂️
<0># date
Mon Jun 18 19:50:09 UTC 2018
<0># nonexistent-command
<255># false
<1># true
<0># exit

In the end, I was able to write a tiny shell with limited capabilities in less than 50 lines of C code. That is less than half of what I aimed for.

SREcon Asia/Australia Day 3 (Report)

Last day of SREcon Asia 2018. Did I mention how important coffee is to get me started on a conference day? No? Coffee is much-needed. And the coffee here is great. ☕️😜

A word of warning: Today’s talks were packed with information. The report got very long. I apologize in advance!

Interviewing for Systems Design Skills

Sebastian Kirsch from Google Switzerland has conducted over 200 interviews, was a member of the hiring committee, and trained engineers in interviewing. We can check off credibility at this point, I think. When they started hiring SREs in the early days of that role, there was no job description because the job did not exist before. What they expect from candidates in a systems design interview:

  • Design a system for us in the interview
  • Be aware of performance needs, resource needs, and bottlenecks
  • Design from scratch for problems that have not been solved or existed before
  • Candidates need to know why each selected component is a good fit

Other system aspects with regard to manageability and lifecycle:

  • How to maintain the components
  • How to scale the system
  • How to upgrade the systems (versioning)
  • How to migrate data or the system

In an interview your time is limited, so they can’t go into too much detail and complexity. But there is a trick: Build a simple system and scale it up. Even simple systems become complex within minutes just by scaling them up. That can be used to talk about complexity even in time-bounded interviews.

Example question: Design a system to copy a file to some machines. Easy, right? Just do a for loop and rsync the file to each target system. Now, scale! Ludicrous requirements are allowed:

  • File is 100GB in size
  • Once a day
  • 100k target machines
  • Machines are placed on the moon (uh, latency!)
  • Source machine has a 100Mbit NIC (and there goes sequential file transfer down the drain. Multicast anyone?)

Sebastian didn’t solve the puzzle in his talk, as this was only a demonstration of how easy it is to scale a simple question into a complex engineering challenge.
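Just for fun, here is my own back-of-envelope calculation (mine, not from the talk) to show how quickly the numbers explode once the file-copy question is scaled up:

#include <stdio.h>

int main(void) {
    double file_bytes = 100e9;        // 100 GB file
    double machines   = 100e3;        // 100,000 target machines
    double nic_Bps    = 100e6 / 8.0;  // 100 Mbit/s NIC, roughly 12.5 MB/s

    double total_bytes = file_bytes * machines;  // data to deliver per day
    double serial_secs = total_bytes / nic_Bps;  // if the source pushed every copy itself

    printf("data per day: %.0f PB\n", total_bytes / 1e15);       // 10 PB
    printf("serial push:  %.0f days\n", serial_secs / 86400.0);  // ~9259 days, about 25 years
    return 0;
}

Pushing every copy sequentially from the source would take roughly 25 years for one day’s worth of demand, so some kind of distribution tree (or multicast, or peer-to-peer fan-out) is not optional. Which is exactly the point of the exercise.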

How would one design such a question? Start with a very simple problem. Copy something. Process some data. Pass a message. It should be something from an area you are familiar with. And then just scale up. Add a crazy requirement so that the problem needs at least a hundred thousand machines to be solved. The next step is to find the bottlenecks in the system. Typical bottlenecks are disk IO, network bandwidth, and disk seek times. A good question has a couple of bottlenecks. Ideally, it should not be clear from the beginning which dimension the bottleneck is in. Fun fact: Designing these questions becomes harder over time as hardware contains more magic nowadays. Compared to the past, current computers are supercomputers, with tons of cores and unreasonable amounts of memory. A lot of problems fit into the memory of a machine with 96 cores and 1.5 TB of RAM. And that is something you can just click together on Google Cloud Platform. Oh, and disk seeks? You can buy guaranteed IOPS on SSDs in the cloud. Snap! Hardware takes all the fun out of engineering problems. Sebastian suggested just making the problem bigger (oh, did I mention that we want to serve all humans and extraterrestrials? Design for that!). The other option the interviewer has is to limit the candidate’s choice of machines. I like the first option better. We have IPv6, so multi-planet distributed systems should at least be addressable by now.

Another challenge for the interviewer is that at some point maths (e.g. a total bandwidth calculation) is going to happen. That has to be checked while the interviewer is also taking notes and already thinking about the next steps. Solution: Have a cheat sheet with pre-calculated values for the anticipated solutions. Also on the cheat sheet: possible follow-up questions.

Practice: Never go into an interview with a question that you have not tested. Testing on an external candidate is neither fair nor does it give the required amount of signal. The first iteration is probably way too simple or way too complex. The problem description could be misleading. Practice can iron out all of these problems and calibrate your expectations. Test the questions with your coworkers. Highly relatable! I remember how surprised I was at how long it takes to solve one of our coding questions when you have never seen it before. I once spent an hour with a candidate sorting an array and still got a ton of signal out of that.

The hiring bar: The problem is not really whether a candidate is below or above the hiring bar. The problem is more that we do not know what the candidate’s skill level really is. So the hiring bar problem is more an uncertainty problem. There is a huge error bar or grey area. A lot of candidates will touch the hiring bar at some point. The interviewer’s job is to decrease uncertainty and become certain whether a candidate’s skill levels are below or above the hiring bar. Essentially, it is a classification problem with a lot of uncertainty.

Interview time must be used to decrease uncertainty. Everything else is just time wasted. To gain as much signal as possible, start by making clear what the expectations are. This includes technical requirements. It also includes making clear that you expect the candidate to reason about their choices and to ask the interviewer questions. Careful steering enables the interviewer to use the time more effectively.

Hints from the practice:

  • Many candidates forget to provide concrete resource estimates (e.g. number of drives needed). Question: What goes on the purchase order for this system?
  • Sometimes there are no clear boundaries between the systems and how they work together. Question: You have N teams of engineers that are going to implement the system. What entities should each team work on?
  • Candidates may get stuck in precise arithmetic. A hint could be to offer rounded values: A day consists of N seconds. Use that value to proceed. In the end, it is all about orders of magnitude.
  • The magic bullet: Candidates use a standard technology because they think we ask for knowledge. Question: How does this technology work?

How do we determine how well a candidate did?

Go into the interview with expectations. Compare the results with the expectations. Have expectations for each axis or dimension you are looking for. Does the designed system solve the problem? (Often, the system solves a different problem. Woot?) Good candidates trade resource dimensions for each other. E.g. use more CPU time to reduce network bandwidth. Did the candidate think about a possible SLO?

(Note: No photo here. Projector was broken and Sebastian ran the whole thing without slides. Playing it cool. Well done, sir!)

From the Q&A: Shall I read previous interview feedback on that candidate before going into the interview? Better not. It introduces an unfair bias and gives the first interviewers more weight in the process. At Google they do not propagate the actual feedback, but the topics, so that they minimize overlap without biasing the result.

Link to the talk: Interviewing for Systems Design Skills

Scaling Yourself for Managing Distributed Teams Delivering Reliable Services

Paul Greig is an SRE lead at Atlassian and talked about the challenges one faces when managing a distributed team. We are talking about 18 SREs and two team leads who manage two services distributed across three geographic regions.

He started with the question of what a distributed team is. His definition ranges from a single worker somewhere in the world to teams where everyone is in a different place, collaborating over the Internet.

Benefits of a distributed team:

  • Better geographic coverage
  • Talent can be hired regardless of where the talent wants to live
  • Colocation: The SREs can be closer to the developers (Not sure I got this one correctly. I’d say it often is the opposite?!)
  • Better On-call, because time zones.

Challenges of a distributed team:

  • Friction and lack of cohesion: The divide leads to absence of communication sometimes.
  • Duplication: When the right hand doesn’t know what the left hand does, work is often duplicated.
  • Costs: Distributed teams are actually expensive. The impact may be a little lower than that of a team sitting in the same room.
  • Imbalance: A day in the life of one engineer can be very different from that of another. Paul visited his team members and found that they work very differently.

Now, how to scale yourself (as a leader) in a distributed environment? Three aspects:

  • Presence
  • Planning
  • Balance

Presence

Establish trust. “This is a huge one!” Paul said. By being present and asking questions (which includes active listening) he built trust. Jumping to solving problems without establishing trust first doesn’t work. As leading a distributed team involves a lot of travel, you need to plan ahead to keep work and life in a healthy balance. Also, engineers sometimes like to travel to the HQ to mix and mingle. Schedule one-on-ones at a time that suits both the leader and the team member, who might be in a different time zone. Example: “After I sent the kids to sleep I spent two hours on being present (online) for the team in a different time zone.” Allow teams some time to think about a problem and get to results. There is a time gap, naturally. However, at some point ask to wrap an issue up. It’s mostly about reasonable expectations. Participate rather than rushing the process.

Planning

Local team vs. remote team: When working with the local team the whole day, how can I make sure that I give the same amount of attention to remote team members? Plan for delays in pull requests. The velocity of big projects can slow down if you are waiting for feedback from a different time zone (and that time zone is currently asleep). Is separating projects and assigning them to different time zones a solution? Not really! So in the retro they introduced end-of-quarter demos: regardless of the project state, there had to be a demo. That gave the teams a common goal, brought velocity back into the project, and kept everyone on the same page.

Balance

Maintain balance for yourself, but also for your team. The team will only remain successful if they take care of their individual balance as well. This obviously includes work-life balance. Balance three aspects 👩🤚❤️: Head (mental), Hands (practical), and Heart (emotional).

Money Quotes:

  • We saw disheartening duplicate solutions to the same problem. Great for redundancy, but expensive.
  • Inspiration: I want to see a thirst from the engineers of the team to arrive at a fantastic outcome.
  • We did not have these “sitting together at lunch time” sort of moments. So we played Keep Talking and Nobody Explodes. (Also an awesome incident management communication training.)
  • I am not saying video games are the solution to every problem, but… 😂
  • Listen, Ask, Tell

Paul published his Talk Resources

Link to the talk: Scaling Yourself for Managing Distributed Teams Delivering Reliable Services

Mentoring: A Newcomer’s Perspective

Leoren Tanyag from REA Group grew up with little exposure to technology, which made her want to go into exactly this industry. She went through the graduate/internship program of REA and shared her experiences.

REA acknowledges that school is different and that an un-mentored environment is not something they want their people to be exposed to. They communicate expectations on what to learn and where to start. Of course the mentee has a say in driving the direction. Emphasis is on pairing with co-workers.

Concept of the three sixes:

  • On your first 6 days: I don’t know everything
  • On your first 6 weeks: I know what I’m doing. No, you don’t!
  • On your first 6 months: I may know what I need to know, now I have to continually improve.

Interestingly: Even after ten years people often look up to their mentors.

Leoren reports that mentees’ expectations are often not very high. They are so happy that they can learn from someone they respect. So there is no excuse for an experienced engineer not to mentor. Even if you are not very experienced, there is something you can share and a mentee can learn from you. I cannot agree more. I, personally, expect sharing knowledge from every engineer in my team. Mentoring is one form of that, and it is everyone’s responsibility to make every other team member succeed.

Forms of mentoring (different in structure and frequency):

  • One-on-one mentoring: Highly tailored to a mentee’s needs
  • Group Learning sessions: More efficient if multiple mentees share the same interest (e.g. learning a new technology). Also, some mentees feel less pressure in a group and enjoy having a group to talk to afterwards. Interesting aspect I have not thought about before!
  • Casual Pairing: Supports bonding and transfers hands-on knowledge. Communication heavy, obviously.

Money Quotes:

  • Good Mentors come from those that have been mentored before. Start the trend!
  • Teaching is the best way to learn.
  • Mentoring can help us grow (mentors become role models and may hold themselves to higher standards once they realize that)

From the Q&A: How should mentoring relationships be started? Driven by the mentee or the mentor?

Leoren: We have a queue of people who agree to volunteer time. We use vouchers to get people in touch with each other.

Link to the talk: Mentoring: A Newcomer’s Perspective

(Unrelated side note: Was this Comic Sans in the slides? Looks like Comic Sans but then it also looks different. Why do I care about this at all? Why are fonts such an interesting thing?)

Blame. Language. Sharing: Three Tips for Learning from Incidents in Your Organization

Lindsay Holmwood from Envato started with a story from when he was a teenager and was challenged with a serious illness. From there it went up and down (story-wise), including topics like the emergency room, cancer, surgical leftovers remaining in the body, and open heart surgery. A mistake was made in the treatment, and it was made sure that the person who made the mistake would not practice medicine anymore.

Years later, when he was running a cancer-related charity, he accidentally took down the website in the middle of one of the most important campaigns. He did not receive any punishment for that.

Comparing these two stories, he realized how differently organisations respond to failure and errors. Some look for someone to blame, some don’t. The narrative for safety here is the absence of failure (e.g. in process, technology, people). If you think this through, it maintains a culture where any of these three can be a danger. People are a harm to safety here. Not necessarily a nice environment to work in, right? This all refers to a bi-modal operation mode: It is either safe, or it is not working at all.

In our industry, things are a bit different. Our systems are so large and complex that they are always running in a partially degraded state. But they still work, often still generating revenue, even though they are degraded. That asks for a different culture. We have to embrace different perspectives; even contradictory statements (think: views) about the state of a system can both be true at the same time.

The people doing the frontline work, in our culture, focus on quality and delivery. But they do not focus on covering their mistakes.

Lindsay then identified the three aspects of our culture he thinks are most important:

  • Language
  • Blame
  • Sharing

Language

Why is a good word to use for talking about systems. Why did that action take so long? But it is a bad word to use with regard to people. It often carries some blame. Why didn’t you follow the runbooks? Be careful using Why; you do not want to question someone’s personality, right? How is better, but may limit the scope too much. How did that happen? What ties in better with local rationality. The latter is a concept that assumes that people make trade-offs based on the limited information they had at the given time, and with good intentions. Think about someone moving forward in a dark tunnel with a small flashlight. You never see the whole picture, only what you point the flashlight at. In hindsight, of course, things are always different. So What asks how the tunnel looked at the time you were inside it.

Blame

Thinking of people as hazards ignores one truth: Sometimes bad things happen and no one is to blame. Things go right more often than they go wrong. Finger pointing is basically a cognitive bias. It’s something we have little control over. Our brain (perception) makes trade-offs between timeliness and accuracy all the time when processing information. That often brings problem solving down to heuristics. Lots of psychology here! He played a classic in confirmation bias: The monkey. Knowing that, think about how well we will do during an incident. Hooray for those who are not biased. Spoiler: No one is immune to bias. Another thing to watch out for: In hindsight we tend to assign a failure to a person, not a system. The more negative the outcome, the more biased we are towards blaming a person, not a system.

Sharing (This part was so interesting, I forgot to take notes. Sorry!)

My takeaway: As a leader in SRE, it is your responsibility to create a psychologically safe environment. That is quite in line with my personal beliefs and also what I just recently read in Leaders Eat Last. Is this bias in action? 🤔 Oh, this is becoming meta now! 🤪 [But seriously, read the book and learn how we (ex-)military leaders are wired up internally and how we build teams where people are willing to risk their lives for each other!]

This is quite a long writeup of the talk, but it covers only a part of it. There was so much information packed into that hour. If you want to get the whole picture, bookmark the talk and wait for the video to be released. There you can also learn how Lindsay’s teenage story about life and death ended. I won’t spoil it here. (OK, you obviously figured out that he survived. But the details matter!)

Link to the talk: Blame. Language. Sharing: Three Tips for Learning from Incidents in Your Organization

A Theory and Practice of Alerting with Service Level Objectives

Yesterday evening, at the reception, I asked Jamie Wilkinson (Google) a ton of questions related to a tricky SLO measurement problem we currently face at work. Every now and then he would refer to his upcoming talk. So I went into this one with expectations set high.

He explained the difference between alerting on causes and alerting on symptoms. While it is convenient to know exactly what went wrong (the cause), it is much better to alert on symptoms. Why? Because it results in fewer alerts and avoids alert fatigue much better. Every alert disturbs a person’s life when they are on call. And in the end we care about people more than we care about a dying disk, right?

Focus on alerting on a very small set of expectations that matter for the users of the service. Do not focus on alerting on all the nitty-gritty details, timeouts, and other metrics that may cause a problem.

How do we distinguish between a symptom and a cause? One rule is to ask yourself: Does this affect the user’s experience? For example, if you have reasonable latency, who cares about the depth of a queue in a backend system? No need to alert on that; we can solve it during office hours. Or maybe decide not to touch it at all, as it might just be fine.

There was also some discussion about where to measure time series and what kind of time series. For example, it is probably easy to increase request counters within the application. However, measuring at the load balancer also lets us catch crashing services as they become unavailable. I run a service at work where we use both, mostly because we agreed to an unfortunate SLO that turned out to be extremely tricky to measure with the given tools. Even Matt suggested that we may be up for an adventure trying to visualize that specific SLO using the current feature set of Stackdriver. Setting SLOs is hard…

Back to the talk: Jamie introduced the concept of SLO burn rate. I like that very much. Alerting on the burn rate is something I definitely will bring back to my team in Munich.
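For my own notes, here is a minimal sketch of how I understand the burn rate (my interpretation, not code from the talk; the numbers are made up):

#include <stdio.h>

int main(void) {
    double slo                  = 0.999;      // 99.9% success target over the SLO window
    double error_budget         = 1.0 - slo;  // allowed error ratio: 0.001
    double observed_error_ratio = 0.004;      // errors / total requests, measured recently

    // Burn rate: how fast the error budget is consumed relative to a rate
    // that would use it up exactly at the end of the window. 1.0 means
    // "on track"; alert when it is significantly higher.
    double burn_rate = observed_error_ratio / error_budget;
    printf("burn rate: %.1fx\n", burn_rate);  // 4.0x
    return 0;
}

A burn rate of 1.0 just uses up the budget; a burn rate of 4.0 would exhaust it after a quarter of the window, which is probably worth waking someone up for.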

Money Quotes:

  • If we don’t use the error budget to our advantage, it will just be lost. Use the error budget!
  • If a microservice crashes in the cloud and nobody notices, does it make a sound? 🤔😂(I’d add to this: Should it make a sound? Why? Why not?)

Link to the talk: A Theory and Practice of Alerting with Service Level Objectives

Production Engineering: Connect the Dots

Espen Roth talked about how graduates are onboarded in Production Engineering at Facebook. And what challenges come with that. He was hired by Facebook right after school, without a previous internship there, so he experienced the process firsthand.

New graduates build things from scratch in school. They ran small, self-contained projects that they had to finish in time. They know what is currently out there. What they lack is experience in maintaining a system over the long term. Teamwork is even more important in the job world, and unlike school, in a job you are there for the long run. And sometimes you have little choice about what you have to work on. The grades don’t matter anymore, but impact does. Working in Production Engineering is very different from school.

As a company interviewing graduates, you want to look at three aspects:

  • Capacity (skills, learning)
  • Passion
  • Opportunity

Internships are the ultimate interviews. Internships provide a long exposure to feedback and hiring a former intern requires little ramp-up.

Link to the talk: Production Engineering: Connect the Dots

Mental Models for SREs

Mohit Suley (Microsoft) woke up the audience by asking a couple of questions to test our mental models before the actual talk started.

The first mental model he talked about was the survivorship bias. The connection to SRE is that survivorship bias can influence metrics.

Then he went on to introduce the ludic fallacy and explained it using examples. We laughed at a system that was designed with too little login capacity. Another example was Tay, which still seems to be something that people inside Microsoft joke about. No plan survives first contact with the enemy…

Then I learned about decision fatigue: the capacity to make decisions decreases over time. A practical piece of advice, therefore, is to not go online shopping when you’re on-call.

It got better! If a thing has proven itself over a long period of time, we call this the Lindy effect. How does this connect to the SRE world? Mohit suggested that even in the age of the cloud, client computing is not in decline. In fact, our devices are more powerful than ever; the cloud (remote compute) just grows faster. Another one: Email is still around in 2018. Probably here to stay.

There is much more:

Link to the talk: Mental Models for SREs

My Summary

This is my personal summary which comes without further explanation. Think of this as a note to myself that accidentally went public:

  • Learn more about rubrics and how I can use them to gain better signal from interviews.
  • Managing a distributed team requires thoughtful and patient leadership.
  • Keep mentoring. It means a lot to mentees. Also, being a mentor and/or a mentee is a learning experience.
  • Humans are biased. Providing psychological safety is key to good leadership.
  • Symptom-based alerting. SLO burn rate alerting. It’s better for human health. It makes so much sense!
  • Mental models are fun! Easy to get lost in Wikipedia when you start looking them up…

Wow, you made it to the end of this very long article! Thank you so much for reading. As always, feedback is highly appreciated. As a reward, have this picture of a cute squirrel I met earlier this year in New York:

SREcon Asia/Australia Day 2 (Report)

Day two. The coffee is still great and still much needed. ☕️ An evil twist by the kitchen crew: They swapped the places of the tea and the coffee dispensers. 😳 A lot of jumping between rooms today because the most interesting talks always happen in the other room, right? Keeps the blood flowing, and after all, I am in the fitness industry.

Automatic Data Center and Service Deployments Based on Capacity Planning Artifacts

Xiaoxiang Jian from the Alibaba Group reported on the difficulties they faced with deploying data centers, including:

  • Capacity planning
  • Server planning
  • Network cabling
  • Bootstrapping of network, operating systems, and services

To give you some scale, we are talking about deploying tens to hundreds of data centers per month. According to Xiaoxiang this later went up to tens to hundreds of data centers per week. Impressive. Where are all these data centers?

Their solution to the problem was defining artifacts and thus building an immutable infrastructure. We see this pattern all over the industry: Definition wins. It enables rollbacks, version-controlled updates, and configuration lockdowns. Think of Kubernetes for data centers. Terraform comes to mind. But this goes a little deeper down the stack. Alibaba is baking images that end up on bare metal. They treat hardware as a black box whose final state is defined by software. We’ve done similar things in the old days by leveraging PXE boot and pre-baked images, but back then we would have deployed the configuration via Puppet or Chef. Alibaba does the configuration upfront and delivers the final image to the hardware. Kind of nice, isn’t it?

The interesting part now is, how do you define a data center? They are using a two-phase approach:

  • Business planning: Categorize the services, plan capacity based on needs, plan the network
  • Delivery planning: Generate the network and operating system configurations

The artifacts then are:

  • Product: The final delivery for the business
  • Service: A software concept deployed on a cluster
  • Application: The real thing, runs as a process on a server

In practice, this could be a shop (product) that has a website (service) that runs on a Tomcat (application). The same principle can be applied to heavier products, such as Elastic Computing or Block Storage. Obviously, this means defining a lot of dependencies, such as particular node capabilities. You don’t want to run compute on a machine designed for storage. I’m probably wrong here, but I was just asking myself if that doesn’t bring dependency hell to the hardware world and whether this is a good or a bad thing. 🤔

The results, however, are quite impressive: Data center bootstrapping went down from previously two months to 24 hours (for most of the data centers).

Link to the talk: Automatic Data Center and Service Deployments Based on Capacity Planning Artifacts

Ensuring Reliability of High-Performance Applications

Anoop Nayak from LinkedIn started his talk with some interesting data on the status of the Internet in India:

  • 79% of users access the Internet through a mobile device
  • 85% use Android
  • 75% use Google Chrome

The 99th percentile of the LinkedIn page load in India was approximately 24 seconds. They were aiming for 6 seconds. Their approach was to create a LinkedIn Lite website. Actions to get there included:

  • Reduce the size of the mobile page, targeting a size under 75 KB
  • No client-side frameworks, which speeds up the first page paint
  • Avoid client-side redirects, which account for approximately 2 seconds each on a slow network
  • Leverage server-side rendering (only send necessary HTML content to the client)
  • Early flushing sends HTTP headers to the client while the server is still rendering the HTTP body

In the end, LinkedIn Lite is an app of about 1 MB in size that wraps a web browser tab. Now, how to monitor that? A lot of monitoring is happening on the server side. Additionally, some metrics, such as client-side load times, can be extracted from the DOM. A few more metrics come from a small, custom library.

Another cool thing they did is use service workers. Service workers are like background threads, but for the web. This can make a website feel like a native app. A word of caution: A service worker running wild can render the whole app useless, so having a kill switch for service workers is essential. LinkedIn controls service worker behavior by setting the Cache-Control HTTP header to private, max-age=0. This forces the service worker to throw away its cache, which can otherwise persist for up to 24 hours.

Money Quotes:

  • Status codes 3xx, 4xx, 5xx are all there. Monitor them!
  • Page load times must be tracked and monitored.
  • Service workers need a kill switch.
  • Web views can break. All the phones need testing.

Link to the talk: Ensuring Reliability of High-Performance Applications

Debugging at Scale - Going from Single Box to Production

Kumar Srinivasamurthy from Microsoft (Bing and Cortana Engineering) started off with the history of the computer bug and then quickly went to current tracing tools like Zipkin and BPerf.

And then he showed some cool ideas:

  • Use machine learning classifiers to analyze log data and find negative wording. E.g. sentences like “action took too long” indicate a problem.
  • Anomaly detection at scale.
  • Near real-time metrics to detect problems earlier.
  • Strip personal data from log files. Did someone say GDPR?

On a hack day at Microsoft, they created a Hololens SRE tool (all in prototype stage). You need to see this! We had some good laughs in the audience. Yes, I can imagine doing my job with a Hololens one day. Cool thing!

Link to the talk: Debugging at Scale - Going from Single Box to Production

Productionizing Machine-Learning Services: Lessons from Google SRE

Google SREs still fill the rooms, even if it is right after lunch. Salim Virji and Carlos Villavieja shared their lessons learned from applying machine learning to production.

Machine Learning is good for everything, except when:

  • There is no fallback plan
  • There is not enough labeled data
  • One requires microsecond reaction time

Unsurprisingly, machine learning is used in almost every Google product. One of their most important models is the YouTube video recommendation model. Which comes with its own challenges, such as seasonal peaks of topics (Super Bowl), spam videos slipping into the training data, and regional popularity of videos.

Is an ML model just another data pipeline? Can we just run it like any other pipeline? Unfortunately, the answer is no.

Training and data quality: SREs run the models and the training in production because training is part of the production lifecycle of the model. New data comes in all the time, and models need to evolve fast. Since data quality is essential, SREs have to filter and impute data to avoid spam and overfitting. Snapshotting the model and warm-starting helps to deal with varying compute resources. When input data pipelines are not balanced, e.g. due to an outage in a region, the model may develop a cultural bias towards the other regions. Google also leverages parallel training and then decides which output model to put into production. 🧐

Allocation of hardware resources (GPU, TPU): Google produces a new TPU version every year. Nevertheless, the cost of training grows at a higher rate than production resources. Currently, there is a lack of reliable multi-tenancy in parts of the training infrastructure (if I understood that right?). Models are tested with the same binaries as in production, but there are still canaries. Canaries shall ensure that the new model behaves similarly to the old model. A completely different behavior would indicate a problem. Models then get signed before they end up in production. That’s cool!

Models come with their own set of problems. For example, if a model features new labels, you cannot roll that back. The only way is to re-train the old model with the new labels. Not being able to roll back makes on-call life significantly harder, I assume. 🤨

War story: A particular demographic reacted differently to a new model compared to the old one. Fewer clicks (a loss of revenue) was the result. The issue was addressed by monitoring models and alerting in cases like this. I wish the speakers had gone into more detail here. Sounds very interesting.

According to the speakers, ethics in machine learning is the big elephant in the room. So SREs at Google are able to stop machine learning predictions when a model behaves unethically. Experts also call for independent oversight; the AI Now Institute, for example, does that. Running predictions in an ethical and fair way is very important to SRE. This means SREs must always be able to stop any prediction that wreaks havoc. Essentially, SREs must be root on any model. The fact that models are signed before they go into production hints at how important (or how advanced) ML models are at Google. 🤖

Fun fact: There is a YouTube model that is over 1TB in size. Woah! 🤯

From the Q&A: How to start with SRE for ML models?

  • Start with a very small model
  • Have the model spend a long time in canary
  • Have a data scientist ready when the model goes into production for the first time
  • Have a rollback plan that includes entirely removing the model from production

Link to the talk: Productionizing Machine-Learning Services: Lessons from Google SRE

How to Serve and Protect (with Client Isolation)

From the Google Maps SRE team, Frances Johnson reported on client isolation. Maps has a lot of customers, internal and external. The Google Assistant, for example, is an internal customer of Maps, while me using my phone counts as an external request. Unexpectedly, Maps had monoliths and overload situations. Something I am way too familiar with in my current job.

Goals of the client isolation initiative:

  • Clients should not be able to hurt others
  • Gracefully degrade the service in an overload situation
  • If you have to drop queries, be smart about which ones

Strategies they came up with:

  • Caching: Cached queries are cheap, adding caching is often easy, but not possible for all queries (billing, strict consistency requirements, …)
  • Quotas: Fun fact about quotas: Clients think a quota is a guarantee. Services think it is an upper limit. This can lead to over-subscription. They observed 7x over-subscription at some point in time.
  • Load Shedding: Not all traffic is created equal. Background and batch jobs are less critical than a waiting user’s request. Always drop the lower priority requests first.
  • Microservices: Not their largest problem, but they split up their monoliths.
  • Separate Stacks: Maximum isolation. Everyone gets their own. Doesn’t scale too well and produces quite a bit of toil.

Money Quotes:

  • If everything is the highest priority, then nothing is.
  • We put it there because we did not want to write another service and hand it over to SRE. (Teams on why they put more stuff into a monolith)
  • Can you just exempt my super-important client? (Users on client isolation)
  • A client big enough to ask for an exemption from client isolation is probably also big enough to damage your service. (So, the answer must always be: No exemption!)
  • Understand your queries and prioritize them accordingly!

Link to the talk: How to Serve and Protect (with Client Isolation)

A Tale of One Billion Time Series

Ruiyao Yao from Baidu talked about the monitoring systems in use at Baidu and the challenges they faced with Time Series Databases (TSDB). He started off with an example that I had a hard time following, but the money quote is: When you cannot reach www.baidu.com in China, your home network is broken!

If I understood Ruiyao correctly, Baidu is aiming for always up and that is the reason they invest so much in their monitoring. Monitoring data is furthermore used for capacity planning and troubleshooting. How much data are we talking about?

  • Millions of targets
  • 600+ metrics per target
  • 1B time series in total
  • 50TB written per day
  • 10M read/write requests per second
  • 40M data points in per second
  • 60M data points out per second
  • 50Gbps write and 100Gbps read

Writing time series is based on a log-structured merge tree. On top of that, they are using tables with a time to live (TTL) to expire data. There are data tables and metadata tables. Tags and indexes end up in the metadata table. The on-disk layout reminds me a bit of SSTables, but maybe I got the speaker wrong here. I had a hard time following the content at some points. Here is a slide with the layout.

I liked it once I understood why query latency is so important to Baidu: They run anomaly detection in real time based on metrics stored in the TSDB. For low-resolution data, such as trends over an hour, they pre-aggregate the data online in the TSDB, something they call multi-level down-sampling. Down-sampled data is smaller in volume and can be stored longer, another benefit of down-sampling. If that wasn’t enough, here is another optimization: Users can define key metrics, and the system also identifies so-called hot metrics. These metrics are then cached in Redis for even faster access.
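To illustrate the down-sampling idea, here is a minimal sketch of the general technique (my own illustration, not Baidu’s implementation): average consecutive buckets of high-resolution points into single low-resolution points.

#include <stdio.h>

// Down-sample a series by averaging fixed-size buckets of consecutive points.
// Returns the number of points written to out.
size_t downsample(const double *in, size_t n, size_t bucket, double *out) {
    size_t m = 0;
    for (size_t i = 0; i < n; i += bucket) {
        double sum = 0.0;
        size_t count = 0;
        for (size_t j = i; j < n && j < i + bucket; j++) {
            sum += in[j];
            count++;
        }
        out[m++] = sum / count;
    }
    return m;
}

int main(void) {
    // 12 samples at 5-second resolution become 2 samples at 30-second resolution
    double raw[12] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 };
    double low[2];
    size_t m = downsample(raw, 12, 6, low);
    for (size_t i = 0; i < m; i++) printf("%.1f ", low[i]);  // prints: 3.5 9.5
    printf("\n");
    return 0;
}

Each level stores fewer points, so longer retention becomes affordable; doing this for several bucket sizes would give the “multi-level” part.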

The on-disk data format typically requires many disk seeks. That was partly solved by compacting the files. As files grow too large, they are split and distributed to different systems, so reading can happen in parallel on multiple systems. Nevertheless, compactions are still expensive. Thanks to the JVM, additional fun is provided by stop-the-world events. (I have a very difficult relationship with the JVM 😇)

This talk was really interesting and dug deeper into fascinating engineering problems as it progressed.

Money Quotes:

  • There are latency-sensitive and latency-insensitive queries to the TSDB. Treat them differently to optimize for each type of query.
  • People like to query over all the hosts or the whole year.
  • Unfortunately, our TSDB stack uses Java as the main language. 😎

Link to the talk: A Tale of One Billion Time Series

Isolation without Containers

Tyler McMullen from Fastly shared his thoughts on isolation on bare metal. From the broader topic of general isolation, we quickly went via fault detection, isolation (from a control engineering perspective), and recovery to how processes are managed by the kernel. A process is basically memory (containing data and code) plus metadata in the kernel (think of the Linux task_struct). Unsurprisingly, containers are just processes with resource isolation applied via namespaces.

And here is why Fastly is interested in this: High performance systems with many small tenants and strict latency requirements may find VMs, containers, and even processes all far too heavyweight.

To achieve isolation without using all these technologies, one just has to make sure the control flow and the data are understood and cannot wreak havoc. Easy, right? Interestingly, WebAssembly meets these requirements, as it is bounds-checked and cannot load arbitrary libraries (contrary to dlopen()). Because we ran out of time, Tyler had to skip over the most interesting part of the talk. But essentially, Fastly found a way to compile multi-tenant WebAssembly code into a process that can run safely on bare metal. Wrap your head around this! I enjoyed this talk very much as it was more low-level than the other talks of that day. I love low-level talks.

Money Quotes:

  • Fault isolation is really about reducing the set of possible faults to a knowable, recoverable set.
  • Everything in the memory is basically the wild west. 🤠

Link to the talk: Isolation without Containers

My Summary

This is my personal summary which comes without further explanation. Think of this as a note to myself that accidentally went public:

  • Defined state wins over doing something.
  • One can build fast websites and apps for slow networks, but trade-offs must be made.
  • SRE’ing machine learning models is a whole field of its own. Highly interesting, I want to learn that! Comes with high responsibility, which is something I like.
  • Client isolation is important. Luckily there are plenty of techniques and strategies. But implementing only a single one is never enough.
  • Scaling a TSDB isn’t trivial but comes with a ton of interesting problems.
  • WebAssembly is awesome. And safely abusable. 😅

Also, this happened: I won a selfie drone at a raffle at the LinkedIn booth. Thank you so much, guys! I love it!