My Path to Site Reliability Management

Apr 2019

On my way to space I am currently taking a little stop to help organize the world’s information and do my part in making it universally accessible and useful. Since my rocket building talents are limited I devoted my energy to the wonderful challenge of Site Reliability Management (SRM). That is, empowering people on top of Site Reliability Engineering. Basically what I love combined with what I enjoy. Plus meetings.

Due to popular demand I decided to publish my lessons learned on the long journey from almost school drop-out to being a Site Reliability Manager at Google.

Am I even qualified? shall provide just enough context to make sense of the rest of the mini series’ articles. It helps the reader to understand my background when I started educating myself about SRE. It also answers the question that I asked myself the most: Am I really good enough to be an SRE or SRM?
I will quickly go over the importance of writing a compact but meaningful resume in Writing my Resume.
In Organization, Culture, and Me I describe how I learned about Google’s culture and values and how I figured out that it was the right organization for me.
I introduce a simple method for learning SRE topics in An Infinite Ocean of Knowledge. It’s about the iterative approach that I used to stay on track with an ever-growing curriculum of relevant topics.
For the typical SRM interview types I share examples from my individual curriculum:

Let the journey begin!

Am I Even Qualified? (My Path to SRM)

I wasn’t the smartest kid in school. I was either bored and making better use of my time in class or frustrated by things I did not understand. Even if I ignore my devastating track record of attempts at mastering the Latin language I was a below average student in terms of grades. During most, but not all, school years I made it to the next level.

Not meeting expectations: The teachers’ conference decided to issue an official report. It translates to: “He has a very superficial, headstrong work attitude to the extent that he cannot apply his knowledge constructively. Although he can work without supervision his written works are so messy, confusing, and incomplete that he sometimes cannot even read them himself. He regularly ignores the task so that the result cannot be accepted.”

In fact, I took a bit longer to finish school. Constantly choosing adventure over conventional education challenges I ended up taking a little detour on my way to university. In 2017, after serving over a decade in the Radio Signal Corps, I left the Army as an officer to pursue a career in the exciting realm of the Interwebz.

Veterans can be a healthy addition to a workforce that embraces servant leadership. On the downside, veterans are not necessarily top-notch tech people. Likely, we worked with ruggedized, battle-proven tech that is at least a couple of years behind the forefront of technology. Let me translate this to tech slang to drive home the point: When everyone was talking Docker and containers, the military world slowly adapted to the idea of virtualization. We just loved the chilling touch of bare metal.

Me and my bare metal 1 MBit/s directed “wifi” antenna with a range of 60 to 80 kilometers (subject to weather conditions).

About a year before my term ended I started to ramp up my self-education efforts to prepare for my civilian afterlife. The problem was that I had no idea where to start and what type of job exactly I was in the market for. Sure, I had the chance to get some exposure to the civilian world earlier in my career by providing side-job consultancy services on information security. And yes, when I later wrote a book about IPv6 networking I once again got in touch with the reality of the outside world. But other than that, I was mostly qualified to lead people, educate young adults, manage assets worth millions of Euros, and apply risk management to a myriad of unexpected situations. Furthermore, one of my top skills was planning, building, and operating wide area radio networks in unfortunate (read: war) circumstances. Some of my skills would be valued by potential employers while others would not score any points. In hindsight, joining the military wasn’t my greatest life choice.

The (un)importance of certificates

Having spent years in the astonishing bureaucracy that is the military I longed to get “certified”. Certifications go a long way in a government job and it never hurts your government career to be certified for something. Usually, the more certificates one has the better. I wasted a lot of money on certificates. Here’s a best-of:

ITIL Expert
- It’s basically common sense for big corp IT, but without user trust.
- When I added that one to LinkedIn I even got a few offers, mostly from boring companies but not The Boring Company. ba-dum-tss
ISO 27000 Information Security Officer
- Here’s the officer again! Nice title. It bored me to death.
Service Integration and Management (SIAM)
- I honestly don’t remember what this was about. Probably something with managing third party contractors and IT. Nevertheless, I’m certified.
Management of Risk (M_o_R)
- Yes, the underscores are part of the brand name. D’oh!
- This one may have been one of the more useful ones, though. A lucky pick!

There are many more! For added hilarity, I listed all my certificates on LinkedIn where they are still present today and lure recruiters. They also serve as a reminder to myself to never go down that road again. Today, I consider the majority of my certificates useless. Lesson learned.

SRE WTF?

While I was getting more qualified for even more boring jobs with high certification requirements it happened that the Site Reliability Engineering (SRE) book landed on my desk. I don’t remember exactly how that happened. My friend Fred worked at Nest in Palo Alto when they got acquired by Google. His Silicon Valley stories always raised my interest. I was particularly curious how tech companies run their production systems and how they apply risk management. Somewhere around that time I had been advised to buy the SRE book. I digested every chapter of the book. I was (and still am) inspired by the tech and practices described in the book. With SRE the industry found a quite good balance between human needs and operational demands. Its clear objective of not feeding blood to the machines and establishing a blameless culture is what turns a scary job into a positively challenging adventure with a guarantee for surprise adrenaline every now and then…

After reading the book and watching a dozen talks I was sure: I wanted to become a Site Reliability Engineer and, if possible, with added responsibility for a team or small organisation.

SRE is about scalability on many levels. First and foremost, on the human level. We run computers that run computers that run computers…

Getting my hands dirty

I entered “Site Reliability Engineer Munich” into my favorite search engine and applied to the first three jobs posted by companies that sounded cool and interesting. Eventually I joined eGym which has been on a success streak and needed help becoming stronger on the SRE side of things. This allowed me to educate myself about SRE, try out new approaches, and help the operations team to sell the idea to the rest of the engineering department.

Eventually the team would re-organize to become the SRE team while establishing a close relationship with a development team for shared services. On-call was staffed with developers and SREs alike and we slowly adopted some SRE practices while respecting the existing culture of autonomous teams running their own infrastructure and operations. I enjoyed my job very much, mostly because I believed (and still believe) in the product but also because I was able to move things forward.

And then I was approached by a Google recruiter and a Facebook recruiter within the same week. At first I declined to interview because I was happy with my job and also way too busy further establishing SRE within eGym. However, when I saw potential salary negotiations coming closer I felt like I should take the chance and evaluate my market value. What better way to get firsthand data than getting an actual offer?

The Outcome

After interviewing I got offered a Production Engineering role by Facebook in London and a Site Reliability Manager role by Google in Munich. This gave me a good starting point for the upcoming compensation discussion. Fun fact: When I got the Google offer call I had to duck into an abandoned storage room due to lack of privacy in the office.

Me nervously getting ready to answer an offer call in an abandoned storage room.

When faced with more options than I could have dreamed of I took a week of vacation to consider them and get consultation from my future wife. It was not an easy decision for me. I loved my job, most of my role, and most importantly I had a great, growing team. Eventually, I decided to stay a little longer with eGym to deliver upon an important migration project. After that, in January 2019, I joined Google Munich.

What does it mean to be qualified?

One important aspect I learned during the interviews is that skills and knowledge are two very different things. For sure, a job at Google asks for a lot of knowledge. But that knowledge is what I would call a healthy base. Knowing every detail of every technology that is related to my role would not have helped me much in the application and interviewing process. Once I achieved a certain (admittedly advanced) level of knowledge I had to demonstrate skills as well. Skills in this context means the application of knowledge, the ability to deal with occasional lack of knowledge, the constant striving for filling knowledge gaps and re-learning and questioning the status quo of technology. There is not a single day that goes by in an organisation as large and complex as Google where you do not have to acquire and apply new knowledge. Knowing and learning things is the bare minimum. Effectively applying knowledge in ambiguous situations is one of the major skills Google is looking for.

With that in mind, the answer is much easier: You are qualified when you have found a healthy balance between knowledge and skills. What qualifies one is not the education, not the background, not previous tech jobs, and also not a track record of brilliant achievements at the expense of other people. Collaboratively applying knowledge in situations that come with varying levels of uncertainty, are inherently complex, and often loaded with human factors is what sets one apart and qualifies one. So much for certificates…

Writing my Resume

I think the importance of a good resume during the overall application process is often underestimated. The resume will most likely be looked at by interviewers when they prepare the interview questions. It will be skimmed at times by one person and read in full by another. It is also part of the hiring package, which contains all application-relevant data for the Hiring Committee. The resume is the one document that I was in full control of with regard to structure, layout, and information selection. A good resume is an important data point but a bad resume would probably not be the only deal breaker in a negative hiring decision. Nevertheless, I wanted to get this one right.

The resume was the most important piece of data when I initially submitted my Google application.

I wanted my resume to give a good overview of what I have to offer. It should outline my skills and past impact but still fit on one page. I figured that one page may be short enough for the “skip readers” and at the same time provide enough room to present myself in detail where needed.

First I decided on sections that I wanted to include:

Employment and Education as those explain my background and how most of my skills are somehow connected to either previous employment or times of formal education.
Projects and Activities is the section under which I subsumed noteworthy community contributions. Like most tech companies Google expects community contributions for engineers of a certain seniority.
In the Skills section I chose to briefly list areas that were not clearly visible in the Employment section. Listing my most relevant skills also makes it easier for a reader who is short on time to get a quick overview.
Somehow I could not let go of the rather questionable Certifications I acquired. So they got their own little section.
To give readers a starting point for further research (and hopefully impress them) I also added a section that points to my admittedly short list of Publications.
The remaining blank space on the page looked weird so I added a Languages section. I also played with the option to have a Hobbies section there instead. Both would have been fine I guess. In the final version I went with Languages. 🤷‍♀

A mid-process draft of my resume. Not entirely bad but also not very good.

Writing the resume took weeks. It took me a few iterations to get it down to a single page. Every other day I would send a snapshot to a friend for review.

Meaningful impact

I found it particularly hard to write about myself and my achievements in short compact sentences. So I started with longer sentences and condensed them. Having a reviewer helped me to not overdo the compression. With every shortening I removed some context and occasionally too little remained for the sentence to make sense. I needed to demonstrate meaningful impact, for example by quantifying the outcome.

Example

Let’s meet Kim. Kim worked in the DevOps team of a mid-sized company running about 100 services with N+1 replication. Kim is an experienced infrastructure expert with a focus on Kubernetes. Kim embodies the DevOps mindset of working closely together with developers and removing barriers between silos whenever feasible. This is what Kim wrote in their resume:

Worked as Kubernetes expert in the DevOps team

This sentence is correct but it leaves out three important aspects:

Context: What was the scale of operations? What were the organisational, technical, or cultural challenges?
Impact: How was the world a better place thanks to that person’s work? What was measured and how did the measured entity improve?
Meaning: What drove and motivated Kim? What made them contribute to the “Why” of the organisation?

A better version would go like this:

Managed Kubernetes cluster in accordance with developer needs to optimize initial service development time

If historic data was available and they are allowed to share that data, Kim could write:

Designed a developer-friendly self-service solution cutting down initial service development time by on average 30% in the company’s 150-node Kubernetes cluster

The last version provides context, measured impact, and meaning. It clearly shows that Kim embodied the DevOps mindset, created a self-service that helped the overall organisation to become more efficient, and reduced non-development work for developers.

Needless to say, I wanted to look as successful as Kim in my resume!

Update: I have been asked if it is OK to make numbers up if one does not have metrics about their own work. I’d like to give the question back: Do you think it’s OK? Doesn’t it feel wrong? I wouldn’t want to end up in a situation where I have to explain why my numbers don’t add up as soon as someone digs a bit deeper. To be clear: My answer is No! Don’t make stuff up.

Playing the “fair game”

I struggled a bit with the Skills section. I knew that everything listed on a resume is considered fair game to ask in an interview. Interviewers like to go deep on the topics a candidate claims to know or have mastered. Given Google’s hiring track record chances are they have an expert interviewer for any possible topic. So it is in general safer to only claim skills that I was sufficiently proficient in. But I also wanted to show that I am constantly learning and exploring new topics. I decided to add a footnote to indicate skills that I did not master yet. I found this to be a good compromise.

Resumes and cultural background

A resume is a very personal document. It is a place where an individual describes their professional and educational past. Some items on a person’s resume may be close to their heart or play(ed) an important role in their identity. My cultural background is mostly European with a strong German touch. Most resumes I have seen from that cultural background are fairly accurate, sometimes on the edge of being an understatement.

Having been an interviewer myself for some time I have seen resumes of people from many different cultural backgrounds. When assessing those candidates’ capabilities I found interesting differences and similarities. Candidates with similar capabilities would describe themselves very differently in the resume depending on cultural background. It is common in some cultures to list each and every software or tool one has ever seen or read about. In other cultures people noticeably exaggerate their responsibilities and their role titles. Then there are cultures in which people barely list any accomplishments unless those were clearly superb and widely praised. Reading and evaluating resumes is an act of intercultural communication and requires awareness of diverse backgrounds. However, I’ve rarely seen people deliberately making false statements in a resume.

When writing my resume I was aware of the fact that it may be perceived very differently. I tried to find some middle ground erring on the side of accuracy and understatement.

Conclusion

An interesting learning was that resume writing is actually pretty hard. The more the review hurts, the better the resume gets. But I hated it when I was told how not great I managed to present myself at times. I still think it was worth it, but let’s be honest, writing the resume was not a fun activity!

In the end my resume didn’t even fully cover a single U.S. Letter or DIN A4 page. Perfect size!

Motivated me applied for three SRE roles. As far as I know this is the maximum number of roles that one can apply to at the same time. Better be safe than sorry! 🤪

The Organization, Culture, and Me

I believe that seriously interviewing with an organization should be a two-way endeavor. The organization has an interview process in place that is designed around gathering signal from the candidate. Most organizations have worked out criteria that enable success within the organization’s culture, adhere to the organization’s values, support the organization’s mission, and respect the broader environment and market the organization operates in.

As a candidate, on the other side, it wasn’t immediately clear to me which organization or company best matches my personal values. I wasn’t sure about the size and scale at which I would love to work or the culture I would feel most comfortable in. In short, I did not know which company was the right one for me. I sensed that not only mid-sized startups but also Google would be a good match. I had to get to know Google to find out. But how does one get to know Google? I can’t interview the organization and find out if it matches my personal values, or can I?

The Goomics book, an unexpected deep dive into Google’s culture as seen by an engineer.

I have a question…

If I had the chance to interview Google my questions would be along these lines:

What are Google’s values? What is the mission? Is it any good? Do they really mean it? How has the company dealt with failure in adhering to its own values? Did they get back on track? Are reliable corrective forces at work? Is there a system of checks and balances in place?
What is the culture like? Will I be able to thrive there? Will working in this culture be fun or will it result in negative stress?
What is the scale at which Google operates? Will I be a part of building something great or will I end up doing something unfulfilling in some giant machinery?
How are decisions made? To what extent would I be able to influence decisions?
What is the state of technology at Google? Is the company drowning in complexity and tech debt or are they doing most things right?
What excites me so much that I would consider giving up my job and role which I like very much?
What does managing look like at Google? Will I be getting an army of Minions or do I have to earn respect and build a fellowship? How is authority established and are people empowered to challenge authority to move everyone forward?
How is it different from my current job? Will I gain autonomy or will I lose some? Will someone watch me moving tickets from left to right on a board or would I be able to define and execute my own projects?
Finally: Are there any open positions?

Answers

Luckily, Google is a mostly open company with the exception of scale and assigned headcount. It is almost impossible to get concrete numbers on the actual scale of things or how many engineers are assigned to a certain product or technology. All I knew was that Google has many computers and smart engineers working on the forefront of technology.

Most other topics, however, are not so secret. Most of them are documented in books, blog posts, and conference talks. Especially in the realm of SRE, Google is very open: it regularly shares technology advancements and lessons learned, and advocates for operational best practices. The must-read to understand the cultural aspects of SRE at Google is Site Reliability Engineering - How Google runs production systems by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy.

Younger (and arguably cooler) me visiting The Googleplex (Mountain View) in 2013. Five years later I would apply for a job with the company.

Google scientifically researched how management is helping (or not helping) the company to achieve its greater goals. For aspiring managers who think about joining Google: The results are publicly available under the name re:work.

Another area that the company is particularly open about is what it calls PeopleOps (read: Human Resources). Most of my questions regarding the workplace culture and processes were answered in Work Rules by Laszlo Bock. The book also goes into great detail regarding the influence employees have in identifying problems and demanding improved working conditions from management via internal surveys. The results are shared widely within the company and encourage leaders to address issues in a timely manner.

I furthermore learned How Google Works from Eric Schmidt and Jonathan Rosenberg. I got an entertaining, sometimes thrilling history lesson from reading In The Plex by Steven Levy.

The last one was such an addictive read that I more than once missed my station when reading in the subway. I had to walk the extra mile…

Summary

To minimize surprise I passively interviewed Google as thoroughly as they would later interview me. They passed. 😂 Fair game!

An Infinite Ocean of Knowledge

Overwhelmed by the sheer amount of books, articles, and papers to read, a myriad of talks and videos to watch, and an ever-increasing number of things to learn I almost gave up. I was frustrated by the chaos that I had created out of open tabs, bookmarks, handwritten notes, and stacks of books on my desk. Instead of motivating and inspiring me the massive amount of available data had a chilling effect on me. It was time for reflection!

My growing island of knowledge

I figured that there were three components in my learning equation. The seemingly infinite ocean of available knowledge which is unknown to me. These were my “unknown unknowns”. Then there was my starting point of things I knew, my “knowledge”. Let’s think of knowledge as an island somewhere in the ocean of unknowns. The coastline of this island is where I knew there was more stuff to learn. Knowing that there are so many things I did not know about was intimidating and demotivating.

Let me give you an example: I knew that multiple CPU architectures existed. I was quite familiar with an older Infineon chip and the x86 architecture. This was all comfortably sitting on my island of knowledge. Closer to the coastline were some facts I remembered about ARMv7. Something close to my island but I would need to extend the coastline a little bit to claim that piece of land from the ocean of unknown. I also had a bookmark about OpenRISC, so while this architecture was pretty unclear to me I knew where to start to make it part of my island as well.

The interesting part is what happened when I rapidly grew my island of knowledge. The more I learned the less educated I felt. I learned about a new technology, a different approach to a well-understood problem, a crazy programming language leveraging unconventional concepts.

Drawing of an ocean with a larger island

This all grew my island but at the same time made my coastline longer. Being aware of even more “unknowns” made me feel stupid. Everything on my coastline seemed interesting, relevant, and a must-learn before interviewing. Although I was moving forward towards my goal it felt like moving away from it faster and faster. It was time to break the vicious circle! I was in need of a more structured approach.

A method to structure the curriculum

The first thing I did was to acknowledge that I will never be able to learn all the things. I would discover more interesting topics than I would ever be able to focus on. My only way out of that dilemma was to prioritize. It became clear to me that I learned most about a fresh topic when applying the new knowledge to a problem. Discovery along the way was a wonderful yet dangerous issue. I wanted to avoid going too deep down the rabbit hole and thus ignoring other important topics. On the other hand, discovering new topics and fresh aspects is an important part of learning and curriculum building. My solution to this was that I allowed myself to quickly dive into an interesting topic as it came up. After about five minutes I would stop investigating further and add the topic or link at the end of my curriculum. My curriculum was growing every day.

Once a week I would go over the curriculum and remove the topics that I sufficiently worked on. The rest would be subject to rigorous prioritization. I would match the topics against the role description to find out which were the most important to work on. I also tried to strike a balance between the different interview types I expected. It’s no use becoming a coding expert but missing out on troubleshooting topics. That said, between the first recruiter screening call and the phone interviews I gave more attention to coding and management. Mostly because I knew these two topics would be part of the first round of interviews.

Discover - Prioritize - Learn - Apply: After initially discovering topics I re-prioritized my curriculum weekly while it was fed with new items during the learning and application phases.

Prioritization took some time but it was absolutely worth it! Often the topics that I felt the least comfortable with made it to the top of the list. Without prioritization they would probably have been skipped over time and again. There would have been a blank spot in my knowledge. Or, if you like to stay with the metaphor, a lake within my island.

A worked example

Let me share a concrete example of how I applied the four phases to my coding interview curriculum. Prepare for some lightweight graph theory. 🙄

Initial discovery

For the initial discovery phase I browsed websites full of coding questions and tried to find out which types of questions would repeat often. I also read Computer Science Distilled by Wladston Ferreira Filho. I liked that book because it is not offering much in-depth discussion. This made it perfect to get an overview and occasionally find a knowledge gap.

Three books on a surface. — High-level books like “Computer Science Distilled” or the famous SRE book are full of keywords and pointers to topics that are worth looking into.

For example, I could not remember ever having heard of the knapsack problem. The knapsack problem consequently made it onto my list. I also figured out that for most problem classes and algorithms I never went beyond reading pseudo code implementations. Implementing at least a few of those myself, like breadth-first and depth-first searching on a graph, seemed important and resulted in further items for my growing list.
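
As an illustration of what such a practice implementation might look like, here is a minimal Python sketch of breadth-first and depth-first search on a small undirected graph. The example graph and the names `GRAPH`, `bfs`, and `dfs` are hypothetical and chosen only for this sketch; they are not taken from any of the books or sites mentioned in this article.

```python
from collections import deque

# Hypothetical undirected example graph, stored as an adjacency list.
GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def bfs(graph, start):
    """Return the nodes reachable from `start` in breadth-first order."""
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


def dfs(graph, start):
    """Return the nodes reachable from `start` in depth-first order.

    Uses an explicit stack instead of recursion.
    """
    order = []
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push neighbors in reverse so the first-listed neighbor is explored first.
        for neighbor in reversed(graph[node]):
            if neighbor not in seen:
                stack.append(neighbor)
    return order


print(bfs(GRAPH, "A"))  # ['A', 'B', 'C', 'D', 'E']
print(dfs(GRAPH, "A"))  # ['A', 'B', 'D', 'C', 'E']
```

The only structural difference between the two is the container: a queue (first in, first out) gives breadth-first order, a stack (last in, first out) gives depth-first order. That is a tradeoff worth being able to explain on a whiteboard.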

Prioritization

After a couple of days of discovering about a hundred interesting topics for ramping up my coding learnings I went through the prioritization phase. I reserved time in my calendar for learning sessions for the following week. That was my upper resource limit. Then I had to fill the available time with the most valuable topics first. A topic was valuable if it was something that would provide me an advantage in an interview setting. That is, going into the implementation details of partition schemes for quicksort is less valuable than knowing the tradeoffs of common sorting algorithms. I did allow a bit of time at the end as buffer. Filling my calendar slots starting with the most valuable topics first was, by the way, a knapsack problem. See what I did here?
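
To make the knapsack analogy concrete, here is a minimal sketch of the textbook 0/1 knapsack dynamic program applied to a weekly learning plan. This is an illustration only: the topic names, durations, and value scores in `EXAMPLE_TOPICS` are invented, and the function name `plan_week` is hypothetical. Picking "the most valuable topics first" by hand is a greedy heuristic; the dynamic program below instead searches for the best combination that fits the time budget.

```python
def plan_week(topics, available_minutes):
    """Pick the most valuable set of topics that fits into the time budget.

    `topics` is a list of (name, minutes, value) tuples.
    Returns a tuple (total_value, chosen_names).
    """
    n = len(topics)
    # best[i][t] = highest value achievable with the first i topics in t minutes
    best = [[0] * (available_minutes + 1) for _ in range(n + 1)]
    for i, (_, minutes, value) in enumerate(topics, start=1):
        for t in range(available_minutes + 1):
            best[i][t] = best[i - 1][t]  # option 1: skip this topic
            if minutes <= t:             # option 2: schedule it, if it fits
                best[i][t] = max(best[i][t], best[i - 1][t - minutes] + value)

    # Walk backwards through the table to recover which topics were chosen.
    chosen = []
    t = available_minutes
    for i in range(n, 0, -1):
        if best[i][t] != best[i - 1][t]:
            name, minutes, _ = topics[i - 1]
            chosen.append(name)
            t -= minutes
    chosen.reverse()
    return best[n][available_minutes], chosen


# Invented example data: (topic, minutes needed, subjective interview value).
EXAMPLE_TOPICS = [
    ("sorting algorithm tradeoffs", 60, 8),
    ("quicksort partition schemes", 90, 3),
    ("graph traversal", 120, 9),
    ("knapsack problem", 60, 5),
]

value, plan = plan_week(EXAMPLE_TOPICS, available_minutes=240)
print(value)  # 22
print(plan)   # ['sorting algorithm tradeoffs', 'graph traversal', 'knapsack problem']
```

With a budget of 240 minutes the low-value quicksort details are left out, which matches the reasoning in the paragraph above. Any buffer time would simply be subtracted from `available_minutes` before planning.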

Learning

Most of the basic algorithms and problem solving techniques are discussed in the book Cracking the Coding Interview by Gayle Laakmann McDowell. Some of the less popular details, e.g. the aforementioned partition schemes of quicksort, are documented on Wikipedia. The online coding platform HackerRank is also a great place to learn about a particular topic. The neat thing about online coding platforms is that they often come with challenging tests and edge cases which are run by an automated grading system. I found that feedback to be very useful in determining whether I still had a rough or already a good understanding of a topic or algorithm. Another great site for getting lost in interesting implementation discussions is Geeks for Geeks.

Most of the mentioned resources have curated courses especially for interview training for those who chose not to compile their own weekly curriculum.

Application

When I applied my learnings to a path finding task I came across the topic of topological sorting in directed graphs. An unexpected discovery that I added to the list for the next prioritization round. Eventually I would find a suitable time slot for it and learn and implement topological sort. Every time I applied my learnings to a problem my awareness and understanding of the underlying topic grew. Over time I became more confident in picking the right approach to a coding problem and was quite comfortable talking about the tradeoffs of competing approaches.
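
For readers who have not met topological sorting yet, here is a minimal sketch using Kahn's algorithm, one of the standard approaches. It is not necessarily the variant I implemented back then. The function name `topological_sort` and the example dependency graph are hypothetical and only serve as an illustration.

```python
from collections import deque


def topological_sort(graph):
    """Order the nodes of a directed graph so every edge points forward.

    `graph` maps each node to the list of nodes it points to.
    Raises ValueError if the graph contains a cycle.
    """
    # Count incoming edges for every node, including nodes that only
    # appear as a target and never as a key.
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for succ in successors:
            indegree[succ] = indegree.get(succ, 0) + 1

    # Nodes without incoming edges have no unmet prerequisites.
    ready = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for succ in graph.get(node, []):
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    if len(order) != len(indegree):
        raise ValueError("graph has a cycle; no topological order exists")
    return order


# Invented example: an edge "X -> Y" means "learn X before Y".
PREREQUISITES = {
    "graphs": ["bfs", "dfs"],
    "bfs": ["shortest path"],
    "dfs": ["topological sort"],
    "shortest path": [],
    "topological sort": [],
}

print(topological_sort(PREREQUISITES))
# ['graphs', 'bfs', 'dfs', 'shortest path', 'topological sort']
```

A depth-first variant exists as well; being able to name both and explain when a cycle makes the ordering impossible is the kind of tradeoff discussion that comes up in interviews.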

Since the overall goal is providing signal in an interview setting I had to throw myself into that situation as realistically as possible. For starters, I bought a huge whiteboard and solved interview questions from books and online forums on the whiteboard. I had a timer running and mumbled to myself as if I were communicating with an interviewer. That felt awkward in the beginning, especially when my girlfriend was in the same room. After a while we both got used to me mumbling and techno-babbling. A small confidence boost!

Being whiteboard-interviewed by expert interviewer Polarbärli on breadth-first searching an undirected graph.

Another great way of applying knowledge was to have mock interviews. Since I had no one who could mock interview me I paid for online mock interview sessions. There are plenty of providers out there and not all of them are worth the money. For example, some just connect two learners with each other. This is missing the point of applying the knowledge with a trained interviewer present. While I don’t feel comfortable recommending any provider, I have to say that I have received excellent service and was asked unique questions at TechMockInterview. Your mileage may vary.

Summary

Preparing for Google interviews can be a challenging task due to the many topics that must be covered. Disciplined prioritization and thoughtful planning helped me grow my island of knowledge without jeopardizing my morale as I discovered that there is so much that I didn’t know. Getting out of the comfort zone and applying my learnings in an interview-like setting was a crucial step. Only knowledge that has been applied and can be communicated is likely to create a positive signal during an interview.

Phone Screening Preparation (My Path to SRM)

According to staff.com Google receives 2 million applications per year. An average year has about 260 working days. That leaves us with over 7500 applications per day. Getting and maintaining recruiter attention is crucial to make it through to the next stage.

If a resume catches the recruiter’s attention they will usually arrange a phone call. The phone call is mostly to get in touch and discuss the process in general. The other function of the first phone call is to discuss the timeline for future interviews, e.g. phone interview(s) and on-site interviews. I was able to share my idea of the timeline and synchronize it with the recruiter. I asked for a slowdown of the process since I felt I needed a bit more time to prepare before heading into the first interview.

Phone screening calls often end with a few minutes of quiz-like questions. I have personally experienced this when I interviewed with Google and Facebook. At eGym I regularly sat down with Human Resources to go over our own little quiz and update the questions we asked. I have heard from other companies that they do phone screening quizzes as well.

No interview should be treated as “informal” […]. Screening starts from the first call.

Sangeeta Narayan, former Executive Recruiter at Google

The purpose of phone screening questions is to help recruiters identify candidates who have a higher chance of performing well in later stages of the interview process. Often there is a short answer that requires little or no explanation from either side. The questions are not designed to spark a discussion but to provide a quick overview of a candidate’s strong and weak areas. Many questions are simple knowledge questions. That is great news! It means one can learn the underlying topics and prepare for the phone screening without having to go through intensive phases of knowledge application. In the phone screening stage knowledge scores more points than experience and skills. This changes at later stages, obviously.

I had a slight advantage because I worked on screening questions in my day job. I also managed to find examples of questions asked at various companies through googling, often on pages 3+. Apparently, there is more than a single page of results. 🤯

Here are some questions I found on the web:

What’s a priority queue?
Can DNS use TCP? In which cases does DNS use TCP?
How does the three-way handshake work in TCP?
What’s void *?
What’s the system call for creating files?
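A spoken answer to the first question is usually one or two sentences, but writing the structure down once helps to make the answer stick. The following is a minimal sketch built on Python's heapq module; the class name and its methods are assumptions made for this example.

```python
import heapq


class PriorityQueue:
    """Minimal priority queue: the item with the smallest priority value is served first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so items with equal priority keep insertion order

    def push(self, item, priority):
        # heapq keeps the smallest tuple at index 0; push is O(log n).
        heapq.heappush(self._heap, (priority, self._counter, item))
        self._counter += 1

    def pop(self):
        # Removes and returns the item with the smallest priority; O(log n).
        _priority, _count, item = heapq.heappop(self._heap)
        return item

    def __len__(self):
        return len(self._heap)
```

The short version for the phone: a priority queue hands out elements by priority instead of arrival order, and a binary heap gives logarithmic insert and remove.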

Big-O Complexity Chart by bigocheatsheet.com

I went through several blog posts and Quora answers to identify the broader areas that are covered in the screening quizzes:

System calls, most importantly the popular ones like fork(), creat(), open(), read(), write(), exec(), stat(), but sometimes also less popular ones like sbrk() or mmap().
General memory management, e.g. what does malloc() do internally and how to re-allocate and free memory.
File system internals, like the role of inode or dentry.
The lifecycle and the state of processes.
Possibly a bit of scheduling and context switching but less likely the nitty gritty details in that area.
Signaling and some details on how to send and catch signals. Which signals there are and if a particular signal is even visible to a process or is being handled at kernel level.
All sorts 😉 of sorting algorithms and their complexity. Stuff that we can find on the Big-O Cheat Sheet mostly.
Networking basics, e.g. address and important field properties of popular protocols (Ethernet, IPv4, IPv6, TCP, UDP)
IPv4 and IPv6 subnetting
Possibly some shooting from the hip at well-known problems. For example, the question could be There’s an array of 10,000 16-bit values, how do you count the bits most efficiently? and a good answer is to use an 8-bit lookup table or Kernighan’s algorithm.
The basics of object oriented programming languages.
What differentiates (statically) typed from untyped languages and how to cast types.
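To make the bit-counting answers from the list above concrete, here is a sketch of both approaches in Python. The function names are invented for this example, and "count the bits" is read as counting the set bits of every value.

```python
# 8-bit lookup table: BITS_IN_BYTE[b] is the number of set bits in the byte value b.
BITS_IN_BYTE = [bin(value).count("1") for value in range(256)]


def popcount_lookup(value):
    """Count the set bits of a 16-bit value with two table lookups."""
    return BITS_IN_BYTE[value & 0xFF] + BITS_IN_BYTE[(value >> 8) & 0xFF]


def popcount_kernighan(value):
    """Count set bits by clearing the lowest set bit until none is left."""
    count = 0
    while value:
        value &= value - 1  # clears exactly one bit: the lowest set one
        count += 1
    return count


def count_bits(values, popcount=popcount_lookup):
    """Total number of set bits across all values."""
    return sum(popcount(value) for value in values)
```

The lookup table costs 256 entries of memory and a constant two lookups per value. Kernighan's loop needs no table and runs once per set bit, which is attractive when the values are sparse.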

I collected as many questions as I could find and even came up with my own questions. Then I answered them on a sheet of paper and read through it a couple of times in the days preceding the recruiter call. During the call I was able to answer most of the questions quickly and I believe a good chunk of them also correctly. Although the questions were different from my prepared questions it helped a lot that I researched the answers and brushed up on basics.

Perspective

If you have just read the above list of topics and feel intimidated 😱 by it, let me add some calibration here. I was interviewed for a senior role and clearly stated my expectations of the level of the role to the recruiter. I believe that the questions I got screened with were appropriate for the level I was looking for. I was expected to have more breadth and more depth than a junior engineer. I claimed 15+ years of working more or less closely with communication or Internet technology. I would expect everyone to get screened at a level that takes into account the expertise and seniority stated in the resume. Another reason why lying in the resume just doesn’t pay. You’ll be fine!

Coding Interview Preparation

At first I was overwhelmed by the sheer number of topics and learning opportunities, from free videos on YouTube with varying quality to paid courses offering a thoughtful curriculum. Out of the many structured and well-designed learning paths towards coding interviews I had to choose one or two. The secret was to follow one or maybe two of them and not waste time jumping back and forth between tens of different curricula. Sometimes it is better to trust that the author put some serious thought into compiling a suitable learning path.

Choosing a curriculum

I wasn’t in a hurry. So I chose to start with a quite thick book titled Cracking the Coding Interview (CTCI) by Gayle Laakmann McDowell. The book contains about 150 coding questions divided into different topics and difficulty levels. The programming language used in the solutions section is Java unless a different language is explicitly stated. Since I wanted to get better at Go I solved the coding questions using Go. Almost all questions could be solved with Go. I’d argue that this often produced cleaner, easier-to-read code. But that’s just me being opinionated. 😉 For what it’s worth, others have gone there before and published their solutions for Go. There is also a semi-official repository containing solutions for all sorts of languages.

Python source code solving The Towers of Hanoi — I found it much easier to visualize a solution on a whiteboard than in a code editor.
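The code from the picture is not reproduced here. As a stand-in, a typical recursive solution to the Towers of Hanoi might look like the following sketch; the function name, the peg labels, and the choice to collect moves in a list are assumptions made for this example.

```python
def hanoi(disks, source, target, spare, moves=None):
    """Return the list of (from_peg, to_peg) moves that transfers all disks."""
    if moves is None:
        moves = []
    if disks == 0:
        return moves
    # 1. Park the smaller disks on the spare peg.
    hanoi(disks - 1, source, spare, target, moves)
    # 2. Move the largest remaining disk to its destination.
    moves.append((source, target))
    # 3. Bring the smaller disks from the spare peg on top of it.
    hanoi(disks - 1, spare, target, source, moves)
    return moves
```

Calling hanoi(3, "A", "C", "B") yields seven moves, in line with the 2^n - 1 moves the puzzle needs for n disks. Drawing the three recursive steps on a whiteboard first makes the argument order in the two recursive calls much easier to get right.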

Train as you fight!

“Train as you fight,” we used to say in the army, meaning that the more realistic the training the higher the chances of success in a real situation. Trusting that simplified sayings hold true I ordered myself a whiteboard and board markers. The whiteboard became my center of learning. Every time I finished reading a chapter in CTCI I went to the coding questions for that chapter and solved them on the whiteboard. I set myself a timer for 40 minutes for each question, just like in a real interview. I talked out loud while solving the question as if there were an interviewer in the room. Sometimes it was hard to rephrase a problem or explain my approach in English. However, I think I figured it out over time and also improved my technical English skills along the way. More often than not I was stuck with a question. Since there was no real interviewer in the room I had to use the hints in the book to bootstrap myself out of the misery. CTCI has really nice hints for each question. The hints start broad and get narrower, with each further hint releasing more information about the problem. My goal was to use as few hints as possible. The author arranged the hints randomly on different pages. This means that looking up a hint did not lead to accidentally peeking at further hints for the same question. I liked how much thought went into that book! 👍

In the beginning I was unable to finish even the easy and moderate questions in time. Don’t get me started on the hard ones… When I hit the time limit I noted down how far into the solution I made it (for example by taking a picture of the whiteboard). Then I continued solving the question. Later, when I finally gave up on the problem or thought I had a decent solution I compared that to the expected solution. Over time I made it further and further into the solutions, eventually finishing questions on time. 🙋🎉

It was very important for me to raise awareness of my progress to keep pushing and not get frustrated. By measuring time and how far I came with each question I was able to maintain the momentum over weeks.

Two arrays of integers and some code made out of while loops. — I sometimes had to redo some exercises because my initial solution wasn’t good enough.

If you don’t have a whiteboard and prefer not to get one, I advise coding on paper or in a plain text editor (no syntax highlighting, no auto-correction). Nowadays Google often offers coding on a Chromebook in an on-site interview. As far as I know candidates can choose whether they prefer a whiteboard or a Chromebook. I heard stories where people prepared on a keyboard only to find out that there were no Chromebooks available at the site where they interviewed. I’d say preparing on a keyboard only comes with a small risk while preparing on a whiteboard or on paper is usually a safe bet.

Testing

I did not have a learning buddy. Therefore I had to pay people to mock-interview me. If you happen to know someone who is also preparing for coding interviews I highly recommend pairing up and mock-interviewing each other about once a week. Experiencing an interview from the interviewer’s point of view is a great learning experience. It improves one’s own skills and helps one get better at providing useful signal. Having an interview once a week furthermore reduces the uncertainty of the interview situation in general. After a while one gets used to the circumstances and it becomes easier to focus on solving the interview question. Being interviewed is stressful enough in itself. There is no need to add extra adrenalin by experiencing this situation for the first time when it counts. Ideally it becomes routine and almost boring to be interviewed.

Resources

Curricula and quizzes:

Cracking the Coding Interview book
Big-O Quiz
If you are coming from JavaScript land you may find a good start in Lydia Hallie’s introduction to algorithms and data structures.

Sources of problems, questions, and exercises:

HackerRank is full of interesting coding questions. Most of them have test cases. An automated grading system provides instant feedback, which is tempting but misses the point of an interview where little or no feedback is given.
Project Euler features mostly math-related questions. They tend to get tougher over time.
The CareerCup forums host an ever-growing collection of coding interview questions. Here are a few for Google SRE.
Geeks for Geeks never let me down when it came to finding interesting problems to solve.

Technical deep dives and algorithms explained:

Occasionally, the Introduction to Algorithms may come in handy. Famously abbreviated CLRS after its authors’ initials.
I liked HackerRank’s Data Structures and Algorithms playlists on YouTube.
The Computerphile Channel features advanced topics. They are often buried under other less advanced but also interesting videos. I particularly liked the Paxos video and the A* algorithm explanation.

Troubleshooting Interview Preparation

I found the troubleshooting interview preparation to be one of the more fuzzy ones. The first question I asked myself was: What is Troubleshooting anyway?

Cleverly combining talking to my recruiter with sophisticated Intawebz research I came up with the following definition:

Troubleshooting in the context of interviewing is the ability to approach problem solving in an educated, logical, and structured way. It requires communicating by sharing thoughts and ideas with the interviewer while working through a (potentially networked) distributed system scenario.

I have been running my own servers for a long time. Only recently I reduced my Internet footprint and handed some tasks over to grown-ups such as Gmail and Netlify.

For many, sometimes painful, often exciting and always educational years I ran an Autonomous System, mail servers, web servers, version control, build pipelines, and of course configuration management. That inevitably left some scars but also contributed to my troubleshooting skills. However, my motivation was usually operational. I wanted to get something fixed and learned just as much of the basics as I needed to get the job done. After all, this was all a hobby project and I had only so much time. Therefore, enhancing my skills was advised. I figured I needed to do a couple of things to really call myself a troubleshooter:

Deepen my understanding of how computers work in general.
Revisit how hardware and the operating system interact in certain constellations.
Refresh a little on what makes a modern operating system. For example, update my knowledge on memory management, device management, scheduling and processes, synchronization, file systems and network drivers.
Broaden my portfolio of tools used to inspect metrics and be able to reason about the metrics they show me. What does load average mean, anyway?
Knowing more about common bottlenecks and how to verify them. For example, Is this slow process disk IO-bound or is it suffering from memory pressure effects? or Which one of these processes is the cause for frequent context switching and how could it become a better behaving citizen?
Refresh a little on some networking edge cases, such as TCP window size problems or what the buffer queues look like during retransmit. I would call computer networking one of my stronger skills and spent almost no time on refreshing those skills. If I were new to computer networking I would read the TCP/IP Illustrated books by William Richard Stevens and Computer Networks by Andrew S. Tanenbaum.
I needed to learn more from my own and others’ failures. That is, reading post mortems, going through reports on user mailing lists for distributed systems open source projects, and watching talks from conferences where folks showed off how they overcame bottlenecks and performance issues.
Whenever possible I needed to practice troubleshooting. Pairing with others in troubleshooting can be very educational. I have learned about new tools and different approaches by watching over the shoulders of experienced troubleshooters.
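The load average question from the list above makes a nice small exercise. On Linux the load average is an exponentially damped average of the number of tasks that are runnable or in uninterruptible sleep (typically waiting for disk I/O), reported over 1, 5, and 15 minutes. The sketch below puts the three numbers into relation to the CPU count; the function name, the returned keys, and the 1.0 threshold are assumptions chosen for illustration.

```python
def describe_load(load_averages, cpu_count):
    """Relate the 1, 5 and 15 minute load averages to the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    one, five, fifteen = (load / cpu_count for load in load_averages)
    return {
        # Load per CPU: values above 1.0 mean more waiting tasks than CPUs.
        "per_cpu": (one, five, fifteen),
        "more_tasks_than_cpus": one > 1.0,
        # Comparing the short and the long window hints at the direction.
        "trend": "rising" if one > fifteen else "steady or falling",
    }
```

On a Unix machine the inputs can come from os.getloadavg() and os.cpu_count() in the standard library. Because uninterruptible sleep counts towards the number on Linux, a high value alone does not say whether the machine is CPU-bound or waiting for I/O, so it is a starting point for further inspection rather than a diagnosis.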

Bite Size Networking by Julia Evans is a comic-style approach to learning about troubleshooting tools.

Troubleshooting

As so often when facing a new challenge the main question was: Where to start? Luckily the Internet had me covered. I compiled a list of related websites, tweets, mailing list archives, source code, and videos. They are entry ways to deep rabbit holes of knowledge often containing fascinating distractions in the form of useless niche knowledge. Great discipline was necessary to stay on track and not get trapped in an infinite loop of learning. I managed to pull the brakes most of the time. Here are some of the rabbit holes that I went down into:

Julia Evans publishes short comics about useful system administration tools. The comics are now also available as books.
Brendan Gregg’s Linux performance page was a good starting point to learn more about operating system telemetry and analysis tools. He delivered plenty of talks on system performance such as Linux Performance Tools (Part 1) and Linux Performance Tools (Part 2) which I enjoyed.
Occasionally interesting posts related to troubleshooting would pop up on company blogs. I particularly remember Linux Kernel Bug Hunting by booking.com.
The free Advanced Operating Systems self-learning course by the Georgia Institute of Technology is something I briefly looked into.
I made sure I was not missing an important piece of the puzzle by watching videos from Brian Will’s Hardware Basics YouTube playlist and skipping through this video about Operating System Basics.
I loved the Computer Systems Engineering lectures by the Massachusetts Institute of Technology. I watched the whole course three times and every time I learned something new. The value is not only in the lecture content, but in pausing the lectures every now and then and starting to think about the underlying problems.
There’s been some research on How complex systems fail.
My recruiter recommended that I take a look at Life in App Engine production to learn more about troubleshooting real life systems.

UNIX and Linux Internals

On top of the technical skills, the Site Reliability Manager interview track has to test for management and leadership capabilities. Three out of my seven interviews were in the management realm leaving only four interview slots to assess me technically. Maybe that is the reason why my troubleshooting interview also dug into UNIX and Linux internals. I heard that for some roles in SRE a dedicated UNIX and Linux system internals interview is scheduled.

Since it had been a while since I had read or written kernel code I was in for a refresh. Here are my starting points as well as I can remember:

I read the book The Unix Programming Environment by Brian W. Kernighan and Rob Pike. It’s a fascinating mix of UNIX history and solid explanation of the underlying ideas and principles. Even if it sounds boring here and there to the experienced administrator, there are hidden pearls in every chapter.
The Linux Programming Interface by Michael Kerrisk is a thick book. I have only looked at some chapters since it is so overwhelming. However, the author offers multi-day training courses on the topic. I have considered taking one but we were not able to find a suitable date. I am keeping an eye on it since I still want to take that course.
Unbeknownst to some, the Linux kernel comes with its own documentation. It sits right next to the source code awaiting the interested reader. For example: spinlocks.rst
I found browsable source and clickable identifiers helpful in skimming through some kernel code. I remember looking at inodes and the ext4 superblock. Why don’t you try to find out more about the difference between fork() and clone() by finding these syscalls in the source code of Linux v5.1?
I watched some videos from YouTube playlists UNIX terminals and shells and UNIX system calls to check that I wasn’t missing a crucial concept.

TCP handshake diagram — Thinking about TCP retransmits wasn’t an every-day activity for me and needed a quick refresh as well.

Conclusion

My troubleshooting interview went reasonably well. I think I could have found the problem faster if I had taken a more structured approach, but eventually I found the problem and carefully mitigated it without harming production processes which ran on the same (fictional) machine.

As a closing note, may I repeat myself in saying that interviewing is first and foremost about communication. This applies even more so to the troubleshooting interview, which is particularly communication heavy. The technical skills are a must-have but they alone will not be sufficient.

Non-abstract Large System Design Interview Preparation

Unfortunately, Non-abstract Large System Design (NALSD) is a lesser known concept to most of the engineering world. When I started preparing for the NALSD interview I found resources on pure NALSD to be rare. Therefore, I primarily used the traditional system design wisdom to prepare for this type of interview. I found tons of material on traditional system design and respective interview questions including recorded mock interview sessions. I used those resources to get a better understanding of how distributed systems are designed nowadays. Upon that I built an understanding for NALSD by attaching realistic numbers to the abstract designs I read about.

NALSD describes a skill critical to SRE: the ability to assess, design, and evaluate large systems. Practically, NALSD combines elements of capacity planning, component isolation, and graceful system degradation that are crucial to highly available production systems. Google SREs are expected to be able to start resource planning with a basic whiteboard diagram of a system, think through the various scaling and failure domains, and focus their design into a concrete proposal for resources.

The Site Reliability Workbook

Databases and services in a diagram drawing. — Once again expert interviewer Polarbärli had to critically look at my questionable drawings.

I highly recommend ingesting the excellent chapter on NALSD from the Site Reliability Workbook. Reading the chapter repeatedly and running the numbers myself on a whiteboard helped me understand how NALSD is different from traditional system design. I thought of some questions myself and tried to run them through the same schema. Those included:

Design an image sharing service like Imgur and come up with a bill of materials for serving 50,000 queries per second (QPS).
Design a log ingestion service like Stackdriver including indexing pipeline and frontend. Highlight the tradeoff differences of the components.
Design an approach to distributed rate limiting of an API that can handle one million QPS hitting the API endpoints. Optimize for less cross-regional bandwidth and come up with a reasonable bill of materials.
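For the rate limiting question, a common single-node building block is the token bucket. The sketch below shows only that local piece and assumes that each region or frontend gets its own share of the global budget, which is one possible way to keep cross-regional traffic low. The class name, the parameters, and the injectable clock are all choices made for this example.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum number of stored tokens
        self._clock = clock         # injectable so the bucket can be tested without sleeping
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and consume tokens if the request may pass, else False."""
        now = self._clock()
        # Refill according to the time passed since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

In a non-abstract design the interesting follow-up questions are how many of these buckets exist, how their budgets are rebalanced between regions, and how much bandwidth that rebalancing costs.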

It’s not too hard to come up with questions yourself. Just make sure they are non-abstract, meaning there are numbers to deal with (e.g. QPS) or requirements that need to be met (e.g. must be globally distributed, requires strong consistency). Also, force yourself to come up with a bill of materials at the end. In my experience it does make a huge difference if one is just drawing diagrams compared to thinking about the hardware impact of design decisions. Each piece of hardware might be a small failure domain of its own and needs to be dealt with accordingly.
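To show what attaching numbers can look like, here is a back-of-the-envelope sketch for the serving tier of an image sharing service. Every figure in it is a made-up assumption (per-machine throughput, headroom, number of spares, average response size), and the two helper functions are hypothetical. Only the 50,000 QPS target comes from the question above.

```python
import math


def machines_needed(peak_qps, qps_per_machine, headroom=0.3, spares=2):
    """Machines for the peak load while keeping `headroom` unused, plus spare machines."""
    usable_qps = qps_per_machine * (1.0 - headroom)
    return math.ceil(peak_qps / usable_qps) + spares


def egress_gbps(peak_qps, avg_response_bytes):
    """Outgoing bandwidth in gigabits per second at peak load."""
    return peak_qps * avg_response_bytes * 8 / 1e9


# Assumed inputs: 2,000 QPS per machine and 200 kB per served image.
frontends = machines_needed(peak_qps=50_000, qps_per_machine=2_000)
bandwidth = egress_gbps(peak_qps=50_000, avg_response_bytes=200_000)
```

With these assumed inputs the sketch arrives at 38 machines and 80 Gbit/s of egress. The second number immediately raises the next design question, namely whether 38 machines can push that much traffic or whether the network, not the CPU, dictates the machine count.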

After a couple of days digging into the topic I focused on what a system design interview looks like. Although the tech is a necessary requirement for mastering a NALSD interview I figured that outstanding communication between me and the interviewer is probably the most valuable signal I can provide.

Sebastian Kirsch’s talk Interviewing for System Design Skills provided me with insights on how Google runs SRE system design interviews. Of course they would not ask me the very same questions that have been discussed publicly in the talk. But having an idea of what is important for the interviewer made me better at providing positive signals during the interview. The question I got in my on-site NALSD interview was much more complex than the ones from the talk. However, it was less complex than the one discussed in the Site Reliability Workbook. I found the actual question very interesting and engaging. In fact, when the time was up both the interviewer and I were sorry we couldn’t discuss a further iteration. I was full of ideas for further optimization and we discussed them on my way out of the building. While this was the interview that I was most scared of, it felt like the most productive and fun one.

I can’t write about NALSD without pointing to Jeff Dean’s talk about Building Software Systems at Google. While this isn’t the newest publication on the topic of distributed systems it does highlight interesting challenges Google had to overcome.

Furthermore, I would like to point out that preparing for the NALSD interview benefits hugely from pairing up with a learning buddy. Like whiteboard coding, designing a distributed system is neither easy nor does it come with a single solution. For me the learning in NALSD often happened when I was exposed to different approaches or perspectives. This is why I highly recommend looking for an interview buddy at this stage to start bouncing ideas back and forth.

Management and Leadership Interview Preparation

I was a bit surprised by how thoroughly Google tested my management and leadership skills. Three out of the seven interviews were about management and leadership. That gave me a strong hint that these interviews are as important to Google as the technical interviews. I had to reflect on my past leadership experiences in the army, on management in non-profit organizations and on my team leading in the realm of event management. I also had to think about how I approached education (which I am passionate about), coaching and mentoring. If I wanted to demonstrate those skills in an interview, I knew I’d better have my story straight and be aware of what I am and am not good at.

Dan leading a group of camouflaged soldiers. — The first group of recruits [blurred] which I had the honor to lead. We all make an angry face to scare the photographer. Shot with a potato cam during an exercise.

I started my self-explorative journey by writing down two or three STAR-formatted stories for each of the following questions:

How did I handle conflict in the past? Which situations did I approach in an understanding, democratic, decisive, or thoughtful way, or with clear priorities that are in line with my values? How did that work out for each of them?
How did I approach performance management in a team? How did I deal with under-performers in my team?
How do I lead? How does my leadership style work in different situations? What does “being a leader” mean to me?
What are my values? What are the principles that guide my decision making when push comes to shove?
Looking at past challenges: What went well and what went wrong?
How did I ensure safety? In a tech job this is primarily about psychological safety but when I served my country this often included the physical safety of my team as well.
How did I develop a vision (pro tip: never alone) and how did I share the vision with a group of people?
What opportunities did I have to inspire people and how did I (often unconsciously) inspire people? To be honest, I do not really understand how it happens but I have been told that some find me inspiring.
How did I overcome a difficult situation?
What is a tough problem I solved? What aspects of leadership and management did I apply to the situation?
How did I create, establish, or support a healthy culture? What makes a healthy culture in my opinion anyway?
How did I assess critical situations?
How did I drive impact in the past? How did I measure the impact? How did it turn out in the long run?
How did I foster creative thinking?
How did I drive improvements, influence others, and take ownership?
And lastly, what are some success stories of mine showing communication skills, working under pressure, working in a team, and executing complex, technical projects? You gotta self-advertise a little in those interviews, I thought. It didn’t hurt. 👍

That left me with a large bank of thought-through answers. I went over the bank a couple of times to make sure I remember the key facts and have them ready when needed. My idea was to not start reflecting during the interview but to have pre-reflected signal ready to share.

Regarding the craftsmanship of management I revisited popular project management frameworks and methodologies and thought about what I liked and disliked about each of them. I spent some time thinking through challenges that may require a particular framework.

Another topic that I brushed up on was German labor law. Just enough to make sure I don’t come up with crazy off-the-table solutions to hypothetical questions. I wasn’t asked anything in that direction in the actual interviews. 🤷

Finally, I thought about common questions and how I would answer them, including:

Why did I want to leave my old job? (I did not really, although I wished I had more budget and autonomy)
Why did I want to work at Google? Putting aside all the typical “it is considered so cool” reasons, what exactly got me excited about the company and the mission? I wrote about that extensively in The Organization, Culture, and Me.
What were my salary expectations? I found this hard to answer and provided a ballpark number only. Google was, unlike Facebook, happy to meet me there. Maybe I low-balled myself? 🤨

Management? Seriously?

Knowing that I am a capable engineer I could easily find fulfillment in engineering work. So why did I want to move to management and give up the joy and satisfaction of working with technology all day long? That turned out to be a tough question and I’m not entirely sure I have found the answer yet. Let me try to sketch out my thought process for your entertainment.

Management was always a bit out of my comfort zone, yet I have led or managed small groups from my early days in school up until now. Whenever I went out of my comfort zone I learned a new skill or improved on something unexpected. I enjoyed those moments and find they are worth the struggle of being slightly uncomfortable at work.

The crucial question was: Am I ready and willing to give up the technical work I love to serve as a manager? 🤔

Weaknesses

Please note that this section is a rant. ☝️ According to the all-knowing dumpster fire the Internet is, a particularly popular question in interviews reads: What are your weaknesses? Some experts advise answering the question in a mildly manipulative way by making it actually about a strength. For example, by saying something like this:

My weakness is that I am not very patient and that I want to get things done.

I call bullshit! 💩 If you really think that is your greatest weakness I command you back to self-reflection class 101. Let’s assume that after spending a while with yourself you came to the conclusion that this is really your greatest weakness. Then please elaborate on how it affects your team and your stakeholders, and on how you raise awareness of your weakness. I would also like to hear about the measures you plan to put in place to prevent your impatience from damaging the team, the culture, or the organization’s goals. If you can’t honestly talk about your weaknesses, how will you be able to manage them? Why would you be afraid of self-reflection but willing to take over responsibility for a whole team?

Here is what I believe represents a much better answer:

One of my weaknesses is something I would call a decision making muscle memory. I sometimes have a hard time re-thinking a challenge from a different perspective. I have to constantly remind myself to listen to other people’s opinions and think a seemingly understood problem through from their perspective. This has to happen before I make a judgement call, otherwise I limit my own thinking. I mitigate this by reminding myself to pause and think before making a decision. It is this short pause that I need to gain perspective.

Or even this one:

A constant challenge is to overcome my introversion. Many human interactions are incredibly stressful for me. Even day-to-day interactions like making small talk with the barista while getting a coffee drain my people battery. I learned to overcome my shyness and fear but it does not come easy. Additionally, I learned how to recharge quickly and efficiently. Fitness and strength training is one of the me-time activities that energize me. Another activity I enjoy is cycling alone on a summer day and reflecting on the past days. Just as much as possible without causing a road accident. I also deeply enjoy flying in a passenger jet without anyone talking to me while I look out of the window and imagine going beyond the stratosphere one far away day.

I believe it is important to know one’s weak spots. Once known, weaknesses can be worked on, self-awareness can be increased, and the possibly negative consequences can be dealt with or mitigated more effectively. Tech companies love their employees to be creative in the process and mindful of their actions. Showing self-reflection skills in an interview is a positive signal to the interviewer. Honesty and self-awareness are the only way to turn a weakness into a meaningful strength in an interview situation.

Leaders are Readers

I love to say readers are leaders and leaders are readers. There’s just so much truth to that. Here’s a list of books that influenced my management style and that I can recommend:

The Five Dysfunctions of a Team by Patrick M. Lencioni
The 7 Habits of Highly Effective People by Stephen R. Covey. It is my strong belief that leadership requires constant critical reflection and self-development. This old but very applicable book helped, and still helps, me to shape my personal values, principles, and eventually myself. It’s one of the books one should read every 5 to 10 years.
The First 90 Days by Michael D. Watkins. It’s full of hands-on support for getting traction after a transition. That applied to me as I was seeking a role change from Tech Lead to Manager.
Trillion Dollar Coach by Eric Schmidt, Jonathan Rosenberg, and Alan Eagle. An easy read, something for the summer vacation, but full of truths and interesting stories about Silicon Valley legend Bill Campbell.
The Manager’s Path by Camille Fournier

Table of Contents

Am I Even Qualified? (My Path to SRM)

The (un)importance of certificates

SRE WTF?

Getting my hands dirty

The Outcome

What does it mean to be qualified?

Further Reading

Writing my Resume

Meaningful impact

Example

Playing the “fair game”

Resumes and cultural background

Conclusion

Further Reading

The Organization, Culture, and Me

I have a question…

Answers

Summary

An Infinite Ocean of Knowledge

My growing island of knowledge

A method to structure the curriculum

A worked example

Initial discovery

Prioritization

Learning

Application

Summary

Further Reading

Phone Screening Preparation (My Path to SRM)

Perspective

Coding Interview Preparation

Choosing a curriculum

Train as you fight!

Testing

Resources

Troubleshooting Interview Preparation

Troubleshooting

UNIX and Linux Internals

Conclusion

Further Reading

Non-abstract Large System Design Interview Preparation

Further Reading

Management and Leadership Interview Preparation

Management? Seriously?

Weaknesses

Leaders are Readers

Further Reading