Troubleshooting Interview Preparation (My Path to SRM)

I found the troubleshooting interview preparation to be one of the more fuzzy ones. The first question I asked myself was: What is Troubleshooting anyway?

Cleverly combining talking to my recruiter with sophisticated Intawebz research I came up with the following definition:

Troubleshooting in the context of interviewing is the ability to approach problem solving in an educated, logical, and structured way. It requires communicating by sharing thoughts and ideas with the interviewer while working through a (potentially networked) distributed system scenario.

I have been running my own servers for a long time. Only recently I reduced my Internet footprint and handed some task over to grown-ups such as Gmail and Netlify.

For many, sometimes painful, often exciting and always educating years I ran an Autonomous System, mail servers, web servers, version control, build pipelines, and of course configuration management. That inevitably left some scars but also contributed to my troubleshooting skills. However, my motivation was usually operational. I wanted to get something fixed and learned just as much of the basics that I would need to get the job done. After all, this was all a hobby project and I had only so much time. Therefore, enhancing my skills was advised. I figured I need to do a couple of things to really call myself a troubleshooter:

  • Deepen my understanding of how computers work in general.
  • Revisit how hardware and the operating system interact in certain constellations.
  • Refresh a little on what makes a modern operating system. For example, update my knowledge on memory management, device management, scheduling and processes, synchronization, file systems and network drivers.
  • Broaden my portfolio of tools used to inspect metrics and be able to reason about the metrics they show me. What does load average mean, anyway?
  • Knowing more about common bottlenecks and how to verify them. For example, Is this slow process disk IO-bound or is it suffering from memory pressure effects? or Which one of these processes is the cause for frequent context switching and how could it become a better behaving citizen?
  • Refresh a little on some networking edge cases, such as TCP window size problems or how the buffer queues look like during retransmit. I would call computer networking one of my stronger skills and spent almost no time in refreshing those skills. If I’d be new to computer networking I would read the TCP/IP Illustrated books by William Richard Stevens and Computer Networks by Andrew S. Tanenbaum.
  • I needed to learn more from my own and others failures. That is, reading post mortems, going through reports on user mailing lists for distributed systems open source projects, and watching talks from conferences where folks showed off how they overcame bottlenecks and performance issues.
  • Whenever possible I needed to practice troubleshooting. Pairing with others in troubleshooting can be very educating. I have learned about new tools and different approaches by watching over the shoulders of experienced troubleshooters.

Bite Size Networking by Julia Evans is a comic-style approach to learning about troubleshooting tools.

Troubleshooting

As so often when facing a new challenge the main question was: Where to start? Luckily the Internet had me covered. I compiled a list of related websites, tweets, mailing list archives, source code, and videos. They are entry ways to deep rabbit holes of knowledge often containing fascinating distractions in the form of useless niche knowledge. Great discipline was necessary to stay on track and not get trapped in an infinite loop of learning. I managed to pull the brakes most of the time. Here are some of the rabbit holes that I went down into:

UNIX and Linux Internals

On top of the technical skills, the Site Reliability Manager interview track has to test for management and leadership capabilities. Three out of my seven interviews were in the management realm leaving only four interview slots to assess me technically. Maybe that is the reason why my troubleshooting interview also dug into UNIX and Linux internals. I heard that for some roles in SRE a dedicated UNIX and Linux system internals interview is scheduled.

Since it had been a while since I had read or written kernel code I was in for a refresh. Here are my starting points as good as I can remember:

  • I read the book The Unix Programming Environment by Brian W. Kernighan and Rob Pike. It’s a fascinating mix between UNIX history and solid explanation of the underlying ideas and principles. Even if it sounds boring here and there to the experienced administrator. There are hidden pearls in every chapter.
  • The Linux Programming Interface by Michael Kerrisk is a thick book. I have only looked at some chapters since it is so overwhelming. However, the author offers multi-day training courses on the topic. I have considered getting one but we were not able to find a suitable date. I keep an eye on it since I still want to get that course.
  • Unbeknownst to some, the Linux kernel comes with its own documentation. It sits right next to the source code awaiting the interested reader. For example: spinlocks.rst
  • I found browsable source and clickable identifiers helpful in skimming through some kernel code. I remember looking at inodes and the ext4 superblock. Why don’t you try to find out more about the difference of fork() and clone() by finding these syscalls in the source code of Linux v5.1?
  • I watched some videos from YouTube playlists UNIX terminals and shells and UNIX system calls to check that I wasn’t missing a crucial concept.

Thinking about TCP retransmits wasn’t an every-day activity for me and needed a quick refresh as well.

Conclusion

My troubleshooting interview went reasonably well. I think I could have found the problem faster if I took a more structured approach but eventually I found the problem and carefully mitigated it without harming production processes which ran on the same (fictional) machine.

As a closing note may I repeat myself in saying that interviewing is most and foremost about communication. This applies even more so to the troubleshooting interview which is particularly communication heavy. The technical skills are a must have but they alone will not be sufficient.

Further Reading



🔬 Experimental Feature: Subscribe here to receive new articles via email! 🔬