Unfortunately, Non-abstract Large System Design (NALSD) is a lesser-known concept in most of the engineering world. When I started preparing for the NALSD interview, I found resources on pure NALSD to be rare. Therefore, I primarily used traditional system design wisdom to prepare for this type of interview. I found tons of material on traditional system design and the respective interview questions, including recorded mock interview sessions. I used those resources to get a better understanding of how distributed systems are designed nowadays. On top of that, I built an understanding of NALSD by attaching realistic numbers to the abstract designs I read about.
NALSD describes a skill critical to SRE: the ability to assess, design, and evaluate large systems. Practically, NALSD combines elements of capacity planning, component isolation, and graceful system degradation that are crucial to highly available production systems. Google SREs are expected to be able to start resource planning with a basic whiteboard diagram of a system, think through the various scaling and failure domains, and focus their design into a concrete proposal for resources.
Once again expert interviewer Polarbärli had to critically look at my questionable drawings.
I highly recommend ingesting the excellent chapter on NALSD from the Site Reliability Workbook. Reading the chapter repeatedly and running the numbers myself on a whiteboard helped me understand how NALSD is different from traditional system design. I thought of some questions myself and tried to run them through the same schema. Those included:
- Design an image sharing service like Imgur and come up with a bill of materials for serving 50,000 queries per second (QPS).
- Design a log ingestion service like Stackdriver including indexing pipeline and frontend. Highlight the tradeoff differences of the components.
- Design an approach to distributed rate limiting of an API that can handle one million QPS hitting the API endpoints. Optimize for less cross-regional bandwidth and come up with a reasonable bill of materials.
It’s not too hard to come up with questions yourself. Just make sure they are non-abstract, meaning there are numbers to deal with (e.g. QPS) or requirements that need to be met (e.g. must be globally distributed, requires strong consistency). Also, force yourself to come up with a bill of materials at the end. In my experience, it makes a huge difference whether one just draws diagrams or also thinks about the hardware impact of design decisions. Each piece of hardware might be a small failure domain of its own and needs to be dealt with accordingly.
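To make the bill-of-materials step concrete, here is a minimal back-of-the-envelope sketch for the image sharing question above. Only the 50,000 QPS target comes from the question. The average image size, the NIC speed, the usable bandwidth fraction, and the N+2 spare count are assumptions chosen for illustration, and the sketch treats network egress as the only bottleneck. A real exercise would run the same arithmetic for disk, memory, and CPU and then take the largest result.

```python
import math

# All figures below are illustrative assumptions, not measurements.
PEAK_QPS = 50_000                 # target from the example question
AVG_IMAGE_BYTES = 200 * 1024      # assumed average image size (200 KiB)
NIC_BITS_PER_SEC = 10 * 10**9     # assumed 10 Gbit/s NIC per server
NIC_USABLE_FRACTION = 0.7         # assumed headroom for bursts and protocol overhead
SPARE_SERVERS = 2                 # assumed N+2: one planned plus one unplanned outage


def servers_for_egress(qps, bytes_per_response, nic_bits_per_sec, usable_fraction):
    """Return the number of servers needed if network egress is the bottleneck."""
    egress_bits_per_sec = qps * bytes_per_response * 8
    usable_bits_per_server = nic_bits_per_sec * usable_fraction
    return math.ceil(egress_bits_per_sec / usable_bits_per_server)


def bill_of_materials(qps=PEAK_QPS):
    """Return a small dict summarizing egress and server counts for the given QPS."""
    serving = servers_for_egress(
        qps, AVG_IMAGE_BYTES, NIC_BITS_PER_SEC, NIC_USABLE_FRACTION
    )
    return {
        "egress_gbit_per_sec": qps * AVG_IMAGE_BYTES * 8 / 10**9,
        "serving_servers": serving,
        "spare_servers": SPARE_SERVERS,
        "total_servers": serving + SPARE_SERVERS,
    }


if __name__ == "__main__":
    for item, value in bill_of_materials().items():
        print(f"{item}: {value}")
```

With these assumed inputs, the egress comes to roughly 82 Gbit/s, which means 12 serving machines plus 2 spares. Changing a single assumption, such as the average image size, shifts the whole bill of materials, and that is exactly the kind of sensitivity worth talking through with the interviewer.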
After a couple of days of digging into the topic, I focused on what a system design interview looks like. Although the tech is a necessary requirement for mastering a NALSD interview, I figured that outstanding communication between me and the interviewer is probably the most valuable signal I can provide.
Sebastian Kirsch’s talk Interviewing for System Design Skills, which by the way I attended in person, provided me with insights on how Google runs SRE system design interviews. Of course they would not ask me the very same questions that have been discussed publicly in the talk. But having an idea of what is important for the interviewer made me better at providing positive signals during the interview. The question I got in my on-site NALSD interview was much more complex than the ones from the talk. However, it was less complex than the one discussed in the Site Reliability Workbook. I found the actual question very interesting and engaging. In fact, when the time was up, both the interviewer and I were sorry we couldn’t discuss a further iteration. I was full of ideas for further optimization, and we discussed them on my way out of the building. While this was the interview that I was most scared of, it felt like the most productive and fun one.
Designing a distributed system may feel like playing with bricks. An ingenious puzzle.
I can’t write about NALSD without pointing to Jeff Dean’s talk about Building Software Systems at Google. While this isn’t the newest publication on the topic of distributed systems, it does highlight interesting challenges Google had to overcome.
Furthermore, I would like to point out that preparing for the NALSD interview benefits hugely from pairing up with a learning buddy. Like whiteboard coding, designing a distributed system is not easy, and there is never just one solution. For me, the learning in NALSD often happened when I was exposed to different approaches or perspectives. This is why I highly recommend looking for an interview buddy at this stage to start bouncing ideas back and forth.
- Designing and Operating Highly Available Software Systems at Scale
- The System Design Primer is a repository containing the basics of system design, flash cards, example questions, and solutions.
- A list of engineering blogs of companies building distributed systems
- Must-read papers and thought-provoking reads
- Todd Hoff occasionally posts about system design in the High Scalability blog.
- How to Succeed in a System Design Interview