I’ve started to assemble some resources on Site Reliability Engineering to pass on to colleagues and friends who want to dive into the topic and need a good starting point.
Aside from specific SRE resources, it makes sense to have a common understanding of how infrastructure is run in the modern age. And to be honest: if you think it is simply about running infrastructure, nay, think again. We’ve come a long way since the early 2000s. While it is not about SRE specifically, I’d recommend giving The Unicorn Project by Gene Kim a whirl. Packaged as an engaging novel, it passes lots of modern paradigms on to the reader.
Disclaimer: the list of resources is not exhaustive and is simply meant to offer a head start into the topic.
One could refer to the book Site Reliability Engineering as the bible of the field. It is the foundational work that a lot of other material refers to.
The subtitle “How Google Runs Production Systems” clearly states what it is about. It is fairly dense. I found it best read alongside the book Seeking SRE.
Seeking SRE is made up of chapters written by different authors, each of them from the industry and with substantial first-hand SRE experience. It ranges from the story of SRE at Spotify and SoundCloud to a chapter on SRE anti-patterns.
The Google Cloud blog published an article explaining the SRE fundamentals: SLIs, SLAs and SLOs. On the subject of defining SLOs and the pitfalls associated with them, Femi Agbabiaka wrote a nice post: SLO pitfalls.
Audio material - Podcast episodes
There are many podcast episodes on this subject out there.
The New Stack Makers has a piece on The evolution of the Site Reliability Engineer.
The Cloud Cast actually has quite a few good episodes that touch on the topic:
- SRE and Infrastructure Operations
- Has SRE replaced DevOps?
- Real-World SRE Perspectives
- SRE lessons from the trenches