Site Reliability Engineering: How Google Runs Production Systems

Book written by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy

Book review by Rick Howard

Executive Summary

Site Reliability Engineering: How Google Runs Production Systems is the consummate DevOps how-to manual. Where one of this year’s Cybersecurity Canon Hall of Fame inductees, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win, discusses the overarching DevOps concepts in novel form, Site Reliability Engineering, written by Google engineers, provides the practical knowledge necessary to build your own DevOps program. The only shortcoming is that the authors don’t consider security operations as part of their SRE team and only barely mention how SRE might improve security operations. That said, this is an important book and should be part of the Cybersecurity Canon. It shows the way that we all should be thinking about deploying and maintaining our IT and security systems.

Introduction

The Cybersecurity Canon project is a “curated list of must-read books for all cybersecurity practitioners – be they from industry, government or academia – where the content is timeless, genuinely represents an aspect of the community that is true and precise, reflects the highest quality and, if not read, will leave a hole in the cybersecurity professional’s education that will make the practitioner incomplete.” [1]

Last year, I was blown away by a most interesting book called The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win, by Gene Kim, Kevin Behr and George Spafford. [2] As the title implies, it is a primer on the DevOps philosophy. After further research, I discovered that internet giants like Google, Amazon, Netflix, Salesforce and Facebook adopted the concept to dominate their competitors. [3] In my review of the book, I said that “... the concept of DevOps is perhaps the most important innovation that has happened to the IT sector since the invention of the personal computer back in the early 1980s.” [4]

I talk to a lot of network defenders from all over the world. Most say that they have adopted the DevOps philosophy. But when I probe them about the projects they are working on, it is clear that most do not understand what DevOps is. Many think that if they deploy applications to the cloud, they are doing DevOps. This couldn’t be further from the truth. To help them, I usually hand them a copy of The Phoenix Project and point them to my review of the book. [4] The Phoenix Project is so important that the Cybersecurity Canon Committee inducted it into the Hall of Fame at this year’s ceremony. [5]

But DevOps is a complicated subject. It is tough to get your hands completely around what it might mean to your organization in the future. The concepts are so foreign to what a typical IT or InfoSec staff has done traditionally that many network defenders do not know where to begin to transition their own staffs into this new kind of organization. The Phoenix Project is filled with concepts but is a little short on the mechanics of how to apply them in the real world. And then I found this book: Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy. [6] If The Phoenix Project is a philosophy, then Site Reliability Engineering is the how-to manual.

What Is SRE?

The book is written by a gaggle of Google Site Reliability Engineers (SREs), and tells the story of how they transformed the installation and operation of the Google Search engine webpage back in 2004 into the autonomous system of systems today that makes up the Google juggernaut of search, ads, Gmail, Android, YouTube and other services, along with the distributed undercarriage that keeps it all running.

Google search began on the Stanford University website in August 1996. [7] In August 2004, Larry Page and Sergey Brin took the company public. Right about then, the Google engineering team invented “Site Reliability Engineering” when they asked a software engineer to design and run an operations function. That was a game changer.

The “S” in Site Reliability Engineering refers to the original Google search website. The “R” stands for reliability, meaning that out of all the things that the SREs worry about in maintaining the site, reliability is the most important. They focus on mean time to failure (MTTF) and mean time to repair (MTTR) as key metrics, and they use response playbooks to address recurring issues. Their data suggests that using a playbook instead of “winging it” produces a 3x improvement in MTTR. They also automate progressive software rollouts so that they can detect problems early and safely roll back the system if necessary. Interestingly, they started implementing this DevOps philosophy at least six years before the DevOps community named itself. [4]
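To make the MTTF and MTTR metrics concrete, here is a minimal sketch, not taken from the book, that applies the standard steady-state availability formula, MTTF / (MTTF + MTTR), to show what a threefold reduction in MTTR would buy. The function name and the sample numbers are assumptions chosen purely for illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service that fails roughly once every 1,000 hours.
MTTF_HOURS = 1000.0
winging_it = availability(MTTF_HOURS, mttr_hours=3.0)     # no playbook
with_playbook = availability(MTTF_HOURS, mttr_hours=1.0)  # MTTR cut to one third

print(f"without playbook: {winging_it:.4%}")
print(f"with playbook:    {with_playbook:.4%}")
```

The point of the sketch is that, with the failure rate held constant, shortening repairs is a direct lever on availability.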

We all know the benefits of automating tasks, but the Google SREs have taken it to the nth degree. They realize that automation is not a panacea, but it is a force multiplier. Done correctly, it layers a blanket of consistency across the entire organization and, once built, the emerging platform can be easily extended. The platform centralizes mistakes and aids faster repairs.

Google SREs define reliability as: “The probability that [a system] will perform a required function without failure under stated conditions for a stated period of time.” [6] They believe that building a team of system admins who, as a function of their job, have to physically touch most of the machines they are updating or fixing is inherently inefficient and does not scale. They call this manual effort “toil,” and they describe it as anything that is manual, repetitive, tactical and devoid of any enduring value. They also believe that the tradition of separating teams into two buckets, development and operations (ops), automatically creates conflict around common vocabulary, assumptions about risk, technical solutions and product stability requirements. This causes gaps in trust and respect between groups. Their solution is to combine both tasks into one SRE role. The result is a team of people who hate doing stuff manually and who develop software to maintain their systems.

The “E” in SRE stands for engineering. The SRE mantra is that the team spends no more than 50 percent of its time touching machines. The rest of the time is dedicated to developing automated systems, with a focus on availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning. If the SRE team’s manual workload exceeds 50 percent, they kick the application in question back to the development manager for review. When they do have to manually touch a machine, they are fanatics about postmortem analysis. SREs use that analysis to automate the fixes to the system in the future.

The business responsible for the service or application determines the availability target. If, for example, the availability target is 99.99 percent, then the SRE error budget is 0.01 percent. That means that the SREs are not driving for zero outages; they are driving to stay within the error budget. It also means that, as long as they are within that error budget, they can spend the remainder on maintenance, updates and otherwise maximizing feature velocity. This means that the SRE team is incentivized to push new code as quickly as possible. System reliability is not about uptime, per se, but rather about service request success rate.
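The error-budget arithmetic can be sketched in a few lines. This is an illustrative example under stated assumptions rather than anything from the book: the function names and the request counts are hypothetical, and the budget is measured in failed requests because the paragraph above frames reliability as request success rate. Exact fractions are used to avoid floating-point rounding on values like 99.99.

```python
from fractions import Fraction


def error_budget_percent(availability_target_percent: str) -> Fraction:
    """Error budget is whatever the availability target leaves over."""
    target = Fraction(availability_target_percent)
    if not 0 <= target <= 100:
        raise ValueError("availability target must be between 0 and 100")
    return 100 - target


def allowed_failures(total_requests: int, availability_target_percent: str) -> int:
    """Number of requests that may fail before the budget is exhausted."""
    budget = error_budget_percent(availability_target_percent)
    return int(total_requests * budget / 100)


def budget_remaining(total_requests: int, failed_requests: int,
                     availability_target_percent: str) -> int:
    """Positive: room to keep shipping. Negative: budget is overspent."""
    allowed = allowed_failures(total_requests, availability_target_percent)
    return allowed - failed_requests


# Hypothetical month: 1,000,000 requests against a 99.99 percent target.
print(error_budget_percent("99.99"))              # budget, in percent
print(allowed_failures(1_000_000, "99.99"))       # failures the budget allows
print(budget_remaining(1_000_000, 40, "99.99"))   # budget left after 40 failures
```

A team could gate releases on `budget_remaining` staying positive, which is one way to read the incentive described above.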

Google’s Infrastructure

SREs define a machine as a piece of hardware or a virtual machine. They define a server as a piece of software that implements a service running on one of those machines. They have built their own cluster management system, called “Borg,” that orchestrates resource allocation across services and machines. Borg is “a system that moved away from the relatively static host/port/job assignments of the previous world, toward treating a collection of machines as a managed sea of resources.” Groups of machines make up a rack. Racks stand in a row. One or more rows form a cluster. Multiple clusters exist in a data center. Multiple data centers close together form a campus. They have also developed their own virtual switch that enables very fast communication between machines in a given data center; it handles tens of thousands of ports and supports 1.3 Pbps of bisection bandwidth.

According to Professor Jun Zhang from the Laboratory for High Performance Computing & Computer Simulation at the University of Kentucky, "The bisection bandwidth is the minimum volume of communication allowed between any two halves of the network." [8] Google connects its data centers with another service called “B4” that provides a globe-spanning backbone.

SREs use tools like Puppet, Chef, CFEngine and even Perl to automate their work. But Google’s infrastructure is not just automated; it is autonomous. The SREs have developed an entire philosophy on monitoring, alerting, product testing, troubleshooting, emergency response, regression testing, preventing cascading failures, scheduling, data processing pipelines and data integrity, using both white-box and black-box data. They don’t alert on potential problems, such as a database that might be slow. They alert on what’s broken and why. They monitor latency, traffic, errors and saturation. Their production tests don’t just find misconfigurations; they fix them on the fly. Consequently, they have to automate the management of their source code too. They keep a common database of bugs introduced in earlier releases to make sure they do not reappear in future releases (regression testing).
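The four monitored signals and the alert-on-symptoms rule can be sketched as a small check. This is a hedged illustration only: the class names, field names and threshold values are hypothetical and do not come from the book or from any Google tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenSignals:
    latency_ms: float    # time taken to serve a request
    traffic_rps: float   # demand on the service, in requests per second
    error_ratio: float   # fraction of requests that fail, 0.0 to 1.0
    saturation: float    # fraction of capacity in use, 0.0 to 1.0


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical limits; a real service would derive these from its targets.
    max_latency_ms: float = 500.0
    max_error_ratio: float = 0.001
    max_saturation: float = 0.9


def broken_symptoms(s: GoldenSignals, t: Thresholds = Thresholds()) -> list[str]:
    """Return alerts only for user-visible symptoms that are happening now."""
    alerts = []
    if s.latency_ms > t.max_latency_ms:
        alerts.append(f"latency {s.latency_ms:.0f} ms exceeds {t.max_latency_ms:.0f} ms")
    if s.error_ratio > t.max_error_ratio:
        alerts.append(f"error ratio {s.error_ratio:.4f} exceeds {t.max_error_ratio:.4f}")
    if s.saturation > t.max_saturation:
        alerts.append(f"saturation {s.saturation:.0%} exceeds {t.max_saturation:.0%}")
    # Traffic is recorded as context for the other signals, not alerted on here.
    return alerts


healthy = GoldenSignals(latency_ms=120.0, traffic_rps=800.0, error_ratio=0.0002, saturation=0.55)
degraded = GoldenSignals(latency_ms=900.0, traffic_rps=800.0, error_ratio=0.02, saturation=0.55)
print(broken_symptoms(healthy))
print(broken_symptoms(degraded))
```

Each alert string names what is broken and by how much, which mirrors the "what's broken and why" rule rather than paging on a hunch that something might degrade.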

I am amazed.

Security

The only knock on the book is that the SRE authors consider SRE work to be separate from security work and do not specifically address how security fits into their environment. They do mention in passing that they use system monitoring to detect breaches and that they have worked hard to reduce SRE privileges to the absolute minimum. They made sure that SREs do not have god-like powers by replacing SSHD with an RPC-based local admin daemon. In other words, SREs can’t make global changes to the system. They get authority to change local systems, and they run every commit through a gated system in order to make it very difficult for SREs to exceed their authority. Gated operations include approving source code changes, specifying the actions to be performed during a release, creating a new release, approving the initial integration, deploying a new release and changing a project’s build configuration. The authors also mention a service that scans for application vulnerabilities and the security tasks that happen during launch coordination: security design review, security code audit, spam risk, authentication requirements, use of SSL, and access control against various blacklists. They point to a white paper written by Rory Ward and Betsy Beyer that explains the Google security philosophy, but it is nothing more than Google’s version of Zero Trust. [9] [10] I would have liked to see how the SRE philosophy interacts with the security team.

Conclusion

When I read The Phoenix Project, I predicted that DevOps was the greatest disruptor to the IT sector since the personal computer. But it wasn’t until I read Site Reliability Engineering: How Google Runs Production Systems that I realized how early adopters of the DevOps concept, companies like Google, Amazon, Netflix, Salesforce and Facebook, used it to scale exponentially compared to their competitors and peers. Google’s version of it is called Site Reliability Engineering, and some of their engineers produced this book as a how-to manual to show the rest of us the way forward. My only note of reservation is that this beautiful and complex SRE process does not automatically include the security function. I would have liked the authors to address this omission more completely. That said, this is an important book and should be part of the Cybersecurity Canon. It shows the way that we all should be thinking about deploying and maintaining our IT and security systems. I believe that businesses have about a five-year window to get on board with the DevOps concept and build their own SRE systems. The ones that get there ahead of their competitors will crush them. Those that don’t will vanish.

Sources

[1] "Cybersecurity Canon: Essential Reading for the Security Professional," by Palo Alto Networks, last visited 5 July 2017,

https://www.paloaltonetworks.com/threat-research/cybercanon.html

[2] "The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win," by Gene Kim, Kevin Behr, and George Spafford, IT Revolution Press, 10 January 2013, last visited 7 September 2017,

https://www.goodreads.com/book/show/17255186-the-phoenix-project?ac=1&from_search=true

[3] "10 companies killing it at DevOps," by Christopher Null, TechBeacon, last visited 8 September 2017,

https://techbeacon.com/10-companies-killing-it-devops

[4] Cybersecurity Canon Book Review of "The Cybersecurity Canon: The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win," by Gene Kim, Kevin Behr, and George Spafford, book review by Rick Howard, Cybersecurity Canon, 21 October 2016, last visited 2 September 2017,

https://www.paloaltonetworks.com/blog/2016/10/phoenix-project-novel-devops-helping-business-win/

[5] "'The Phoenix Project' discussion – Cybersecurity Canon 2017," by Palo Alto Networks, interview by Rick Howard, interviewee Kevin Behr, 7 June 2017, last visited 7 September 2017,

https://www.youtube.com/watch?v=ygSvdv-QpUM

[6] "Site Reliability Engineering: How Google Runs Production Systems," by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, Google Landing Page, O'Reilly Media, 16 April 2016, last visited 2 September 2017,

https://landing.google.com/sre/book.html

[7] "The Birth of Google," by John Battelle, Wired Magazine, 1 August 2005, last visited 7 September 2017,

https://www.wired.com/2005/08/battelle/

[8] "Parallel and Distributed Computing; Chapter 3: Models of Parallel Computers and Interconnections," Professor Jun Zhang, University of Kentucky, last visited 8 September 2017,

https://www.cs.uky.edu/~jzhang/CS621/chapter3.pdf

[9] "BeyondCorp: A New Approach to Enterprise Security," by Rory Ward and Betsy Beyer, Google, December 2014, last visited 8 September 2017,

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43231.pdf

[10] "ZERO TRUST NETWORK ARCHITECTURE WITH JOHN KINDERVAG – VIDEO," by John Kindervag, Palo Alto Networks, last visited 8 September 2017,

https://www.paloaltonetworks.com/resources/videos/zero-trust.html