Chaos Engineering: Building Confidence in System Behavior through Experiments
Book written by Casey Rosenthal et al.
Book review by Naveen Zutshi
I would recommend Chaos Engineering for the Cybersecurity Canon.
The concept of distributed computing is not new, having been introduced as early as ARPANET (the Advanced Research Projects Agency Network) in the 1960s. What is different today are changes such as microservices-based software development, auto-scaling, highly decoupled architectures, agile teams, and service-oriented architectures. Under this distributed workload, your system needs to be resilient to service failures and network latency spikes. Additionally, distributed systems at scale become too complex for any architect to understand fully, and relying on basic testing principles is not sufficient.
The authors of Chaos Engineering make a compelling case that once you have figured out service failures and network latency, you need to take the next step: designing and implementing carefully planned experiments in production environments, either to gain new knowledge about the underlying state of the system or to open new avenues of exploration for your teams. In that respect, chaos engineering diverges from testing, since testing is often conducted earlier in the development life cycle and answers binary questions about the existing state. Chaos engineering generates new answers about how the state of the system reacts to a wide array of experiments, such as a region-wide outage, latency between services causing system-wide outages, and function-based chaos, among others.
My overall sense after reading the book is that chaos engineering is a nascent field, and for enterprises struggling with basics like automated monitoring and building resilient systems, chaos engineering will not help. It is ideal for organizations that have conquered the common use cases of distributed architecture complexity. The other thought I have is that, in some ways, chaos engineering is similar to deep penetration testing or red team testing by security teams because previously unknown threat vectors are being discovered, and experiments in production are based on the hypothesis of exposing vulnerabilities within the state of the security system.
The difference between testing and chaos engineering:
- Chaos engineering is a practice of generating new information, whereas testing makes an assertion: given certain system conditions, the system will emit a specific output. Tests are typically binary and, strictly speaking, don’t generate new information about the system; instead, they assign valence to a known property of it. Chaos engineering, on the other hand, is a form of experimentation that generates new knowledge about the system and often suggests new avenues of exploration. Examples of chaos engineering experiments include: simulating the failure of an entire region or data center; injecting latency between services for a select percentage of traffic over a predetermined period of time; function-based chaos; and maxing out CPU on an Elasticsearch cluster.
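The latency-injection experiment above can be sketched in a few lines of Python. This is a toy illustration with hypothetical names and numbers, not the book's tooling: wrap a request handler so a small, configurable fraction of requests sees extra delay, then observe whether the steady state holds.

```python
import random
import time

# Toy latency-injection wrapper; names and numbers are illustrative only.
TRAFFIC_FRACTION = 0.05   # inject latency into 5% of requests
LATENCY_SECONDS = 0.1     # delay to add (kept small for illustration)

def with_injected_latency(handler):
    """Wrap a request handler so a fraction of calls see extra latency."""
    def wrapped(request):
        if random.random() < TRAFFIC_FRACTION:
            time.sleep(LATENCY_SECONDS)  # simulate a slow downstream dependency
        return handler(request)
    return wrapped

@with_injected_latency
def handle(request):
    # Stand-in for a real service endpoint.
    return {"status": "ok", "request": request}
```

Run such an experiment for a predetermined window and compare business metrics against the unperturbed baseline; if the steady state is breached, abort.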
Who should undertake it and conditions for success:
- Chaos engineering is primarily applicable to distributed scaled-out systems.
- Your system should already be resilient to real-world events such as service failures and network latency spikes.
- A comprehensive monitoring system is needed that provides full visibility into your system’s behavior; otherwise you won’t be able to draw conclusions from your experiments.
Optimization in distributed systems: performance, availability, and fault tolerance
Velocity of feature development describes the speed with which engineers can provide new, innovative features to customers.
Operating under a microservices architecture results in higher feature velocity at the expense of coordination. Chaos engineering comes into play here by supporting high velocity, experimentation, and confidence in teams and systems through resiliency verification.
A distributed system at sufficient scale becomes too complex for any one human to understand, reducing the need for architects who hold the master plan. Comprehensibility can be ignored as a design principle: each subsection should make sense to the engineers who work on it, but the system as a whole doesn't have to.
Request/response chaos: with a spaghetti call graph, chaos is inherent, and classical testing is insufficient since it can only tell us whether an assertion is true or false. We need to discover new properties of the system.
Consider, for example, the "bullwhip effect" from systems theory: a small perturbation in input starts a self-reinforcing cycle that causes a dramatic swing in output. In this case, the swing in output ends up taking down the app. Each microservice could behave rationally, yet taken together, under specific circumstances, they can produce undesirable system behavior.
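A toy simulation makes the bullwhip effect concrete. Assume, purely for illustration, that each stage in a chain of services overreacts to the change it observes by a fixed factor:

```python
# Toy bullwhip simulation: each stage amplifies the perturbation it sees.
def propagate(perturbation, stages=4, overreaction=1.5):
    """Return the perceived perturbation at each stage of the chain."""
    seen = [perturbation]
    for _ in range(stages):
        seen.append(seen[-1] * overreaction)  # each stage overreacts a little
    return seen

# A 10% perturbation at the edge grows to a ~51% swing four stages deep.
print(propagate(0.10))  # [0.1, 0.15, 0.225, 0.3375, 0.50625]
```

Each stage's overreaction looks individually rational, yet the end of the chain sees roughly five times the original disturbance.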
Principles of Chaos: Chaos engineering as an experimental discipline
How would your system react if we injected chaos into it? We need an empirical approach to system behavior, since a theoretical one doesn't exist. For example, failure injection testing (FIT) adds a failure scenario to the request header of a class of requests at the edge of the service. As these requests propagate, injection points between microservices check for the failure scenario and take action based on it. The principles themselves:
- Hypothesize about a steady state: Use systems metrics and detailed instrumentation to help troubleshoot performance and, in some cases, functional bugs. Which metrics you measure is important. For example, the SREs at Netflix are more interested in measuring SPS (a metric of video-stream starts per second) than raw CPU utilization, because customer satisfaction is highly correlated with customers hitting the play button on their streaming devices. Measure business metrics, not just system metrics, which often sit within a tolerance range or threshold. A typical chaos engineering experiment deliberately causes a noncritical service to fail in order to verify that the system degrades gracefully. You can also redirect incoming traffic between regions and verify that the steady state is not breached, or perform canary analysis.
- Vary real-world events: Set up host instance termination because it happens frequently in the wild and the act of turning off a server is cheap and easy. Since real-world events are numerous, look to experiment on events that have a high frequency of occurrence in the real world and where the cost to experiment is low. Another potential test is faulty code deployment and its impact.
- Run experiments in production: A classical tenet of testing is that the cost of rework increases as bugs are discovered later in the development life cycle. With chaos engineering, the strategy is reversed: you want to run experiments as close to the production environment as possible. Rather than code correctness—which classical testing discovers—chaos engineering is focused on the behavior of the overall system. This includes code, state and input, and other people’s systems, which leads to system behaviors that are difficult to foresee. Stateful systems like databases don’t behave in test systems the way they do in production; thus regression testing is often insufficient for identifying key issues in distributed systems. You should always be able to abort the experiment.
- Automate experiments in production: Start with manual and one-off experiments, since you want to have an appropriate level of apprehension. Ensure care so the experiment runs correctly and minimizes the blast radius, especially in production. Once the experiment is successful, automate to run continuously since you can measure against the dynamic and ever-changing production environment.
- Minimize blast radius: The professional responsibility of a chaos engineer is to understand and mitigate production risks. A well-designed experiment will prevent big production outages by causing only a few customers a small amount of pain. Scale out experiments as they begin to succeed to involve more users and more of the production environment.
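The FIT mechanism mentioned earlier—tagging a class of requests at the edge and letting injection points act on the tag as it propagates—might look roughly like this sketch. The header name, service names, and fallback are hypothetical, not Netflix's actual implementation:

```python
import time

# Sketch of header-based failure injection; all names are hypothetical.
FAILURE_HEADER = "x-chaos-scenario"

class DownstreamUnavailable(Exception):
    """Raised when the injection point simulates a failed dependency."""

def injection_point(service_name, headers, call):
    """Consult the request's chaos scenario before calling a downstream service."""
    scenario = headers.get(FAILURE_HEADER, "")
    if scenario == f"fail:{service_name}":
        raise DownstreamUnavailable(service_name)  # simulate an outage
    if scenario == f"latency:{service_name}":
        time.sleep(0.1)                            # simulate added latency
    return call()

# The edge tags a small class of requests; the tag travels in the headers.
headers = {FAILURE_HEADER: "fail:recommendations"}
try:
    recs = injection_point("recommendations", headers, lambda: ["title-1"])
except DownstreamUnavailable:
    recs = []  # the caller degrades gracefully to an empty list
```

The experiment's hypothesis here would be that callers of the failed service degrade gracefully rather than propagating the outage upstream.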
Chaos in practice
There are limited implementation examples of chaos engineering, though some great ones are shared from financial, business-to-consumer, and several large B2B organizations. Use a disciplined approach: pick a hypothesis, choose the scope, identify the metrics, inform the organization, run the experiment, analyze the results, increase the scope, and then reiterate.
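That loop can be sketched as a driver, with every function below a hypothetical stand-in for real measurement and injection tooling:

```python
import random

# Hypothetical stand-ins for real tooling; the point is the shape of the loop.
def measure():
    """Stand-in for a business metric, e.g., stream starts per second."""
    return 100.0 + random.uniform(-1, 1)

def inject_failure(scope_pct):
    print(f"injecting failure into {scope_pct}% of traffic")

def within_tolerance(baseline, observed, pct=5.0):
    return abs(observed - baseline) / baseline * 100 <= pct

def run_chaos_program(initial_scope_pct=1, max_scope_pct=100):
    """Hypothesize, inform, inject, analyze, then widen the blast radius."""
    scope = initial_scope_pct
    while scope <= max_scope_pct:
        print(f"informing the organization: scope {scope}%")
        baseline = measure()      # steady-state baseline
        inject_failure(scope)     # run the experiment
        observed = measure()      # re-measure the steady state
        if not within_tolerance(baseline, observed):
            print("steady state breached; aborting to analyze")
            return False          # stop, analyze, fix, then retry
        scope *= 2                # increase scope and reiterate
    return True
```

Starting at a 1% scope and doubling only after each success is one way to keep the blast radius small while still reaching full production coverage.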
Finally, if you become sophisticated in the practice of chaos engineering, you can start measuring the maturity of your chaos experiments using sophistication and adoption metrics. Sophistication measures the validity and safety of chaos experiments and can be rated as elementary, simple, sophisticated, or advanced. Adoption measures the depth and breadth of chaos experimentation coverage and can be placed on a maturity model: in the shadows, investment, adoption, and cultural expectation.
Still a very young field, chaos engineering is nascent in its adoption and has its share of critics. It is ultimately a means to an end, with the aim of hardening the production environment against real-world issues. While practicing it in combination with proactive failure testing and post-incident reviews is beneficial, chaos engineering can wreak havoc on stateful systems in production if experiments are not carefully constructed, and such damage is very difficult to roll back.
Chaos Engineering: Building Confidence in System Behavior through Experiments is an easy read, and the parallels to penetration testing conducted by red teams are striking. It is somewhat light on the details of how to build carefully crafted experiments; therefore, I would recommend further reading on Chaos Monkey, failure injection testing, and similar tools. In the crawl, walk, run phases of enterprises moving from a monolithic application state to a microservices-based architecture, chaos engineering belongs in the run phase. Where possible, enterprises should identify the experiments with the lowest effort and smallest blast radius as a set of training wheels for learning the practice of chaos engineering.