Testing in production (TiP) is gaining steam as an accepted practice in the DevOps and testing communities as a way to catch potentially costly defects, because no amount of preproduction QA testing can foresee every scenario in your real production deployment. The prevailing wisdom is that you will see failures in production; the only question is whether you'll be surprised by them or inflict them intentionally to test system resilience and learn from the experience. The latter approach is chaos engineering.
The idea of the chaos-testing toolkit originated with Netflix’s Chaos Monkey and continues to expand. Today many companies have adopted chaos engineering as a cornerstone of their site reliability engineering (SRE) strategy, and best practices around chaos engineering have matured.
The time is right to gain a comprehensive understanding of this approach. And to help you do that, we gathered over 30 of the most popular and well-reviewed links to tools and resources on chaos testing and chaos engineering, neatly grouped and categorized. Read on to see if this relatively new strategy is right for you.
Test in production
Test faster and smarter by testing in production
The advice from Sauce Labs—aimed at TiP beginners—is to test in production intentionally. The real-world feedback it gives you is the perfect supplement to your internal QA process.
Testing in production: Yes, you can (and should)
"Only production is production," says Charity Majors, CEO at monitoring-tool vendor Honeycomb, and you are de facto testing in production every time you deploy code there. Lean into it and practice failure regularly so you can get better at handling it.
TestOps #2—Testing in production
Here's how you can implement TiP through canary releases, blue-green deployments, slow rollout techniques such as controlled test flight, A/B testing, synthetic user/bot-based testing to generate production load, fault injection testing/chaos engineering, and dogfooding.
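Several of these techniques—canary releases, slow rollouts, A/B tests—share one mechanic: deterministically routing a slice of real traffic to the new code. As a minimal sketch (the feature salt and percentages here are illustrative, not taken from any of the linked posts), hash-based canary bucketing might look like this:

```python
import hashlib

def canary_bucket(user_id: str, rollout_percent: int, salt: str = "checkout-v2") -> bool:
    """Deterministically decide whether a user sees the canary release.

    Hashing user_id with a per-feature salt keeps each user's experience
    stable across requests while spreading users evenly over 100 buckets.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

def route(user_id: str) -> str:
    """Send roughly rollout_percent of users to the canary, the rest to stable."""
    return "canary" if canary_bucket(user_id, 10) else "stable"
```

Because the bucket is derived from a hash rather than a coin flip per request, ramping the rollout from 10 to 20 percent only adds users—no one who already saw the canary gets flipped back.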
Every release is a production test
When the Twitter team started on the path to TiP back in 2010, it ran a cost-benefit analysis and concluded that TiP would have high value, but at a high cost (risk of outage), so it came up with these risk-mitigation strategies.
Tie your production tests into your CI/CD pipeline
Testing in production: rethinking the conventional deployment pipeline
The Guardian integrates its production tests into the CI/CD pipeline, linking test results directly to GitHub pull requests to complete the feedback loop for developers. Its RiffRaff deployment tool and Prout pull-request feedback tool are both available as open source.
Salesforce testing best practice: why you should regularly run production tests
This post advocates re-running Salesforce tests in production on a regular cadence (not only at release time) to catch failures caused by system changes early, rather than discovering them after a later deployment. Gearset's testing system provides a framework for doing so.
Minimize the negative impact of production tests
Scientist: Measure twice, cut over once
Use GitHub’s Scientist framework to deploy new releases and send production requests down a new path to potentially uncover new bugs while also preventing end users from experiencing errors due to those bugs. Scientist serves the correct output to your users, compares old (control) and new (experimental) outputs, and alerts you if there's a mismatch. The two-year-old framework has been ported to multiple languages.
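Scientist itself is a Ruby library (with ports in other languages), but the pattern it implements is simple to sketch. This is a hedged, minimal Python illustration of the idea—not GitHub's actual API: run both code paths, publish whether their results matched, and always return the control result so users never see candidate bugs.

```python
import random

def experiment(control, candidate, publish):
    """Scientist-style experiment: run both code paths in random order,
    report whether their results match, and always return the control
    result. Candidate exceptions are recorded, never raised."""
    blocks = [("control", control), ("candidate", candidate)]
    random.shuffle(blocks)  # randomize order so timing bias averages out
    results = {}
    for name, fn in blocks:
        try:
            results[name] = ("ok", fn())
        except Exception as exc:  # a failing candidate must not hurt users
            results[name] = ("error", exc)
    publish(matched=(results["control"] == results["candidate"]),
            control=results["control"], candidate=results["candidate"])
    status, value = results["control"]
    if status == "error":
        raise value  # control errors propagate exactly as they always did
    return value
```

The key design choice, which the real Scientist also makes, is asymmetry: the candidate path is observed but never trusted, so a mismatch becomes an alert instead of an outage.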
Move fast and fix things
GitHub uses Scientist for its own releases. Here it shares the details of one release experiment where the team found and fixed serious issues in its merge code over four days of testing in production—without affecting its users.
Understand the principles of chaos engineering
Principles of chaos engineering
This community-maintained document is a great first introduction to chaos engineering. It defines "chaos engineering"—experimentation on a system to uncover its weaknesses—and lists the principles agreed upon by the chaos-engineering community.
Chaos engineering upgraded
The principles of chaos engineering originated at Netflix, which documented them during the development of Chaos Monkey, its open-source tool for random fault injection. In 2015, the Netflix team augmented its chaos toolkit with Chaos Kong, a tool that mimics the outage of an entire AWS region. This post describes Netflix's Chaos Kong exercise and another experiment, with the Subscriber service.
Breaking things on purpose
Breaking things on purpose is preferable to being surprised when things break, says Mathias Lafeldt, infrastructure developer at Gremlin. When you do it on purpose, you can test breaking things at a time and place that is convenient, he explains in this blog post.
Chaos testing—Preventing failure by instigation
Learn the definition of "chaos testing" from Mark Harrison, senior consultant at Cake Solutions, and get some thoughts from the Chaos Community Day conference.
Practice chaos engineering techniques
Chaos engineering 101
Get started with chaos experiments, from principles to specific steps, with this article by Mathias Lafeldt.
The discipline of chaos engineering
The what, why, and how of chaos engineering as described by chaos-as-a-service provider Gremlin.
Planning for chaos with MongoDB Atlas: Using the "test failover" button
Here's an example of how to do chaos testing with MongoDB Atlas, which comes with chaos built in: its "Test Failover" button deliberately triggers a replica-set failover so you can verify that your application handles the primary election gracefully.
A primer on automating chaos
Here's a walkthrough of the progression of automation in chaos engineering. Don’t worry; you don’t have to automate it all at once.
The limitations of chaos engineering
While chaos engineering is a great tool for improving the resilience of your system, it is not a panacea. Here's where it's a fit—and where it's not.
Use fault injection and chaos tools
Chaos toolkit
The Chaos Toolkit provides an open-source command-line interface that encapsulates the chaos-engineering workflow, along with tutorials for getting started.
The Netflix Simian Army
The well-known “Chaos Monkey” and the rest of Netflix's Simian Army have been used since 2011 to randomly break production systems and see if they are fault-tolerant.
Using Chaos Monkey whenever you feel like it
The original purpose of Chaos Monkey was to test resilience by killing off parts of your production system at random; this engineer uses it to kill Amazon EC2 instances in a controlled way.
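The safety controls that make this "controlled" are the interesting part. Here's a minimal sketch of just the victim-selection logic—the tag names (`chaos`, `protected`) are hypothetical, and the actual termination call to your cloud provider's API is deliberately left out:

```python
import random

def pick_victim(instances, rng=random):
    """Chaos Monkey-style selection: choose one random instance from the
    fleet, considering only hosts that have opted in and never touching
    protected ones. Each instance is a dict like
    {"id": ..., "group": ..., "tags": {...}}."""
    eligible = [
        i for i in instances
        if i["tags"].get("chaos", "off") == "on"    # opt-in only
        and not i["tags"].get("protected", False)   # safety valve
    ]
    return rng.choice(eligible) if eligible else None
```

In a real run you would pass the chosen ID to your cloud API's terminate call during business hours, with on-call engineers watching—randomness in what breaks, but not in when or whether anyone is ready for it.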
FIT: Failure injection testing
Netflix developed the FIT framework in 2014 to give its engineers more control over the chaos.
From chaos to control—Testing the resiliency of Netflix’s content discovery platform
This is an example of using Latency Monkey (from the Simian Army suite) and FIT to test Netflix’s Merchandise Application Platform.
Automated failure testing: Training smarter monkeys
Netflix continued iterating on its toolkit with this 2016 prototype tool based on Molly, a fault injector that uses request lineage data.
How we break things at Twitter: Failure testing
Twitter’s framework for injecting faults into its production system (power loss, network loss, service unavailability) consists of mischief, monitoring, and notifier modules tied together with a Python library. Sadly, it is not open source, but a good architectural overview is provided.
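Since Twitter's code isn't open source, here's a hedged Python sketch of the core "mischief" idea: temporarily swap a dependency for a failing stub, run your checks while the fault is active, and restore the original no matter what.

```python
import contextlib

@contextlib.contextmanager
def inject_fault(obj, attr, exc):
    """Mischief-style fault injection: temporarily replace obj.attr with a
    stub that raises exc, then restore the original even if the test fails."""
    original = getattr(obj, attr)
    def failing(*args, **kwargs):
        raise exc
    setattr(obj, attr, failing)
    try:
        yield
    finally:
        setattr(obj, attr, original)  # always undo the mischief
```

A real framework like Twitter's injects faults at lower layers (dropped packets, unreachable services) and pairs the injector with monitoring and notification, but the contract is the same: every fault is scoped, observed, and reverted.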
Systematic resilience testing of microservices with Gremlin
This open-source Python framework from IBM for fault injection testing of microservices should serve as a companion to—not a replacement for—Chaos Monkey.
ChaosCat: Automating fault injection at PagerDuty
ChaosCat is not open source, but serves as an inspiration. PagerDuty implemented it as an always-on service, with a Slack bot interface for one-off invocation. As a service, it continuously throws randomly chosen attacks at PagerDuty’s hosts.
Pumba—Chaos testing for Docker
Pumba is a newer, Chaos Monkey-like tool for resilience testing of Docker containers: it can kill, stop, or pause containers and inject network faults such as delay and packet loss.
Run game-day exercises
Fault injection in production: Making the case for resilience testing
This seminal 2012 paper from Etsy lays out the argument for testing in production with intentional fault injection, and provides a pattern for constructing a game-day exercise. Like a vaccination, the controlled exposure to failure builds the system's—and the team's—resistance.
3 lessons learned from an Elasticsearch game day
Datadog describes how it ran a game-day event on its Elasticsearch cluster to learn which failure modes were handled easily and which caused unexpected problems.
Game day exercises at Stripe: Learning from 'kill -9'
Stripe suggests that you stick to the simplest failure scenarios when starting out with game-day exercises. Its first choice was a basic “kill -9” on the primary node of a Redis cluster, which unexpectedly resulted in data loss. Here are the lessons learned.
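The mechanics of such a drill are deliberately simple. Here's a minimal, POSIX-only sketch—using a throwaway `sleep` process rather than a real Redis primary—of sending the uncatchable SIGKILL and confirming how the process died:

```python
import signal
import subprocess

def kill_nine_drill(cmd):
    """Run a process, send it SIGKILL (the classic 'kill -9'), and report
    how it died--the simplest possible game-day fault. SIGKILL cannot be
    caught, so the process gets no chance to flush or clean up, just as
    in a hard crash."""
    proc = subprocess.Popen(cmd)
    proc.send_signal(signal.SIGKILL)
    proc.wait()
    # A negative return code -N means the process was killed by signal N.
    return proc.returncode
```

Against a real Redis primary you would target the `redis-server` PID instead, then watch whether a replica is promoted and whether any acknowledged writes were lost—the surprise Stripe's team ran into.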
Our first engineering game day
If you're new to game days and not ready to inflict potential pain on your high-value customers, consider this startup’s approach: It ran a game day in a staging environment instead of production. This also works well for intentionally exceeding the tolerance limits of your system, to train the team on incident response. (Full disclosure: I am the author of Quid’s game-day blog post.)
This way to more chaos
If you still can't get enough of chaos engineering and testing in production, you'll find additional curated resource lists on GitHub, each with a slightly different collection. One includes more articles, tools, books, conferences, and blogs on chaos engineering; another focuses on testing distributed systems, covering chaos engineering, game days, and more.
Those are my picks for the best resources on all things chaos engineering. If you recommend other resources on chaos engineering or TiP, let me know by posting them in the comments below.
Keep learning
Take a deep dive into the state of quality with TechBeacon's Guide. Plus: Download the free World Quality Report 2022-23.
Put performance engineering into practice with these top 10 performance engineering techniques that work.
Find the tools you need with TechBeacon's Buyer's Guide for Selecting Software Test Automation Tools.
Discover best practices for reducing software defects with TechBeacon's Guide.
Take your testing career to the next level. TechBeacon's Careers Topic Center provides expert advice to prepare you for your next move.