Why Netflix Chose Chaos
The concept was formalized in 2010 in response to serious downtime that Netflix suffered before it moved from a single-source, on-premises network to a cloud-based global distribution model. In 2008, a corruption in one of its primary databases caused a three-day outage that left millions of customers without service. When a single hour of downtime can cost the average corporation $100,000 or more, even a five-minute outage is unacceptable. Downtime not only damages reputations and bottom lines, it can also leave networks more vulnerable to attacks and data leaks. In preparation for the move toward decentralized global networks, the team at Netflix created Chaos Monkey. This tool was designed to cause random system failures at unexpected times and locations to determine whether the systems they had designed could withstand extreme conditions. The logic is that if a network can survive deliberately induced failures, it can survive the ones that occur naturally. In the nearly seven years since Netflix pioneered chaos as a DevOps tool and released its monkey as open source, the practice has become a standard testing protocol at companies like IBM, Google, and Amazon.
Chaos in Action: The Principles of Chaos Engineering
A simple way of looking at chaos engineering is to think of it as a network inoculation. When people are vaccinated with a weakened or inactivated form of a virus, their immune systems develop a response that fights off future infections. In the same way, exposing a network to small, controlled failures prepares it to withstand real ones. Chaos engineering can also work hand-in-hand with cybersecurity advancements that use machine learning to anticipate, recalibrate for, and counteract internal and external threats.
How Does Chaos Engineering Work?
One goal of chaos engineering is to overcome the biases of those who are new to distributed networking by directly challenging the well-known fallacies of distributed computing. These are that:
- Networks are reliable and secure
- There's zero latency
- Bandwidth is infinite
- Topology is unchanging
- There's only one administrator
- Transport costs nothing
- Networks are homogeneous
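One way to challenge the first two fallacies in code is to wrap a remote call in a layer that randomly delays or drops it. The following is a minimal sketch under stated assumptions, not how Chaos Monkey itself works: `FaultInjector`, `call_with_retries`, and `fetch_profile` are hypothetical names invented for this illustration, and the failure rate and delay values are arbitrary.

```python
import random
import time


class FaultInjector:
    """Hypothetical helper: randomly delays or drops calls made through it."""

    def __init__(self, failure_rate=0.2, max_delay_s=0.05, seed=None):
        self.failure_rate = failure_rate  # fraction of calls to drop
        self.max_delay_s = max_delay_s    # upper bound on injected latency
        self._rng = random.Random(seed)

    def call(self, func, *args, **kwargs):
        # Challenge "there's zero latency": wait a random interval first.
        time.sleep(self._rng.uniform(0, self.max_delay_s))
        # Challenge "networks are reliable": drop a share of the calls.
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected fault: connection dropped")
        return func(*args, **kwargs)


def call_with_retries(injector, func, attempts=3):
    """The resilience behavior under test: retry when a connection drops."""
    last_error = None
    for _ in range(attempts):
        try:
            return injector.call(func)
        except ConnectionError as exc:
            last_error = exc
    raise last_error


def fetch_profile():
    # Stand-in for a real remote call (assumed for this example).
    return {"user": "demo", "plan": "standard"}


if __name__ == "__main__":
    injector = FaultInjector(failure_rate=0.3, seed=7)
    try:
        print(call_with_retries(injector, fetch_profile))
    except ConnectionError as exc:
        print("all retries failed:", exc)
```

Code that only works when the injector is switched off is relying on one of the assumptions listed above.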
In practice, a chaos experiment follows four steps:
- Define a "steady state" of measurable outcomes that indicate normal system performance.
- Hypothesize that this steady state will continue in both control and challenge environments.
- Introduce variables that mimic real-world issues like server crashes, malware injections, dropped network connections, and hardware failures.
- Seek to disprove the original hypothesis by looking for differences in network behavior between the control and challenge groups.
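The experiment steps above can be sketched as a small simulation. This is an illustrative toy under stated assumptions, not a production chaos tool: the service model (independent replicas that each fail 5% of requests), the function names, and the 1% tolerance are all invented for the example.

```python
import random


def handle_request(rng, replicas_up):
    """Toy service model (assumed): a request succeeds if any live replica answers."""
    # Each live replica independently fails 5% of the time.
    return any(rng.random() > 0.05 for _ in range(replicas_up))


def success_rate(rng, replicas_up, requests=2000):
    """Steady-state metric: fraction of requests served successfully."""
    served = sum(handle_request(rng, replicas_up) for _ in range(requests))
    return served / requests


def run_experiment(seed=42, replicas=3, tolerance=0.01):
    rng = random.Random(seed)
    # Step 1: the steady state is the success rate with every replica up.
    # Step 2: hypothesis: the success rate stays within `tolerance`
    #         in both the control and the challenge group.
    control = success_rate(rng, replicas)
    # Step 3: introduce a real-world variable: crash one replica.
    challenge = success_rate(rng, replicas - 1)
    # Step 4: try to disprove the hypothesis by comparing the groups.
    deviation = abs(control - challenge)
    return {
        "control": control,
        "challenge": challenge,
        "hypothesis_holds": deviation <= tolerance,
    }


if __name__ == "__main__":
    print(run_experiment(replicas=3))  # redundant service
    print(run_experiment(replicas=1))  # single point of failure
```

With a single replica, crashing it takes the challenge group's success rate to zero, so the hypothesis is disproved and the experiment has exposed a single point of failure.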
Advantages of Chaos Engineering for Enterprises
Implementing tests under chaotic conditions offers benefits beyond lab analysis. Technicians gain deeper insight into systemic vulnerabilities, which leads to fewer adverse incidents and shortens time to market (TTM). Businesses can proactively mitigate revenue losses, reduce downtime, and build more meaningful IT and engineering training programs. Most importantly, it helps developers, engineers, and businesses develop and deliver more reliable services, which increases customer satisfaction by keeping those services available. If you want to protect your networked systems, it pays to incorporate chaos engineering standards into an overall performance and network security plan.
Final Thoughts
The more complex and distributed our networks become, the louder the call for software developers and engineers to devise meaningful testing protocols under a variety of conditions. By incorporating chaos engineering, we can better prepare for the unexpected without disrupting vital system functions. This improves overall performance and strengthens system security in challenging circumstances and environments.
