
What if your job was to break things repeatedly in order to make them work better? It sounds like the dream of every curious six-year-old, but it’s actually an emerging software engineering trend rooted in the transition from DevOps to DevSecOps. It’s designed to probe a system’s limits with the goal of improving security and performance under any circumstances.

The term is chaos engineering. It works on the premise that, even when everything is functioning normally, the nature of modern distributed networks means there’s a chaotic element inherent in the system that can lead to unpredictable outcomes.

Chaos engineering is a proactive form of vulnerability management that tests networks under the most extreme possible circumstances in a controlled setting. The theory is that, when you prepare for the worst, you can cope more easily with routine performance issues.

Permitting chaos to reign in a controlled environment allows engineers to use the data collected to design stronger, more resilient systems. And yes, it’s as much fun as it sounds.

Why Netflix Chose Chaos

The concept was formalized in 2010 in response to serious downtime that Netflix suffered before it moved from a single-source, on-premises network to a cloud-based global distribution model. After one of its primary databases was corrupted, the company experienced a three-day outage that left millions of customers without service.

When a single hour of downtime can cost the average corporation $100,000 or more, even a five-minute outage is unacceptable: at that rate, five minutes costs more than $8,000. Downtime not only affects reputations and bottom lines, but it also leaves your networks more vulnerable to attacks and data leaks.

In preparation for the move toward decentralized global networks, the team at Netflix created Chaos Monkey. This tool was designed to cause random system failures at unexpected times and locations in an effort to determine whether the systems they designed could withstand extreme conditions. The logic is that if the network can survive failures injected on purpose, it can surely survive the ones that occur on their own.
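
To make the idea concrete, here is a minimal sketch of random failure injection. It is not Chaos Monkey’s actual implementation; the function names, the `probability` parameter, and the `terminate` callback are all assumptions made for illustration. Each instance is chosen independently at random, so nobody can predict where the next failure will land.

```python
import random


def pick_victims(instances, probability, rng=None):
    """Return the instances chosen for termination in this round.

    Each instance is selected independently with the given probability,
    so the location of a failure is not predictable in advance.
    """
    rng = rng or random.Random()
    return [name for name in instances if rng.random() < probability]


def run_round(instances, probability, terminate, rng=None):
    """Terminate the chosen instances via a caller-supplied callback.

    `terminate` is hypothetical: in practice it would call whatever API
    your platform offers for stopping an instance.
    """
    victims = pick_victims(instances, probability, rng)
    for name in victims:
        terminate(name)
    return victims
```

In a real setup, a round like this would run on a schedule, and the rest of the system would be watched to confirm that users never notice the missing instances.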

In the nearly seven years since Netflix pioneered chaos as a software engineering practice and released its monkey as open source, the approach has become a standard testing protocol for companies like IBM, Google, and Amazon.

Chaos in Action: The Principles of Chaos Engineering

A simple way of looking at chaos engineering is to think of it as a network inoculation. When humans are injected with a weakened or inactivated form of a virus, their bodies naturally develop a response that fights off future infections. In the same way, deliberately exposing a network to controlled failures prepares it for real ones. Chaos engineering can also work hand-in-hand with cyber security advancements that utilize machine learning in an effort to anticipate, re-calibrate, and counteract internal and external threats.

How Does Chaos Engineering Work?

One goal of chaos engineering is to overcome the biases of those who are new to distributed networking by directly addressing certain fallacies.

These fallacies are the assumptions that:

  • Networks are reliable and secure
  • There’s zero latency
  • Bandwidth is infinite
  • Topology is unchanging
  • There’s only one administrator
  • Transport costs nothing
  • Networks are homogeneous
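
One way to challenge the first two fallacies in a test is to wrap a remote call so that it is sometimes slow and sometimes fails. The sketch below is illustrative only: `FlakyNetwork`, `call_with_retry`, and the `drop_rate` and `max_delay_s` parameters are hypothetical names, not part of any real chaos tool.

```python
import random
import time


class FlakyNetwork:
    """Wrap a remote call and inject latency and dropped connections.

    drop_rate and max_delay_s are hypothetical knobs chosen for this sketch.
    """

    def __init__(self, call, drop_rate=0.1, max_delay_s=0.05,
                 rng=None, sleep=time.sleep):
        self._call = call
        self._drop_rate = drop_rate
        self._max_delay_s = max_delay_s
        self._rng = rng or random.Random()
        self._sleep = sleep

    def __call__(self, *args, **kwargs):
        # Fallacy: "there's zero latency". Add a random delay first.
        self._sleep(self._rng.uniform(0, self._max_delay_s))
        # Fallacy: "networks are reliable". Drop some calls outright.
        if self._rng.random() < self._drop_rate:
            raise ConnectionError("injected network failure")
        return self._call(*args, **kwargs)


def call_with_retry(call, attempts=3):
    """A client that does not assume reliability: retry on failure.

    Expects attempts >= 1; re-raises the last error if every try fails.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return call()
        except ConnectionError as error:
            last_error = error
    raise last_error
```

Code that works against the plain call but breaks against the wrapped one has quietly been relying on one of the fallacies above.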

Experiments in chaos engineering seek to address uncertainties of scale and outcomes in global network distributions. They’re designed to discover systemic weaknesses that affect performance and security. Most of these experiments follow four steps:

  1. Define a “steady state” of measurable outcomes that indicate normal system performance.
  2. Hypothesize that this steady state will continue in both control and challenge environments.
  3. Introduce variables that mimic real world issues like server crashes, malware injections, dropped network connections, and hardware failures.
  4. Seek to disprove the original hypothesis by looking for differences in network behavior between the control and challenge groups.
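
The four steps above can be sketched as a small harness. This is a simplified illustration under stated assumptions, not a production tool: `sample_metric`, `inject_fault`, `remove_fault`, and `tolerance` are hypothetical names, and the control and challenge measurements are taken one after the other on the same system rather than on two separate groups running side by side.

```python
from statistics import mean


def run_experiment(sample_metric, inject_fault, remove_fault,
                   tolerance, samples=5):
    """Compare a steady-state metric with and without an injected fault."""
    # Steps 1 and 2: measure the steady state in the control condition.
    control = mean([sample_metric() for _ in range(samples)])

    # Step 3: introduce one real-world variable, then measure again.
    inject_fault()
    try:
        challenge = mean([sample_metric() for _ in range(samples)])
    finally:
        remove_fault()  # always restore the system, even if sampling fails

    # Step 4: try to disprove the hypothesis by comparing the two groups.
    deviation = abs(challenge - control)
    return {
        "control": control,
        "challenge": challenge,
        "deviation": deviation,
        "hypothesis_holds": deviation <= tolerance,
    }
```

A result where `hypothesis_holds` is false is the useful one: it points at a weakness that was found in a controlled setting rather than during a real outage.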

Experiments are performed in a controlled setting in order to learn more about the nature of distributed network behavior and correct problems before they become systemic failures. At the heart of a good chaos experiment beats a devious brand of creativity that can create unexpected variables. Change one significant variable at a time as you try to disprove the hypothesis.

Think outside the box.

There are many ways to introduce new and potentially destructive variables. For example, if you already use a VPN to encrypt your internet connection, you can also use its geo-location feature to mask your IP address: the software routes your connection through intermediary servers so that your traffic appears to come from a user in, say, Russia (or anywhere else). The point is to try to disrupt the steady state. Make a server crash. Kill a virtual machine. The more complex the situations you throw at the network (and the growth of artificial intelligence allows for quite complex scenarios), the more confident you’ll be in system security and performance.

To ensure that chaos engineering is applied under ideal circumstances and generates reliable data, the following principles should be pursued.

Form a hypothesis centered around steady state behavior that’s based on measurable output rather than system attributes. This will demonstrate that the system can stand up to unpredictable stress factors rather than simply confirming how it works.

Vary the real-world events to include incidents that result from hardware failure, software vulnerabilities, and events that don’t necessarily cause failures, like sudden traffic spikes or operational growth.

Experiment in production to include real traffic. This ensures authenticity and relevance during testing, making outcomes more meaningful to real-world applications than relying on traffic and stress simulations alone.

Use automation to schedule and run continuous experiments. This is built into the standard procedure for chaos engineering in order to save time and cost over manual implementation.

Minimize the blast radius to promote containment and reduce network disruption.
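
The last two principles, automation and a small blast radius, can be combined in one short sketch. Everything named here is an assumption for illustration: `limit_blast_radius`, `run_scheduled`, and the `max_fraction` cap are hypothetical, and the `experiment` callback stands in for whatever fault you choose to inject.

```python
import math
import random
import time


def limit_blast_radius(targets, max_fraction, rng=None):
    """Pick a random subset no larger than max_fraction of all targets."""
    if not 0 <= max_fraction <= 1:
        raise ValueError("max_fraction must be between 0 and 1")
    rng = rng or random.Random()
    count = math.floor(len(targets) * max_fraction)
    return rng.sample(list(targets), count)


def run_scheduled(targets, experiment, rounds, interval_s,
                  max_fraction=0.1, rng=None, sleep=time.sleep):
    """Run an experiment repeatedly, each time on a capped subset."""
    results = []
    for round_number in range(rounds):
        for target in limit_blast_radius(targets, max_fraction, rng):
            results.append((round_number, target, experiment(target)))
        if round_number < rounds - 1:
            sleep(interval_s)  # wait before the next automated round
    return results
```

Rounding the subset size down means a small fleet with a low cap may have no targets at all in a given round, which errs on the side of containment.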

Advantages of Chaos Engineering for Enterprises

Implementing tests under chaotic conditions offers benefits beyond lab analysis. Technicians can obtain deeper insight into systemic vulnerabilities, which leads to fewer adverse incidents and outcomes and improves time to market (TTM). Businesses can proactively mitigate revenue losses, reduce downtime, and initiate more meaningful IT and engineering training programs.

Most importantly, it allows developers, engineers, and businesses to support more reliable service development and delivery. This increases customer satisfaction by ensuring uninterrupted service availability.

If you want to protect your networked systems, it pays to incorporate chaos engineering standards as part of an overall performance and network security plan.

Final Thoughts

The more complex and distributed our networks become, the louder the call for software developers and engineers to devise meaningful testing protocols under a variety of conditions.

By incorporating chaos engineering, we’re able to better prepare for the unexpected without disrupting vital systemic function. This improves overall performance and enhances system security in virtually any challenging circumstance or environment.


About the Author: Sam Bocetta is a freelance journalist specializing in U.S. diplomacy and national security, with emphases on technology trends in cyberwarfare, cyberdefense, and cryptography.

Editor’s Note: The opinions expressed in this guest author article are solely those of the contributor, and do not necessarily reflect those of Tripwire, Inc.