Shifting from Chaos to Controlled Reliability Testing

Chaos engineering, the practice of injecting failure to test a system's robustness, is nothing new. In the enterprise today, though, the focus has shifted from chaos to reliability testing at scale.

“Chaos testing, chaos engineering is a misnomer,” Kolton Andrus, founder and CEO of Gremlin, told SD Times about the practice on which he founded the company. “It’s been cool and hot for a while, but most companies don’t care about chaos. They want to be reliable.”

For large enterprises, disaster recovery testing, such as data center evacuation or cloud region failure testing, is a major undertaking. Customers have spent hundreds of engineer-months putting these tests together, making them extraordinary, one-off exercises rather than routine practice. That still leaves organizations exposed to risks that only surface under real load.

The new focus is on building a framework that makes these tests repeatable and easy to run across the company with a few clicks. Andrus noted that an important aspect is safety: Gremlin integrates system health signals so that if something goes wrong, the injected changes are halted, cleaned up, or rolled back quickly, preventing real risk to customers.
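This is not Gremlin's implementation, but a minimal sketch of the kind of health-signal guardrail described above, assuming a hypothetical /health endpoint that reports an error rate and a hypothetical cleanup script: if the signal degrades during an experiment, the experiment is aborted and rolled back.

```python
import json
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # hypothetical health endpoint
ERROR_RATE_LIMIT = 0.05                      # abort if more than 5% of requests fail
CHECK_INTERVAL_S = 10
EXPERIMENT_DURATION_S = 300

def error_rate() -> float:
    """Read the current error rate from the (hypothetical) health endpoint."""
    with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
        return json.load(resp)["error_rate"]

def rollback() -> None:
    """Undo the injected failure; a placeholder script stands in for the real cleanup."""
    subprocess.run(["./revert_experiment.sh"], check=True)  # hypothetical cleanup script

def run_with_guardrail() -> None:
    deadline = time.time() + EXPERIMENT_DURATION_S
    while time.time() < deadline:
        try:
            rate = error_rate()
        except OSError:
            # If the health signal itself is unreachable, treat that as a failure and abort.
            rate = 1.0
        if rate > ERROR_RATE_LIMIT:
            print(f"Health check breached ({rate:.1%}); rolling back experiment")
            rollback()
            return
        time.sleep(CHECK_INTERVAL_S)
    print("Experiment completed within health limits")

if __name__ == "__main__":
    run_with_guardrail()
```

The point of the guardrail is that the experiment never outlives a degraded health signal; cleanup runs automatically rather than waiting for a human to notice.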

How to Test Against a Cloud Data Center Failure

A key question for any company is how to simulate a major failure, such as the loss of an AWS data center. “Ultimately, we do cause some disruption in production, because that’s what you’re testing,” Andrus explained. Gremlin’s tooling can create a real network partition around a data center or a zone in the cloud. “So if I have three locations, I can make one location a true split brain. It can see itself, it can only talk to itself.”

By performing the test at the network layer, he said, organizations keep the ability to reverse things quickly if something goes wrong. “We don’t make an API call to AWS and say ‘Shut down Dynamo, and delete these buckets,’ or ‘shut down all my EC2 instances in this environment for an hour,’ because that’s hard to roll back, and you might get stuck waiting on the AWS API when you try to roll it back.” To the same end, Andrus said, Gremlin itself was designed to be redundant from the ground up, so that if the data centers in one location fail, the application can continue to run elsewhere.
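Gremlin's own tooling is not shown here, but the network-layer approach Andrus describes can be illustrated with ordinary Linux firewall rules: drop traffic to and from the other locations' address ranges, observe the split brain, then delete the same rules to restore connectivity. The CIDR ranges are placeholders, and the script assumes root privileges on a Linux host with iptables.

```python
import subprocess

# Placeholder CIDR ranges for the *other* two locations; the local zone keeps talking to itself.
REMOTE_ZONE_CIDRS = ["10.1.0.0/16", "10.2.0.0/16"]

def _iptables(action: str, chain: str, flag: str, cidr: str) -> None:
    """Add (-A) or delete (-D) a DROP rule for one direction of traffic to a CIDR block."""
    subprocess.run(["iptables", action, chain, flag, cidr, "-j", "DROP"], check=True)

def partition() -> None:
    """Black-hole traffic to and from the other zones, simulating a zone-level outage."""
    for cidr in REMOTE_ZONE_CIDRS:
        _iptables("-A", "INPUT", "-s", cidr)
        _iptables("-A", "OUTPUT", "-d", cidr)

def heal() -> None:
    """Delete the exact rules added above; connectivity returns as soon as this runs."""
    for cidr in REMOTE_ZONE_CIDRS:
        _iptables("-D", "INPUT", "-s", cidr)
        _iptables("-D", "OUTPUT", "-d", cidr)

if __name__ == "__main__":
    partition()
    try:
        input("Zone is partitioned; press Enter to heal...")
    finally:
        heal()  # always restore connectivity, even if interrupted
```

Because the rules are additive and scoped, removing them restores traffic immediately, which is the property that makes this style of test easier to reverse than shutting resources down through the cloud provider's API.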

Although the direct revenue impact, calculated by comparing the number of orders expected during the outage window with the drop in actual orders, is the headline cost of an outage, the total impact is much greater. There are also large engineering costs: teams spend days detecting, fixing, testing, and then shipping the fix, followed by postmortem meetings and follow-up work.
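As a rough illustration of the direct-revenue calculation described above (the figures are invented): compare the orders expected during the outage window with the orders actually completed, and multiply the shortfall by the average order value.

```python
# Illustrative numbers only; replace with your own telemetry.
expected_orders = 12_000      # orders forecast for the outage window
actual_orders = 7_500         # orders actually completed during the outage
average_order_value = 48.00   # dollars per order

direct_revenue_impact = (expected_orders - actual_orders) * average_order_value
print(f"Direct revenue impact: ${direct_revenue_impact:,.0f}")  # $216,000
```

As the paragraph above notes, this figure understates the true cost, since engineering time, postmortems, and follow-up work are not captured in it.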

When a test fails, remediation is guided by reliability intelligence drawn from millions of previous tests run on Gremlin, which identifies likely causes and offers concrete, concise recommendations for fixing the problems.

The biggest risks usually lie not in the network itself but in the failures it triggers in microservices. Subtle gaps, such as operating in multiple regions while relying on a database in only one location, or failing to distribute state across locations, can cause problems like lost customer carts or dropped transactions. Much of the company's testing therefore focuses on the connective layer between services: DNS, traffic routing, and the distribution of critical data across locations.
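A hedged sketch of the kind of audit that catches the single-location dependencies described above; the service catalogue format and region names are invented. It flags any service that runs in several regions but keeps all of its state in one.

```python
from typing import Dict, List

# Hypothetical service catalogue: where each service runs and where its stateful backends live.
SERVICES: Dict[str, Dict[str, List[str]]] = {
    "checkout": {"runs_in": ["us-east-1", "us-west-2"], "state_in": ["us-east-1"]},
    "catalog":  {"runs_in": ["us-east-1", "us-west-2"], "state_in": ["us-east-1", "us-west-2"]},
    "cart":     {"runs_in": ["us-east-1", "us-west-2", "eu-west-1"], "state_in": ["us-east-1"]},
}

def single_region_state(services: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Return services deployed to multiple regions that keep all state in one region."""
    risky = []
    for name, cfg in services.items():
        if len(set(cfg["runs_in"])) > 1 and len(set(cfg["state_in"])) == 1:
            risky.append(name)
    return risky

if __name__ == "__main__":
    for name in single_region_state(SERVICES):
        print(f"{name}: multi-region compute but single-region state; at risk in a region failover")
```

A partition test of the region holding that state is what turns this paper finding into an observed failure, such as the lost carts and transactions Andrus mentions.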

Ultimately, Andrus said, it’s about “finding those risks and fixing them so that if something real happens, you’re not surprised by some behavior.”
