Day 19: Chaos Engineering – Building Resilient Systems

Welcome to Day 19 of the Zero to Platform Engineer in 30 Days challenge! 🚀 Today, we’ll explore Chaos Engineering, its origins with Netflix’s Simian Army, and modern tools like LitmusChaos, Gremlin, and LambdaChaos that help test system resilience.

What Is Chaos Engineering?

Chaos Engineering is the practice of intentionally injecting failures to:

Test system resilience under real-world failure conditions.
Identify weaknesses before they impact users.
Ensure high availability and reliability in production.

🎯 Key Principles:

Define a steady state for your system.
Introduce controlled failures (CPU stress, pod deletion, network latency).
Observe the impact and automate failure detection.
Improve system self-healing capabilities.

💡 Instead of waiting for an outage, Chaos Engineering helps teams proactively prevent failures!
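
The four principles above can be sketched as a small experiment loop. This is a toy simulation, not a real chaos tool: ToyService, success_ratio, run_experiment, and the 99% success threshold are all hypothetical names and values chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyService:
    """In-memory stand-in for a real service (hypothetical, for illustration)."""
    failure_rate: float = 0.0  # probability that a single request fails

    def handle_request(self, rng: random.Random) -> bool:
        # True means the request succeeded.
        return rng.random() >= self.failure_rate


def success_ratio(service: ToyService, rng: random.Random, requests: int = 1000) -> float:
    """Fraction of successful requests; this is our steady-state metric."""
    ok = sum(service.handle_request(rng) for _ in range(requests))
    return ok / requests


def run_experiment(service: ToyService, injected_failure_rate: float,
                   slo: float = 0.99, seed: int = 42) -> dict:
    rng = random.Random(seed)

    # 1. Define the steady state and confirm the system is in it.
    baseline = success_ratio(service, rng)
    if baseline < slo:
        raise RuntimeError("System is not in steady state; do not inject failures.")

    # 2. Introduce a controlled failure.
    original_rate = service.failure_rate
    service.failure_rate = injected_failure_rate
    try:
        # 3. Observe the impact while the fault is active.
        during = success_ratio(service, rng)
    finally:
        # Always roll the fault back, even if observation raises.
        service.failure_rate = original_rate

    # 4. Check that the system returns to steady state afterwards.
    after = success_ratio(service, rng)
    return {
        "baseline": baseline,
        "during": during,
        "after": after,
        "hypothesis_held": during >= slo,
        "recovered": after >= slo,
    }


if __name__ == "__main__":
    print(run_experiment(ToyService(), injected_failure_rate=0.5))
```

In a real experiment the steady-state metric would come from your monitoring system, and the rollback step would be whatever stops the fault (deleting a chaos resource, halting an attack, and so on).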

The Origins of Chaos Engineering – Netflix’s Simian Army

Netflix pioneered the practice to make sure its services could survive unexpected infrastructure failures. Its Simian Army introduced controlled chaos into production environments.

Simian Army Components:

🐵 Chaos Monkey → Randomly terminates instances to test auto-scaling & failover.
🐢 Latency Monkey → Injects network delays to simulate slow responses.
🦍 Chaos Gorilla → Simulates the outage of an entire AWS Availability Zone to test failover (region-level failures were later covered by Chaos Kong).
🛡️ Security Monkey → Identifies security misconfigurations.
🧹 Janitor Monkey → Removes unused cloud resources to optimize costs.

📌 Netflix open-sourced Chaos Monkey, allowing other companies to adopt resilience testing!
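
To make the Chaos Monkey idea concrete, here is a toy model of it. It is not Netflix's implementation: InstanceGroup, reconcile, and chaos_monkey are hypothetical names, and the "auto-scaling group" is just a Python list that gets topped back up to its desired size.

```python
import random
from dataclasses import dataclass, field


@dataclass
class InstanceGroup:
    """Toy auto-scaling group (hypothetical; no cloud API involved)."""
    desired: int
    instances: list = field(default_factory=list)
    _next_id: int = 0

    def reconcile(self) -> list:
        """Launch replacements until the running count matches `desired`."""
        launched = []
        while len(self.instances) < self.desired:
            instance_id = f"i-{self._next_id:04d}"
            self._next_id += 1
            self.instances.append(instance_id)
            launched.append(instance_id)
        return launched


def chaos_monkey(group: InstanceGroup, rng: random.Random):
    """Terminate one randomly chosen instance and return its id (or None)."""
    if not group.instances:
        return None
    victim = rng.choice(group.instances)
    group.instances.remove(victim)
    return victim


if __name__ == "__main__":
    group = InstanceGroup(desired=3)
    group.reconcile()                      # initial launch
    victim = chaos_monkey(group, random.Random())
    print("terminated:", victim, "remaining:", group.instances)
    print("replacements:", group.reconcile())
```

The point of the exercise is the second reconcile call: if the group does not return to its desired size without human help, the experiment has found a weakness.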

Modern Chaos Engineering Tools

While Netflix’s Simian Army was groundbreaking, modern tools like LitmusChaos, Gremlin, and LambdaChaos provide more control and integrations.

LitmusChaos (Open Source)

Kubernetes-native Chaos Engineering tool.
Supports experiments like pod deletion, node failure, and network loss.
Integrates with Prometheus, Grafana, and ArgoCD.

📌 Who uses it?

LitmusChaos is an open-source project hosted by the CNCF, which makes it a common choice for Chaos Engineering on Kubernetes.

Installation:

helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install litmus litmuschaos/litmus --namespace litmus --create-namespace
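
Once Litmus is installed, experiments are described as Kubernetes resources. The sketch below builds a pod-delete ChaosEngine manifest and prints it as JSON. Treat it as a starting point under assumptions: the field names follow the ChaosEngine v1alpha1 schema as I understand it, so check them against the Litmus docs for your version. The function name pod_delete_engine, the resource name pod-delete-chaos, the label app=nginx, and the service account pod-delete-sa are hypothetical.

```python
import json


def pod_delete_engine(app_label: str, namespace: str = "default",
                      duration_s: int = 30, interval_s: int = 10,
                      service_account: str = "pod-delete-sa") -> dict:
    """Build a ChaosEngine manifest for the pod-delete experiment.

    Assumes the pod-delete experiment and the service account already
    exist in the target namespace.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "pod-delete-chaos", "namespace": namespace},
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": service_account,
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [
                                # Total time the chaos runs, in seconds.
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                                # Pause between successive pod deletions, in seconds.
                                {"name": "CHAOS_INTERVAL", "value": str(interval_s)},
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(pod_delete_engine("app=nginx"), indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest can be piped to kubectl apply -f - on a test cluster.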

Gremlin (Commercial)

Enterprise-grade Chaos Engineering tool.
Supports CPU spikes, memory exhaustion, latency injection, and process killing.
Provides detailed attack simulations and observability integrations.

📌 Who uses it?

Companies like LinkedIn, Expedia, and Twilio use Gremlin to run controlled chaos in production.

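Example: Simulate high CPU usage in a container named nginx for 60 seconds (syntax can differ between Gremlin CLI versions, so confirm with gremlin help):

gremlin attack-container nginx cpu --length 60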
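
If you want a feel for what a CPU attack does before pointing a tool at real infrastructure, the idea can be sketched in a few lines. This is not how Gremlin implements it; burn_cpu is a hypothetical helper that keeps a single core busy for a bounded time and then stops on its own.

```python
import time


def burn_cpu(duration_s: float) -> int:
    """Busy-loop on one core for roughly `duration_s` seconds.

    Returns the number of loop iterations, only to show that work was done.
    """
    deadline = time.perf_counter() + duration_s
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1  # pointless work that keeps the core busy
    return iterations


if __name__ == "__main__":
    print("iterations:", burn_cpu(2.0))
```

The hard time limit is the important design choice: every injected fault should have a built-in end so it cannot outlive the experiment.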

Activity for Today

Research LitmusChaos and Gremlin.
Compare and contrast the two tools.
Think of a use case for each tool.

What’s Next?

On Day 20, we’ll recap the best practices for Observability and build a basic observability stack using logs, metrics, and traces.

👉 Check it out here: Zero to Platform Engineer Repository

Feel free to clone the repo, experiment with the code, and even contribute if you’d like! 🚀

Follow the Series!

🎉 Don’t miss a single step in your journey to becoming a Platform Engineer! 🎉

👉 Bookmark this blog and check back every day for new posts in the series. 📣 Share your progress on social media with the hashtag #ZeroToPlatformEngineer to connect with other readers!
