Nobody wants their production system to go down.
Chaos Engineering is one of the best ways to build a system you can be confident in, by experimenting on the system using a set of core principles and tools.
Agenda of this post
1. What is Chaos Engineering
2. Principles of Chaos Engineering
3. Tools used in an AWS-type environment
4. Open-source tools for traditional systems
In short, Chaos Engineering is the practice of running experiments in a production-like environment in order to build confidence in the system's ability to withstand unexpected, turbulent conditions.
The principles of Chaos Engineering come from engineers at Netflix and are based on their experience with the applications they built in the AWS environment.
Chaos Engineering is also a discipline for addressing system availability, and it affects people and culture as well as applications and platforms:
–> Nothing should break in production, but if something does, how will people respond to it?
–> Applications and platforms should be capable enough to work hand in hand through the situation.
Using these experiments we can identify the loopholes in the system and fix them, which gives us high availability in a distributed system.
Principles of Chaos Engineering
1. Build a hypothesis around the system
Note down the steady-state behaviour of the system, then form hypotheses such as: if we do this, what will happen? If that is done, what will happen to my systems?
2. Vary real-world events
ex: unplug the system from the internet so that the server becomes unreachable, increase traffic to the system, or add other unexpected events that disturb the system and simulate an application failure.
3. Run experiments in the production system
Unless you run the experiments in production, you cannot truly practise chaos engineering.
4. Automate experiments to run continuously
If we automate the experiments to run on a schedule, for example every month or every six months, we keep finding hidden faults in the application.
5. Minimise the blast radius
Identify the potential impact of a failure and reduce it. This lets us learn from the failure while minimising the impact on live traffic.
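To make the principles a bit more concrete, here is a minimal Python sketch of one experiment. It is only an illustration, not how any particular tool works: `Experiment`, `run_experiment`, `ToyCluster` and `kill_one_server` are hypothetical names invented for this post. The steady-state check is the hypothesis, the injected fault is the real-world event, and the rollback that always runs is one simple way to limit the blast radius.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    """One chaos experiment (hypothetical structure, for illustration only)."""
    name: str
    steady_state: Callable[[], bool]   # hypothesis: returns True while the system is healthy
    inject_fault: Callable[[], None]   # the real-world event we simulate
    rollback: Callable[[], None]       # undo the fault to limit the blast radius


def run_experiment(exp: Experiment) -> str:
    """Return "skipped", "passed" or "failed" for a single experiment run."""
    if not exp.steady_state():
        # Never start chaos on a system that is already unhealthy.
        return "skipped"
    exp.inject_fault()
    try:
        # The hypothesis holds if the system is still in its steady state.
        return "passed" if exp.steady_state() else "failed"
    finally:
        exp.rollback()  # always restore the system, whatever the outcome


class ToyCluster:
    """A stand-in for a real system: a list of servers that are up or down."""

    def __init__(self, size: int) -> None:
        self.healthy: List[bool] = [True] * size

    def can_serve(self) -> bool:
        # Steady state for this toy: at least one server can take traffic.
        return any(self.healthy)

    def kill_one(self) -> None:
        # Take down the first server that is still up.
        for index, is_up in enumerate(self.healthy):
            if is_up:
                self.healthy[index] = False
                return

    def restore(self) -> None:
        self.healthy = [True] * len(self.healthy)


def kill_one_server(cluster: ToyCluster) -> Experiment:
    return Experiment("kill one server", cluster.can_serve,
                      cluster.kill_one, cluster.restore)


if __name__ == "__main__":
    for size in (3, 1):
        print(size, run_experiment(kill_one_server(ToyCluster(size))))
```

With three servers the hypothesis holds and the run reports "passed"; with a single server it reports "failed", which is exactly the kind of weakness the experiment is meant to surface. Running `run_experiment` from a scheduler would cover principle 4.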
Let's take AWS as the platform for trying out Chaos Engineering.
Let us say we have a critical application that is built and hosted in a single AWS region. If one service goes down, other services in the same region can serve the live traffic. If the entire region goes down, the system is completely unavailable to the public.
If we conduct enough experiments, we will be able to identify how the system behaves and plan accordingly.
In this case, let's make the same application available in multiple AWS regions, so that even if one or two regions go down completely, the site is still up and running fine.
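The single-region versus multi-region reasoning above can be checked with a tiny simulation. This is a sketch under simple assumptions: `route_request` is a hypothetical helper, the region names are only examples, and real routing (DNS failover, load balancers, health checks) is far more involved.

```python
from typing import Optional, Set


def route_request(deployed: Set[str], down: Set[str]) -> Optional[str]:
    """Return a healthy region that can serve the request, or None if all are down."""
    for region in sorted(deployed):  # sorted only to make the choice deterministic
        if region not in down:
            return region
    return None


# Example deployments (region names are illustrative).
SINGLE_REGION = {"us-east-1"}
MULTI_REGION = {"us-east-1", "eu-west-1", "ap-south-1"}

if __name__ == "__main__":
    outage = {"us-east-1"}
    print("single region:", route_request(SINGLE_REGION, outage))
    print("multi region: ", route_request(MULTI_REGION, outage))
    # Two of the three regions are down, one is still left to serve traffic.
    print("two regions down:", route_request(MULTI_REGION, {"us-east-1", "eu-west-1"}))
```

The single-region deployment has nowhere to send the request once its region is down, while the three-region deployment keeps serving until all three regions fail.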
Let's see which open-source tools are available, and how Netflix and Facebook run these experiments. Netflix has something called the Simian Army. Basically, it is an army of tools that they run regularly, and each tool is named after a kind of monkey.
Think of every thread that tries to bring down the Bahubali statue in the picture as a kind of chaos tool. Let's look at some of these tools.
The first tool built by Netflix is Chaos Monkey. It randomly terminates instances (virtual machines or containers) that run a service in production, so that we can see how the system behaves. Based on the success of Chaos Monkey, Netflix built further tools.
The next one is Chaos Gorilla, which takes down an entire Availability Zone (ex: a zone in the East region).
The next one is Chaos Kong, which takes down a complete region.
Latency Monkey introduces artificial delays into requests so that we can see what happens.
Doctor Monkey performs health checks by monitoring performance metrics such as CPU load to detect unhealthy instances, for root-cause analysis and eventual fixing or retirement of the instance.
Janitor Monkey identifies and disposes of unused resources to avoid waste and clutter.
Conformity Monkey is a tool that determines whether an instance is nonconforming by testing it against a set of rules. If any of the rules determines that the instance is not conforming, the monkey sends an email notification to the owner of the instance.
Security Monkey is derived from Conformity Monkey. It is a tool that searches for and disables instances that have known vulnerabilities or improper configurations.
10-18 Monkey is a tool that detects problems with localization and internationalization (known by the abbreviations “l10n” and “i18n”) for software serving customers across different geographic regions.
Byte-Monkey is a small Java library for testing failure scenarios in JVM applications. It works by instrumenting application code on the fly to deliberately introduce faults such as exceptions and latency.
ChaosMachine is a tool that does chaos engineering at the application level in the JVM. It concentrates on analysing the error-handling capability of each try-catch block involved in the application by injecting exceptions.
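The idea behind Latency Monkey and ChaosMachine, adding artificial delay or throwing exceptions to see whether the calling code copes, can be sketched in a few lines of Python. This is not how those tools are implemented (they target AWS and the JVM); `chaos`, `InjectedFault`, `fetch_price` and `price_with_fallback` are hypothetical names used only for this illustration.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised only by the chaos wrapper, never by the real code."""


def chaos(latency_s: float = 0.0, error_rate: float = 0.0, rng=None):
    """Decorator that adds a fixed delay and fails a fraction of the calls.

    latency_s  - artificial delay added before every call, in seconds
    error_rate - probability between 0.0 and 1.0 of raising InjectedFault
    rng        - optional random.Random, so that tests can be reproducible
    """
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                time.sleep(latency_s)          # the "Latency Monkey" part
            if rng.random() < error_rate:      # the exception-injection part
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(error_rate=1.0)  # fail every call so the error handling is always exercised
def fetch_price(item: str) -> float:
    return 9.99


def price_with_fallback(item: str, default: float = 0.0) -> float:
    """The error handling under test: fall back to a default on failure."""
    try:
        return fetch_price(item)
    except InjectedFault:
        return default


if __name__ == "__main__":
    print(price_with_fallback("book"))
```

Because `fetch_price` is wrapped with an error rate of 1.0, every call fails and `price_with_fallback` returns its default. If the `except` branch were missing, the experiment would show that immediately.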
Facebook has Facebook Storm (also known as Project Storm), which Facebook uses to find out what happens to the service when there is an issue with a data centre and it goes down.
AWS has AWS GameDays. On a GameDay, servers are deliberately killed and the system's response is tested as part of chaos engineering.
Days of Chaos
Inspired by AWS GameDays to test the resilience of its applications, teams from Voyages-sncf.com participated in a Day of Chaos. Every 30 minutes, operators simulated failures in pre-production. Teams earned points based on detections, diagnoses, and resolutions. This type of gamified event helps to introduce development teams to the concept of resilience.
The concept was presented at the 2017 DevOps REX conference and is described on the site http://days-of-chaos.com, which also collects other such experiments.
ChaoSlingr is the first open-source application of Chaos Engineering to cyber security. ChaoSlingr is focused primarily on performing security experimentation on AWS infrastructure to proactively discover system security weaknesses in complex distributed system environments. It was published on GitHub in September 2017.
The Chaos Toolkit was born from the desire to simplify access to the discipline of chaos engineering and demonstrate that the experimentation approach can be done at different levels: infrastructure, platform but also application. The Chaos Toolkit is an open-source tool, licensed under Apache 2, published in October 2017.
Mangle enables you to run chaos engineering experiments seamlessly against applications and infrastructure components to assess resiliency and fault tolerance. It is designed to introduce faults with very little pre-configuration and can support any infrastructure that you might have including K8S, Docker, vCenter or any Remote Machine with ssh enabled. With its powerful plugin model, you can define a custom fault of your choice based on a template and run it without building your code from scratch.
Chaos Mesh is an open-source cloud-native Chaos Engineering platform that orchestrates chaos experiments in Kubernetes environments. It supports comprehensive types of failure simulation, including Pod failures, container failures, network failures, file system failures, system time failures, and kernel failures.
Netflix is the creator of the chaos engineering concept, which is why they have tons of tools. Everyone can plan chaos experiments in their own organisation to build confidence in their system, so that it can handle any kind of situation in production.
Some generic experiments are shown below:
- Push the CPU of one of the servers to 100% and see what happens to public requests
- Make the log file system fill up quickly and note the behaviour
- Inject a delay into the time a request takes to reach the server and note the behaviour
- There are many more experiments like these, and many more tools are available for them.
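As an example of the second experiment in the list, here is a small Python sketch that fills a directory with a filler file of a fixed, capped size. It is only a sketch: `fill_disk` is a hypothetical helper, and it writes into a temporary scratch directory rather than a real log volume, which keeps the blast radius small. In a real experiment you would point it at a test volume and watch how the application behaves as free space runs out.

```python
import os
import tempfile


def fill_disk(directory: str, target_bytes: int, chunk_bytes: int = 64 * 1024) -> str:
    """Write a filler file of exactly target_bytes into directory and return its path.

    target_bytes is a hard cap, so the experiment can never take more space
    than planned.
    """
    path = os.path.join(directory, "chaos-filler.bin")
    chunk = b"x" * chunk_bytes
    written = 0
    with open(path, "wb") as handle:
        while written < target_bytes:
            size = min(chunk_bytes, target_bytes - written)
            handle.write(chunk[:size])
            written += size
    return path


if __name__ == "__main__":
    # The scratch directory and the filler file are deleted when the block exits,
    # which serves as the rollback for this experiment.
    with tempfile.TemporaryDirectory() as scratch:
        filler = fill_disk(scratch, target_bytes=1_000_000)
        print(os.path.getsize(filler), "bytes written to", filler)
```

The cap and the automatic cleanup are deliberate design choices: the experiment should be easy to stop and should leave nothing behind.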
Let’s keep exploring to build systems we can be confident in.