In the fast-paced world we’re living in, speed is the name of the game, whether you are running a 100m sprint at the Olympics or shipping software features to your customers. As an organization, you gain a competitive edge when you’re able to ship products and features faster to your customers.
Together, DevOps, microservices, and service meshes have enabled us to ship features faster. While they have sped up development and improved software quality, they bring their own set of challenges. Integrating all of them adds complexity, which can lead to service failures, latency, and network failures that significantly degrade the user experience.
This is exactly where chaos principles are helpful. By introducing controlled chaos into our systems, we can proactively identify weaknesses, strengthen our service mesh, and ultimately strike a balance between rapid software delivery and reliability.
In this post, we’ll look at LitmusChaos and how we can use it to run chaos experiments on a service mesh like Istio to test its reliability. This blog post is based on the talk that I gave at Cloud Native Rejekts in Chicago in November 2023.
Principles of Chaos Engineering
Chaos engineering is not about causing random havoc within your systems; instead, it’s a disciplined approach to understanding how your systems behave under stress and to ensuring their reliability and resilience.
It involves introducing controlled disruptions or faults into a system to observe how the system reacts. If the system fails to survive the artificial disruption, we can trace the failure back to the underlying vulnerabilities and weaknesses. This enables organizations to detect and address potential issues before their product faces real-world conditions, making their applications more robust.
The major principles of chaos engineering are as follows:
- Hypothesize system behavior under failure scenarios
- Define steady state as the baseline for comparison
- Simulate real-world conditions and failure scenarios
- Test in production-like environments
The intricate nature of service mesh deployments demands rigorous testing to ensure that they can withstand unexpected failures. Chaos engineering provides a structured way to introduce controlled disruptions into service meshes, uncovering vulnerabilities and strengthening them against unforeseen issues.
Based on these principles, we’ll design chaos experiments using LitmusChaos to test the reliability of your service mesh.
Testing reliability of Istio using LitmusChaos
For this post, we will explore the reliability of Istio, a popular service mesh, using LitmusChaos. We will set up Istio with the default Bookinfo application, enabling Kiali for monitoring. Then, leveraging LitmusChaos, we will design and execute chaos experiments to simulate a failure scenario.
Pre-requisites
We followed all the required steps from the links above to set up a cluster configured with the Bookinfo application, the Istio service mesh, and the Kiali dashboard, along with LitmusChaos.
The following command prints the URL of the Bookinfo application.
$ echo "http://$GATEWAY_URL/productpage"
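The command above assumes that GATEWAY_URL is already exported in your shell. Here is a minimal sketch of one way to derive it, based on the ingress gateway lookup from the Istio getting-started guide. It assumes the gateway service is the default istio-ingressgateway in the istio-system namespace, that it exposes a port named http2, and that it has an external IP (on minikube, this usually means running minikube tunnel in another terminal). The function name resolve_gateway_url is hypothetical.

```shell
# Hypothetical helper: prints "<host>:<port>" for the Istio ingress gateway.
# Assumes the default istio-ingressgateway service in the istio-system namespace.
resolve_gateway_url() {
  host=$(kubectl -n istio-system get service istio-ingressgateway \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  port=$(kubectl -n istio-system get service istio-ingressgateway \
    -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
  # Fail loudly if either value is missing (for example, no external IP yet).
  if [ -z "$host" ] || [ -z "$port" ]; then
    echo "could not resolve ingress host/port" >&2
    return 1
  fi
  echo "${host}:${port}"
}

# Usage: export GATEWAY_URL="$(resolve_gateway_url)"
```

If the function reports that it could not resolve the host or port, the gateway most likely has no external IP yet, so check that before moving on.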
Navigate to the URL returned and you’ll see the product page of the Bookinfo application.
Similarly, you can use the following command to access the Kiali dashboard.
$ istioctl dashboard kiali
Navigate to the address returned by the above command.
By default, you will see the details of the default Bookinfo application installed on the cluster along with other services and pods that are installed as part of Istio.
You can also run the following command in a new terminal to generate some load that will be directed to the product page.
for i in $(seq 1 100); do curl -s -o /dev/null "http://$GATEWAY_URL/productpage"; done
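The loop above discards every response, which is fine for generating traffic but tells us nothing about the steady state. As a rough sketch, the variant below records the HTTP status code of each request and tallies them, giving a baseline to compare against once chaos is injected. The helper names probe and summarize_status_codes are hypothetical, and the request count of 100 simply mirrors the loop above.

```shell
# Hypothetical helper: reads one HTTP status code per line on stdin and
# prints "total=<n> ok=<n> errors_5xx=<n>".
summarize_status_codes() {
  awk '
    { total++ }
    /^2[0-9][0-9]$/ { ok++ }
    /^5[0-9][0-9]$/ { err++ }
    END { printf "total=%d ok=%d errors_5xx=%d\n", total, ok, err }
  '
}

# Hypothetical helper: sends N requests to a URL and prints each status code.
probe() {
  url=$1
  count=${2:-100}
  i=0
  while [ "$i" -lt "$count" ]; do
    # -w prints only the status code; curl reports 000 when the connection fails.
    curl -s -o /dev/null -w '%{http_code}\n' "$url"
    i=$((i + 1))
  done
}

# Usage: probe "http://$GATEWAY_URL/productpage" 100 | summarize_status_codes
```

On a healthy cluster we would expect the ok count to match the total; saving that line gives us a concrete steady state for the hypothesis.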
You can navigate to the graph section of the Kiali dashboard to find the visualization that displays the flow of requests through the system.
At this point, we have Istio deployed along with the default Bookinfo application and the Kiali dashboard.
We can use the minikube service list command to access the LitmusChaos dashboard using the URL provided for chaos-litmus-frontend-service.
It will ask for credentials to log in for the first time. The default credentials are:
Username: admin
Password: litmus
After you’ve logged in and changed the default credentials, you’ll be able to see the following dashboard.
You can create chaos experiments using the CLI as well, but we’ve used the dashboard to keep things simple.
Creating Chaos Scenario
Once Chaos Center is installed and running, the first step is to create a chaos scenario. We use the UI here. LitmusChaos provides multiple ways to create a scenario:
- Create from a predefined template
- Create by cloning an existing scenario
- Create using experiments from ChaosHub
- Import a scenario using YAML
We’ll be creating a chaos scenario using an experiment from ChaosHub. ChaosHub is a collection of predefined experiments that you can use and modify according to your needs. In this case, we use the generic/container-kill experiment.
On the next screen, we choose the target on which this scenario will execute. Here, we provide the label of the productpage pod whose Istio sidecar we want to target.
After this, validate the settings and details and execute the scenario.
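For readers who prefer manifests over the dashboard, the target selection described above corresponds roughly to a ChaosEngine resource. The sketch below writes one to a file. It is not the manifest generated in this walkthrough: the engine name, namespace, service account, and timing values are all assumptions, and depending on your cluster's container runtime, additional tunables may be needed. The helper name write_container_kill_engine is hypothetical.

```shell
# Hypothetical helper: writes a ChaosEngine manifest for the container-kill
# experiment to the given path. All values below are illustrative assumptions.
write_container_kill_engine() {
  out=${1:-chaosengine.yaml}
  cat > "$out" <<'EOF'
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: productpage-sidecar-kill     # hypothetical name
  namespace: default                 # assumes Bookinfo runs in "default"
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=productpage        # label on the Bookinfo productpage pod
    appkind: deployment
  chaosServiceAccount: litmus-admin  # assumption: depends on your install
  experiments:
    - name: container-kill
      spec:
        components:
          env:
            - name: TARGET_CONTAINER
              value: istio-proxy     # the Istio sidecar container
            - name: TOTAL_CHAOS_DURATION
              value: "60"            # seconds, illustrative
            - name: CHAOS_INTERVAL
              value: "10"            # seconds between kills, illustrative
EOF
}

# Usage: write_container_kill_engine engine.yaml && kubectl apply -f engine.yaml
```

The important detail is TARGET_CONTAINER: the label selects the productpage pod, while this variable narrows the kill to the istio-proxy sidecar rather than the application container.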
Executing Chaos Scenario
To execute the scenario, we’ll first send some traffic to our application. To do that, run the following command in a new terminal window.
for i in $(seq 1 100); do curl -s -o /dev/null "http://$GATEWAY_URL/productpage"; done
This will create some load and send it to our productpage endpoint. We can view the same on the Kiali dashboard as well.
You can see the flow of requests to every service/endpoint in the application along with the status on the right.
On the LitmusChaos dashboard, start the scenario and observe. You’ll see the flow turn red and the number of 5XX errors increase in the Kiali dashboard. When you access the endpoint in the browser, you’ll see a “no healthy upstream found” error.
This shows that our container kill experiment was executed successfully, and we were able to validate its effect on the Kiali dashboard as well as in the application.
Similarly, you can create a chaos scenario with multiple experiments that execute at the same time to test the resiliency of your application.
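The Kiali graph gives a visual verdict, but the same observation can be checked from the terminal. The sketch below shows two hypothetical helpers: sidecar_restarts reads the restart count of the istio-proxy container to confirm that the kill actually happened, and within_error_budget turns the hypothesis into a pass or fail check on the share of failed requests. The default namespace and the 5% threshold in the usage line are assumptions for illustration.

```shell
# Hypothetical helper: prints the restart count of the istio-proxy container
# for each productpage pod. Assumes Bookinfo runs in the given namespace.
sidecar_restarts() {
  kubectl get pods -n "${1:-default}" -l app=productpage \
    -o jsonpath='{.items[*].status.containerStatuses[?(@.name=="istio-proxy")].restartCount}'
}

# Hypothetical helper: succeeds when errors/total stays within max_pct percent.
# Returns 2 when there are no samples, since no verdict is possible.
within_error_budget() {
  total=$1
  errors=$2
  max_pct=$3
  [ "$total" -gt 0 ] || return 2
  # Integer form of: errors / total <= max_pct / 100
  [ $((errors * 100)) -le $((max_pct * total)) ]
}

# Usage (100 requests, 3 failures, 5% budget):
#   within_error_budget 100 3 5 && echo "hypothesis holds" || echo "hypothesis violated"
```

A restart count that rises during the run confirms the fault was injected; an error share above the budget tells us the mesh did not absorb it.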
Next steps
What happens after a chaos scenario has been completed? You need to take corrective measures and fix the issues that these experiments uncovered. Chaos scenarios aren’t meant for one-time execution; they should be carried out at regular intervals to ensure the resiliency and reliability of your application.
In this blog post, we saw how you can test the resiliency of your service mesh using a chaos engineering tool like LitmusChaos. You can test virtually any application and create any experiment you need to test it. That’s the beauty of chaos engineering.