Customer Overview
The customer is a leading UK-based provider of intelligent digital security solutions. Their security devices are installed in facilities across the world.
All of these devices can be managed remotely through the customer's cloud solutions or via mobile devices. The customer has been in the market for 35 years and has almost 120K devices deployed in the field.
Context & Challenge
The customer had two significant problems to solve:
-
Optimizing configuration management for the servers. They were using Puppet for configuration management and wanted to move away from the configure-at-boot model, because it added latency when new nodes joined the cluster during autoscaling events.
-
They were running Mosquitto brokers to communicate with IoT devices over the MQTT protocol. Brokers were added on demand and configured as communication points for specific IoT devices. This setup had no single point of contact for the MQTT brokers and could not scale. They wanted a cluster of MQTT brokers that could be load balanced, autoscaled and replicated.
The engagement started with a technical deep-dive session to understand the current state of the system. After that, we prepared a detailed proposal describing the tools and tech stack that would be used to overcome these challenges.
Solutions Deployed
Abstracting configuration using Machine Images
We migrated configuration management from Puppet to Ansible. To remove the latency of configuring nodes at boot time, we baked all required packages into a golden AMI (Amazon Machine Image). This golden AMI was built using Packer and Ansible playbooks.
CI/CD flow using Packer, Ansible and CloudFormation
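Once golden AMIs are baked ahead of time, a new node only needs to launch from the most recent image. The following is a minimal Python sketch of that selection step, not the customer's actual pipeline. It assumes image records shaped like entries in the EC2 DescribeImages response (ImageId, Name, CreationDate). The `golden-base-` name prefix and the sample records are hypothetical and exist only for illustration.

```python
from datetime import datetime


def latest_golden_ami(images, name_prefix="golden-base-"):
    """Return the ImageId of the newest image whose Name starts with name_prefix.

    `images` is a list of dicts shaped like entries in the EC2
    DescribeImages response. The prefix is a hypothetical naming convention.
    """
    candidates = [img for img in images if img["Name"].startswith(name_prefix)]
    if not candidates:
        raise ValueError(f"no image found with prefix {name_prefix!r}")

    def created(img):
        # CreationDate is an ISO 8601 string such as "2023-01-15T10:30:00.000Z".
        return datetime.strptime(img["CreationDate"], "%Y-%m-%dT%H:%M:%S.%fZ")

    return max(candidates, key=created)["ImageId"]


# Illustrative records only; these are not real AMI IDs.
SAMPLE_IMAGES = [
    {"ImageId": "ami-aaaa1111", "Name": "golden-base-1",
     "CreationDate": "2023-01-15T10:30:00.000Z"},
    {"ImageId": "ami-bbbb2222", "Name": "golden-base-2",
     "CreationDate": "2023-03-02T08:00:00.000Z"},
    {"ImageId": "ami-cccc3333", "Name": "scratch-test",
     "CreationDate": "2023-04-01T12:00:00.000Z"},
]

if __name__ == "__main__":
    print(latest_golden_ami(SAMPLE_IMAGES))
```

In a real pipeline the list would come from the EC2 API rather than a hard-coded sample, and the chosen image ID would be fed into the CloudFormation launch configuration.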
MQTT broker clustering using EMQX
To achieve high availability and reliability for the MQTT brokers, we used an EMQX cluster. EMQX (Elastic MQTT Broker) is an open-source IoT MQTT message broker based on the Erlang/OTP platform. It is designed for massive client access and provides fast, low-latency message routing between large numbers of physical network devices.
The solution was deployed on AWS EKS. EMQX supports node discovery and autoscaling, and it replicates broker data across the cluster, i.e. information about topics, clients, and the global routing table. The Kubernetes HPA (Horizontal Pod Autoscaler) was used to create scaling policies for the brokers.
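To make the HPA scaling policy more concrete, here is a small Python sketch of the replica calculation described in the Kubernetes documentation: desired replicas = ceil(current replicas x current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default in Kubernetes). The minimum and maximum replica bounds and the metric values below are hypothetical and do not reflect the customer's actual policy.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=3, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric.

    min_replicas and max_replicas are hypothetical bounds for illustration.
    """
    ratio = current_metric / target_metric
    # The HPA skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical values: 3 brokers at 80% average CPU against a 50% target.
    print(desired_replicas(3, current_metric=80, target_metric=50))
```

The real controller also handles missing metrics, pod readiness and stabilization windows, which this sketch leaves out.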
Observability
We added two layers of observability. The first covers the base Kubernetes clusters using standard Prometheus metrics exporters, which provide metrics about the scaling and availability of the base infrastructure. The second scrapes MQTT-specific metrics such as subscription counts and connections, using the EMQX Prometheus agent.
The customer was already using Zabbix for monitoring & visualizing their infrastructure. So we utilized Zabbix for visualization and connected it to the Prometheus data sources that we provisioned.
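As an illustration of the MQTT-specific layer, the sketch below parses a small sample in the Prometheus text exposition format and reads connection and subscription counts from it. The metric names `emqx_connections_count` and `emqx_subscriptions_count` and the sample values are assumptions made for this example; the names exposed by a given EMQX version should be checked against its metrics endpoint.

```python
def parse_prometheus_text(text):
    """Parse simple Prometheus text exposition samples into {name: value}.

    Handles lines of the form "metric_name value"; comment lines starting
    with "#" and blank lines are skipped. Label values containing spaces
    are not supported in this simplified parser.
    """
    metrics = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # Not a valid sample line.
        metrics[parts[0]] = float(parts[1])
    return metrics


# Hypothetical scrape output; metric names and values are assumptions.
SAMPLE_SCRAPE = """\
# TYPE emqx_connections_count gauge
emqx_connections_count 1500
# TYPE emqx_subscriptions_count gauge
emqx_subscriptions_count 4200
"""

if __name__ == "__main__":
    scraped = parse_prometheus_text(SAMPLE_SCRAPE)
    print(scraped["emqx_connections_count"], scraped["emqx_subscriptions_count"])
```

In the deployed setup Prometheus performs the scraping itself; a parser like this is only useful for ad hoc checks against a metrics endpoint.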
Testing
We used EMQTT_BENCH to gather performance metrics for this setup.
On a 3-node cluster with around 22 vCPU and 48 Gi of memory, we were able to handle 200k device connections (subscribers) with around 10 publishers. We observed a message-receiving rate of roughly 99-100%.
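A message-receiving rate like the one above can be derived by comparing the number of messages subscribers actually received with the number they were expected to receive. The Python sketch below shows one way to compute it, assuming every published message is expected to reach a fixed number of subscribers. The counts used in the example are hypothetical and are not the figures from this test.

```python
def receive_rate(messages_published, subscribers_per_message, messages_received):
    """Return the percentage of expected deliveries that actually arrived.

    Assumes each published message should be delivered to the same number
    of subscribers (subscribers_per_message).
    """
    expected = messages_published * subscribers_per_message
    if expected <= 0:
        raise ValueError("expected deliveries must be positive")
    return 100.0 * messages_received / expected


if __name__ == "__main__":
    # Hypothetical counts: 1,000 messages, 200 subscribers each,
    # 199,400 deliveries observed.
    print(round(receive_rate(1_000, 200, 199_400), 2))
```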
Benefits
Earlier, the customer had to provision different nodes based on the demand at that moment. This was not scalable, and it made it impossible to choose an ideal node size. Sometimes a node was underutilized, which drove up infrastructure cost, and sometimes it was overutilized, which meant downtime for migration to a larger node size because scaling was not an option.
Currently, they have almost 120k devices deployed in the field. Testing showed that the current setup of 3 same-size nodes remains reliable for almost 180k devices. That is a 50% increase in the number of devices that can be added on the same infrastructure.
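The 50% figure follows directly from the two device counts quoted above. A short Python check of the arithmetic, using only those two numbers:

```python
def headroom_percent(current_devices, tested_capacity):
    """Return how many more devices fit, as a percentage of the current count."""
    if current_devices <= 0:
        raise ValueError("current_devices must be positive")
    return 100.0 * (tested_capacity - current_devices) / current_devices


if __name__ == "__main__":
    # 120k devices deployed today, roughly 180k tested on the same 3 nodes.
    print(headroom_percent(120_000, 180_000))
```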
On top of the infrastructure cost savings, the customer achieved a significant operational saving. All brokers are now managed in exactly the same way. Any addition to the brokers, such as authentication or hardening of the base infrastructure, is applied to all brokers simultaneously. This eliminates the chance of configuration drift and reduces maintenance overhead.
Since the entire solution now runs in a Kubernetes environment, smarter logic for downscaling and upscaling can be added to save costs more effectively.
Why InfraCloud
-
We are more than just service providers; we help our customers build better products from day 1 of the engagement. We're helping customers shape products from both an engineering and a business-value perspective.
-
We are also a premier technology company whose engineers have pioneered DevOps with containers, container orchestration, cloud platforms, infrastructure automation, SDN, and big data infrastructure solutions.
-
Our deep expertise and focus in these areas enable our customers to build better software faster.
-
We have 50 Kubernetes-certified specialists (CKA, CKAD & CKS) that you can trust.
-
We are also one of the cloud-native technology thought leaders, with speakers and authors contributing to global CNCF conferences.