We helped the customer to implement autoscaling of workload as well as the kubernetes infrastructure/nodes to offer scalability based on demand and use resources in most efficient manner to save overall cost.
The customer is a venture-backed product company that helps government agencies in the US to operate and scale all government operations effectively by using state of the art software. They help many governments streamline operations such as ERP, Budgeting, financial management & citizen services, all via cloud-hosted software. They power more than 1000 governments.
The customer was using AWS & Kubernetes as an infrastructure platform and Datadog SaaS as the primary monitoring platform. They wanted to implement auto-scaling of workload and infrastructure/nodes to scale up and down based on demand and use resources in an optimal manner. The challenge was that Datadog at that point did not have direct integration with Kubernetes’s metric server.
The customer’s monitoring platform was built on Datadog and was used extensively for monitoring. Kubernetes’s native horizontal pod, autoscaler only integrated with the metric server for which there was no Datadog integration available.
The path to building a horizontal pod autoscaler was not clear. After the first consultation with InfraCloud, customer discovered the entire pipeline built of adapters for the newly introduced metric server. But during implementation, we found a better path of implementation – which would mean pivoting to a different design than earlier envisioned.
InfraCloud designed an end-to-end monitoring pipeline, as shown in the diagram below, using two adapters. The first adapter enabled passing metrics from Statsd agent to Prometheus by doing the appropriate conversion. The second adapter converted metrics from Prometheus to metric server format. This total pipeline allows custom metrics passed from application to metric server to be scaled.
While the implementation started based on initial design and testing was being carried out Datadog introduced the metric server compliant “Datadog cluster agent”. This would simplify the pipeline drastically by removing components and would also be supported by the provider. Infracloud recommended switching to a new design for simplicity. The implementation was then changed to a new architecture, as shown in the diagram below.
For scaling the nodes, we evaluated two options:
Escalator - An open source node autoscale from Atlassian.
Kubernetes Autoscaler - the official node autoscaler from Kubernetes community.
All requirements were evaluated and with a quick PoC, the Kubernetes Autoscaler was chosen as the go-to tool. Kubernetes autoscaler enabled scaling of nodes in clusters based on the resource needs of the application. Scaling up and scaling down policies were defined along with pod disruption budgets for each application.
Autoscaling helped the customer to scale the worker application instances based on messages in queue even though Datadog at that point did not support integration with autoscaling servers of Kubernetes.
With node autoscaler in place, the cost of infrastructure was optimised to the current resources in use and thus, overall cost saving was achieved.