Client Details
The customer has an advertisement bidding platform used by the largest retail company in the world. They process ~1BN messages per day.
Context
InfraCloud team built a global Kafka messaging platform on top of Mesos. The messages were processed by a big data pipeline.
Challenges
- The platform infrastructure should be cloud-agnostic, and it should be possible to deploy the whole platform at the click of a button with four 9s availability.
- Designing and implementing a Kafka based message queue platform to handle a high volume of messages for ad bidding platform (~ 1Bn per day)
- Some solution components were to be integrated with the big data pipeline; the big data solution itself was out of scope for this project.
Solution
- For container orchestration, we evaluated Kubernetes and DCOS Mesos. Based on maturity at that point and the requirement, we chose Mesos DCOS (2015).
- We had done POCs on Kafka and related management tools for visibility and monitoring for the message queue.
- We evaluated Ansible and SaltStack platform automation. We chose Saltstack as a solution.
- The prototype phase aimed to find the technologies fit for the use case by building and testing at a prototype scale.

Solution - Infrastructure
- Cloud agnostic platform built with Saltstack & Mesos. Saltstack provisions the infra in any cloud and then sets up the Mesos cluster.
- On top of Mesos cluster services such as Kafka, ElasticSearch deployed using the DCOS framework. Application deployed as Dockerized containers using the Marathon framework.
- We replicated the Kafka cluster across availability zones (AZ) with the Kafka mirror.

Solution - Messaging Platform
- We designed regional Kafka clusters for separating incoming messages based on scale and demand. Regional Kafka clusters were closed to the edge in 7 regions globally.
- Central Kafka cluster for finally sending filtered messages to the data pipeline
- A replication factor of 3 at the Kafka cluster level for high redundancy.
- S3 persistence for future replay and backup.
- Kafkacat for immediately replaying failed messages, if any.
- We used Camus to convert Kafka messages to HDFS and used by Hadoop batch jobs and Spark jobs.