Skullzmedia, a rapidly growing digital marketing agency, faced challenges in managing their data collection process due to exponential data growth from forms, landing pages, and other sources. Their existing monolithic architecture struggled to keep up, leading to data loss and delays. Skullzmedia approached me to architect and implement a high-performance data collection pipeline.
I collaborated with the Skullzmedia team to design a tailored solution capable of ingesting 10,000 records per second from various data sources. The pipeline needed to ensure data integrity, fault tolerance, and efficient storage for real-time analysis and reporting.
I designed and implemented a distributed data collection pipeline using a microservices approach with message queues for loose coupling and scalability. The key components included:
Data Sources: Forms, landing pages, and other sources sent data to the ingestion endpoint using HTTP POST requests in JSON format.
API Gateway: Amazon API Gateway received incoming data, handled validation, authentication, and rate limiting, and forwarded it to the message queue. A CORS (Cross-Origin Resource Sharing) policy restricted browser-based submissions to trusted origins, reducing spam and unwanted cross-site traffic.
RabbitMQ: A RabbitMQ cluster on Amazon EC2 handled message queueing and reliable data delivery, utilizing multiple queues and exchange types for routing.
Golang Workers: Stateless, horizontally scalable Golang worker processes consumed messages from RabbitMQ, performed transformations, and stored the data in PostgreSQL. On-the-fly validation inside the workers protected data integrity and kept malformed or invalid records from being processed; a sketch of this consume-validate-store loop appears after the component list.
PostgreSQL: A PostgreSQL database cluster on Amazon RDS stored the collected data with optimized schema design, indexes, and partitioning strategies.
AWS Infrastructure: The pipeline was deployed on AWS, leveraging EC2 for RabbitMQ and Golang workers, Auto Scaling for dynamic scaling, RDS for PostgreSQL, and CloudWatch for monitoring and logging. AWS Shield was utilized to protect against DDoS (Distributed Denial of Service) attacks, providing an additional layer of security.
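To make the consume-transform-store flow concrete, here is a minimal sketch of a worker along the lines described above. It assumes the rabbitmq/amqp091-go client and the pgx v5 driver, a hypothetical queue named form_submissions, a hypothetical submissions table, and a simplified Submission payload; the production workers' transformations, schema, and error handling were more involved.

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"os"

	"github.com/jackc/pgx/v5/pgxpool"
	amqp "github.com/rabbitmq/amqp091-go"
)

// Submission is an illustrative payload shape, not the real schema.
type Submission struct {
	Source string          `json:"source"`
	Email  string          `json:"email"`
	Fields json.RawMessage `json:"fields"`
}

func main() {
	ctx := context.Background()

	conn, err := amqp.Dial(os.Getenv("AMQP_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	pool, err := pgxpool.New(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	// Manual acknowledgments: a message leaves the queue only after the insert succeeds.
	msgs, err := ch.Consume("form_submissions", "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for d := range msgs {
		var s Submission
		if err := json.Unmarshal(d.Body, &s); err != nil {
			d.Nack(false, false) // malformed payload: discard (or route to a dead-letter queue)
			continue
		}
		if _, err := pool.Exec(ctx,
			`INSERT INTO submissions (source, email, payload) VALUES ($1, $2, $3)`,
			s.Source, s.Email, s.Fields); err != nil {
			d.Nack(false, true) // transient database error: requeue for another attempt
			continue
		}
		d.Ack(false)
	}
}
```

Keeping acknowledgments manual is what makes a failed insert harmless: the message stays on the queue and is redelivered, so a worker crash does not lose data.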
Developed a custom Golang library for interacting with RabbitMQ, encapsulating connection management, error handling, and message acknowledgment logic (a connection-management sketch along these lines follows this list).
Implemented connection pooling with the pgx package in the Golang workers to optimize database performance under high concurrency (see the pool-configuration sketch after this list).
Utilized AWS Auto Scaling to automatically adjust worker instances based on incoming data volume, ensuring optimal resource utilization and cost efficiency.
Set up comprehensive monitoring and alerting with CloudWatch, tracking KPIs such as data ingestion rate, processing latency, and error rates (an illustrative custom-metric snippet appears below).
Implemented CORS configuration in the API Gateway to allow only trusted origins, blocking browser-based submissions from unapproved sites.
Developed a robust data validation framework in the Golang workers to validate incoming data against predefined schemas, rejecting malformed or invalid records before they reach the database (see the validation sketch below).
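The custom RabbitMQ library itself is not reproduced in this write-up; the following is a small sketch, assuming the amqp091-go client, of the kind of connection management such a wrapper typically encapsulates. The Client type and its redial loop are illustrative names, not the real library's API.

```go
package rabbit

import (
	"log"
	"time"

	amqp "github.com/rabbitmq/amqp091-go"
)

// Client wraps a connection and channel behind a reconnecting helper.
type Client struct {
	url  string
	conn *amqp.Connection
	ch   *amqp.Channel
}

// Dial connects to the broker and starts a goroutine that re-establishes
// the connection whenever it drops.
func Dial(url string) (*Client, error) {
	c := &Client{url: url}
	if err := c.connect(); err != nil {
		return nil, err
	}
	go c.redial()
	return c, nil
}

func (c *Client) connect() error {
	conn, err := amqp.Dial(c.url)
	if err != nil {
		return err
	}
	ch, err := conn.Channel()
	if err != nil {
		conn.Close()
		return err
	}
	c.conn, c.ch = conn, ch
	return nil
}

// redial blocks until the current connection closes, then retries with a
// fixed backoff. A production version would also guard c.ch with a mutex
// and re-declare queues and consumers after reconnecting.
func (c *Client) redial() {
	for {
		if err := <-c.conn.NotifyClose(make(chan *amqp.Error, 1)); err != nil {
			log.Printf("rabbitmq connection lost: %v", err)
		}
		for c.connect() != nil {
			time.Sleep(2 * time.Second)
		}
	}
}
```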
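For the pgx connection pooling, a minimal configuration sketch with pgxpool looks like the following; the pool sizes shown are placeholder values, not the tuned production settings.

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func newPool(ctx context.Context) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(os.Getenv("DATABASE_URL"))
	if err != nil {
		return nil, err
	}
	// Cap connections per worker so horizontally scaled workers do not
	// exhaust PostgreSQL's connection limit under load.
	cfg.MaxConns = 20
	cfg.MinConns = 2
	cfg.MaxConnLifetime = 30 * time.Minute
	return pgxpool.NewWithConfig(ctx, cfg)
}

func main() {
	pool, err := newPool(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()
	// The pool is safe for concurrent use by every goroutine in the worker.
}
```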
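As an illustration of the KPI tracking, a worker can publish a custom CloudWatch metric through the AWS SDK for Go; the namespace, metric name, and value below are hypothetical, and in practice such counters would be aggregated and flushed periodically rather than sent per record.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	sess := session.Must(session.NewSession())
	cw := cloudwatch.New(sess)

	// Report how many records this worker processed in the last interval.
	_, err := cw.PutMetricData(&cloudwatch.PutMetricDataInput{
		Namespace: aws.String("Skullzmedia/Pipeline"),
		MetricData: []*cloudwatch.MetricDatum{{
			MetricName: aws.String("RecordsProcessed"),
			Unit:       aws.String(cloudwatch.StandardUnitCount),
			Value:      aws.Float64(1250),
		}},
	})
	if err != nil {
		log.Printf("publish metric: %v", err)
	}
}
```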
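Finally, a minimal sketch of the kind of per-record checks the validation framework applied, reusing the illustrative Submission shape from the worker sketch above; the actual predefined schemas were specific to Skullzmedia's forms and landing pages.

```go
package ingest

import (
	"encoding/json"
	"errors"
	"net/mail"
)

// Submission mirrors the illustrative payload used in the worker sketch.
type Submission struct {
	Source string          `json:"source"`
	Email  string          `json:"email"`
	Fields json.RawMessage `json:"fields"`
}

// Validate rejects records that would otherwise reach the database malformed.
func (s Submission) Validate() error {
	if s.Source == "" {
		return errors.New("source is required")
	}
	if _, err := mail.ParseAddress(s.Email); err != nil {
		return errors.New("email is not a valid address")
	}
	if len(s.Fields) > 0 && !json.Valid(s.Fields) {
		return errors.New("fields must be valid JSON")
	}
	return nil
}
```

In the worker loop, a call to Validate between unmarshalling and the database insert is what turns a bad record into a rejected message rather than a bad row.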
Data Ingestion Rate: Consistently achieved a sustained data ingestion rate of 10,000 records per second.
Data Processing Latency: Maintained an average end-to-end latency of under 500 milliseconds from data ingestion to storage.
System Availability: Achieved an uptime of 99.9% through redundant components, auto-scaling, and fault-tolerant design patterns.
Data Accuracy: Implemented data validation and error handling, resulting in a data accuracy rate of 99.99%.
The implemented data collection pipeline met Skullzmedia's requirements: sustained ingestion of 10,000 records per second, reliable message delivery and processing through RabbitMQ, efficient storage and retrieval with PostgreSQL, and a scalable AWS infrastructure able to absorb growing data volumes and future growth.
The CORS configuration in the API Gateway and the on-the-fly validation in the Golang workers strengthened the security and reliability of the pipeline. By limiting browser-based submissions to trusted origins and validating every incoming record, the system reduced spam and kept malformed data out of the stored dataset.
The utilization of AWS Shield provided an additional layer of protection against DDoS attacks, ensuring the availability and responsiveness of the data collection pipeline even in the face of malicious traffic.
The scalable data collection pipeline has empowered Skullzmedia to collect and analyze data efficiently, driving their marketing strategies and delivering exceptional results for their clients. The success of this project demonstrates the value of a well-architected data ingestion system in enabling data-driven decision-making and unlocking business growth while maintaining a robust security posture.