Business Objective / Goal
To accelerate large-scale pharmacy and supplier data processing, reduce infrastructure cost, and provide analytics capacity that scales with high-volume datasets.
Solutions & Implementation
- Migrated from a 5-node Vertica cluster to a 3-node Apache Spark setup using Hortonworks Data Platform (HDP) on AWS.
- Deployed Spark 1.3 with each node configured for high memory (30 GB) and SSD storage (80 GB).
- Ingested data via Spark Data Source APIs from databases and HDFS.
- Replaced Vertica processing logic with Spark UDFs and used Spark SQL to process DataFrames efficiently.
- Partitioned DataFrames for parallel execution across nodes, ensuring balanced load distribution.
- Ran Spark on YARN as the cluster resource manager to schedule jobs across nodes and support high availability.
- Automated deployment of Spark jobs using Shell scripting for operational efficiency.
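The ingest, UDF, partitioning, and Spark SQL steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the table names (pharmacy_orders, suppliers), the column names, the cleanup rule in normalize_supplier_code, and the partition count are all hypothetical. It is also written against the current PySpark SparkSession API; on Spark 1.3 the entry point was SQLContext, so the calls would differ.

```python
def normalize_supplier_code(raw):
    """Hypothetical cleanup rule standing in for logic ported from Vertica:
    trim whitespace, upper-case, and drop dashes. None is passed through."""
    if raw is None:
        return None
    return raw.strip().upper().replace("-", "")


def run_batch(spark, jdbc_url, hdfs_path, num_partitions=12):
    """Sketch of the ingest -> repartition -> UDF -> Spark SQL flow.

    `spark` is an existing SparkSession; pyspark is imported lazily so the
    plain-Python rule above can be tested without a cluster.
    """
    from pyspark.sql.types import StringType

    # Ingest through the Data Source API: one table over JDBC, one dataset from HDFS.
    # "pharmacy_orders" and the Parquet layout are assumptions for illustration.
    orders = (
        spark.read.format("jdbc")
        .options(url=jdbc_url, dbtable="pharmacy_orders")
        .load()
    )
    suppliers = spark.read.parquet(hdfs_path)

    # Spread rows across the cluster so each node gets a similar share of work.
    orders = orders.repartition(num_partitions)

    # Register the Python function so it can be called from Spark SQL.
    spark.udf.register("normalize_supplier_code", normalize_supplier_code, StringType())

    orders.createOrReplaceTempView("orders")
    suppliers.createOrReplaceTempView("suppliers")

    # Hypothetical aggregation: order counts per cleaned supplier code.
    return spark.sql(
        """
        SELECT normalize_supplier_code(o.supplier_code) AS supplier_code,
               COUNT(*) AS order_count
        FROM orders o
        JOIN suppliers s
          ON normalize_supplier_code(o.supplier_code) = s.supplier_code
        GROUP BY normalize_supplier_code(o.supplier_code)
        """
    )
```

Keeping the rule as an ordinary Python function means it can be unit-tested on its own before it is registered as a UDF and run against the full dataset.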
Major Technologies Used
- Apache Spark – Core distributed processing engine
- AWS – Cloud infrastructure for deployment
- Hortonworks (HDP) – Platform for Spark cluster management
- Vertica – Source system for migration
- YARN, Spark SQL, UDFs, shell scripts – Resource management, data processing, and job deployment automation
Business Outcomes
- 62% Performance Boost in Data Processing – Reduced batch processing time significantly by optimizing architecture and parallelization.
- From 2.2 Hours to 1 Hour for 1.2 Billion Records – Improved throughput despite increasing data volume.
- 400% Reduction in IT Infrastructure Cost – Migrated from Vertica to open-source Spark on AWS, minimizing licensing and maintenance expenses.
- High Availability & Scalability Achieved – The YARN-based cluster handled large-scale data without performance bottlenecks.
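As a rough worked example, the run times quoted above can be turned into throughput figures. The sketch below assumes both runs processed the same 1.2 billion records, which the source does not state (it mentions growing data volume), so the "before" throughput is an upper-bound estimate. The function name throughput_summary is hypothetical, and the 62% figure is not derived here.

```python
def throughput_summary(records, hours_before, hours_after):
    """Compare two batch run times for the same (assumed) record count."""
    seconds_before = hours_before * 3600
    seconds_after = hours_after * 3600
    return {
        # Records processed per second in each setup.
        "records_per_sec_before": records / seconds_before,
        "records_per_sec_after": records / seconds_after,
        # How many times faster the new run is.
        "speedup": hours_before / hours_after,
        # Fraction of wall-clock time saved.
        "time_reduction": (hours_before - hours_after) / hours_before,
    }


summary = throughput_summary(records=1_200_000_000, hours_before=2.2, hours_after=1.0)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

Under that assumption, throughput rises from about 151,500 to about 333,300 records per second, a 2.2x speedup, or roughly 55% less elapsed time for this batch.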