Implemented a distributed data lake architecture and advanced analytics on the AWS cloud platform to reduce IT costs and improve productivity
Boosting Performance with Apache Spark Migration

About the Client
The Business Challenge
The client wanted to improve and accelerate analytics-driven decisions and reduce the time for data analysis, data analytics, and data reporting on both structured and unstructured data. Furthermore, the client wanted to improve the deviation tracking of mitigation tasks and reduce the system stack cost by enabling an open-source, industrial-grade platform. Additionaly, the client also wanted to prepare ground and infra for AI/ML and advanced analytics
What Aptus Data Labs Did
We migrated the client’s existing 5-node Vertica Cluster platform to Apache Spark in Hortonworks on AWS Cluster to improve the processing time and quickly adapt to new features in the future along with cost reduction.
The Impact Aptus Data Labs Made
The new analytics platform boosted the performance by 62% and reduced the data processing time. In addition, it also reduced IT costs by 400% and helped the client to handle large volumes of data smoothly.
The Business and Technology Approach
Aptus Data Labs used the following methodology for environment migration and to resolve the existing challenge. Aptus Data Labs
- Deployed a 3-node HDP cluster with Apache Spark 1.3 on AWS with each node containing 30 GB memory and 80 GB solid-state drive
- Used Spark data source API to ingest data from both database and HDFS source
- Utilized DataFrames to store structured relational data instead of using traditional RDDs
- Replaced the processing procedures in Vertica with User Defined Functions (UDF) in Spark
- Deployed Spark SQL to pass DataFrames to the UDF for processing
- Partitioned the DataFrame to process across all nodes parallel
- Replaced Spark resource manager with Yarn resource manager to boot high availability of the cluster
- Implemented Shell scripts to deploy and automate Spark jobs
Tools Used
- Apache Spark Cluster
- AWS
- HDP platform
- Spark
- Vertica
The Outcome
The migrated analytics platform reduced the processing time from 2.2 hours for a billion records to 1 hour for 1.2 billion records that boosted the performance by 62%. Moreover, the analytics platform reduced IT costs significantly using open-source technologies. In addition, the platform used the yarn cluster to ensure high availability and high efficiency of the system. Furthermore, it also enabled the client to handle massive volumes of data smoothly without any break in the performance.
Related Case Studies
Unlock the Potential of Data Science with Aptus Data Labs
Don't wait to harness the power of data science - contact Aptus Data Labs today and start seeing results.
If you’re looking to take your business to the next level with data science, we invite you to contact us today to schedule a consultation. Our team will work with you to assess your current data landscape and develop a customized solution that will help you gain valuable insights and drive growth.