AWS Project - Build an ETL Data Pipeline on AWS EMR Cluster

Build a fully working, scalable, reliable, and secure data pipeline on AWS EMR from scratch, covering every data stage from collection to analysis and visualization.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love

Shailesh Kurdekar

Solutions Architect at Capital One

I have worked for more than 15 years in Java and J2EE and have recently developed an interest in Big Data technologies and Machine learning due to a big need at my workspace. I was referred here by a...

Camille St. Omer

Artificial Intelligence Researcher, Quora 'Most Viewed Writer' in 'Data Mining'

I came to the platform with no experience and now I am knowledgeable in Machine Learning with Python. No easy thing I must say, the sessions are challenging and go to the depths. I looked at graduate...

What will you learn

End-to-end implementation of a big data pipeline on AWS using the SaaS (Software as a Service) pattern
A scalable, reliable, and secure data architecture as followed by leading big data organisations
Detailed explanation of IaaS vs SaaS vs PaaS in big data scenarios
Extract the raw sales data into AWS S3
Spin up an EMR cluster on AWS with required configurations
Create an external Hive table on top of S3 to serve as a staging table
Perform various ETL steps in Hive and store the processed data in a Hive-managed table
Connect EMR Hive to Tableau Desktop
Visualize various KPIs in the sales data using Tableau
Discussion of pipeline orchestration and extract refreshes in Tableau Server
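
The staging and ETL steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the bucket path, the table names, and the columns (order_id, order_date, region, amount) are all hypothetical, and the Python helpers only assemble HiveQL text that you would then run in the Hive shell on the EMR cluster.

```python
def external_staging_ddl(table, s3_location, columns):
    """Return HiveQL for an external table over CSV files stored in S3.

    Dropping an external table removes only the Hive metadata;
    the files in S3 are left untouched.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


def managed_table_ddl(table, columns):
    """Return HiveQL for a Hive-managed table that holds processed data."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols}) STORED AS ORC"


def load_region_totals(source, target):
    """Return HiveQL for one example ETL step: total sales per region."""
    return (
        f"INSERT OVERWRITE TABLE {target}\n"
        "SELECT region, SUM(amount) AS total_amount\n"
        f"FROM {source}\n"
        "GROUP BY region"
    )


# Hypothetical schema and locations, chosen only for illustration.
SALES_COLUMNS = [
    ("order_id", "STRING"),
    ("order_date", "STRING"),
    ("region", "STRING"),
    ("amount", "DOUBLE"),
]

statements = [
    external_staging_ddl(
        "sales_staging", "s3://example-sales-bucket/raw/sales/", SALES_COLUMNS
    ),
    managed_table_ddl(
        "sales_by_region", [("region", "STRING"), ("total_amount", "DOUBLE")]
    ),
    load_region_totals("sales_staging", "sales_by_region"),
]

for statement in statements:
    print(statement + ";\n")
```

Keeping the raw data in an external table means the staging layer can be dropped and recreated without touching the files in S3, while the managed table holds the cleaned result that a tool such as Tableau would query.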

Project Description

In this Big Data project, a senior Big Data Architect demonstrates how to implement a big data pipeline on AWS at scale. You will use a sales dataset and analyse it with a competitive big data technology stack (Amazon S3, EMR, and Tableau) to derive metrics from the existing data. The pipeline is built on AWS to serve batch ingestion of the data for various consumers according to their needs. The design is highly scalable and has been implemented in a very large organisational setup.

Similar Projects

This is in continuation of the previous Hive project "Tough engineering choices with large datasets in Hive Part - 1", where we will work on processing big data sets using Hive.

In this Databricks Azure tutorial project, you will use Spark SQL to analyse the MovieLens dataset and provide movie recommendations. As part of this, you will deploy Azure Data Factory and data pipelines, and visualise the analysis.

In this project, we will look at Cassandra and what it is suited for, especially in a Hadoop environment, how to integrate it with Spark, and how to install it in our lab environment.

Curriculum For This Mini Project

Introduction and Architecture of the AWS project
Exploration of the dataset
Introduction to Amazon S3 and its features
Introduction to Amazon EMR and its features
SaaS - Software as a Service
PaaS vs IaaS (Platform vs Infrastructure as a Service)
Why choose EMR (Elastic MapReduce)
Hive vs Impala
How to connect Tableau to EMR
Creating EMR Cluster
Log in to the EMR Hive project
Upload data into Amazon S3
Using Hive as ETL tool
Hive final insertion
Connect Tableau to Amazon EMR Hive
Plot Charts
Plot dual combination charts
More complex dual combination charts in Tableau
How to assemble different charts and build a dashboard in Tableau
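
As a rough companion to the "Creating EMR Cluster" and "Upload data into Amazon S3" lessons, the sketch below assembles the request you might pass to boto3's EMR client. It is an assumption-laden illustration rather than the configuration used in the videos: the cluster name, bucket, key pair name, release label, and instance types are placeholders, and the default EMR roles are assumed to already exist in the account. The code only builds the request dictionary; the actual AWS calls are shown in comments so nothing is launched by running it.

```python
def emr_cluster_request(name, log_uri, key_name, core_nodes=2):
    """Build keyword arguments for the boto3 EMR client's run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick one available to you
        "LogUri": log_uri,
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",  # placeholder instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",  # placeholder instance type
                    "InstanceCount": core_nodes,
                },
            ],
            "Ec2KeyName": key_name,
            # Keep the cluster running so Hive can be used interactively.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EC2 instance profile
        "ServiceRole": "EMR_DefaultRole",      # default EMR service role
    }


# Hypothetical names, for illustration only.
request = emr_cluster_request(
    name="sales-etl-cluster",
    log_uri="s3://example-sales-bucket/emr-logs/",
    key_name="example-key-pair",
    core_nodes=3,
)
print(request["Name"], request["ReleaseLabel"])

# With AWS credentials configured, the upload and launch would look like:
#   import boto3
#   boto3.client("s3").upload_file("sales.csv", "example-sales-bucket", "raw/sales/sales.csv")
#   response = boto3.client("emr").run_job_flow(**request)
#   print(response["JobFlowId"])
```

Building the request as plain data first makes it easy to review or version the cluster settings before anything is launched, and to terminate and recreate the cluster with the same configuration.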