AWS Project - Build an ETL Data Pipeline on AWS EMR Cluster

Build a fully working, scalable, reliable, and secure data pipeline on AWS EMR from scratch, covering every stage from data collection to analysis and visualization.

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love

Camille St. Omer

Artificial Intelligence Researcher, Quora 'Most Viewed Writer' in 'Data Mining'

I came to the platform with no experience and now I am knowledgeable in Machine Learning with Python. No easy thing I must say, the sessions are challenging and go to the depths. I looked at graduate...

Mike Vogt

Information Architect at Bank of America

I have had a very positive experience. The platform is very rich in resources, and the expert was thoroughly knowledgeable on the subject matter - real world hands-on experience. I wish I had this...

What will you learn

End-to-end implementation of a big data pipeline on AWS using the SaaS (Software as a Service) pattern
A scalable, reliable, and secure data architecture as practiced by leading big data teams
Detailed explanation of IaaS vs. SaaS vs. PaaS in big data scenarios
Extract the raw sales data into Amazon S3
Spin up an EMR cluster on AWS with the required configuration
Create an external Hive table on top of S3 to serve as a staging table
Perform various ETL operations in Hive and store the processed data in a Hive managed table
Connect EMR Hive to Tableau Desktop
Visualize various KPIs in the sales data using Tableau
Discussion of pipeline orchestration and extract refreshes in Tableau Server
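
The staging and ETL steps in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the bucket, table names, and columns are hypothetical, and the aggregation is only one example of the kind of transformation Hive might perform. The script just builds the HiveQL text; it does not connect to a cluster.

```python
# Hypothetical names throughout: the bucket, tables, and columns are
# illustrative assumptions, not taken from the project materials.

def staging_table_ddl(table, columns, s3_location):
    """Return HiveQL for an external staging table over CSV files in S3."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


def load_managed_table_sql(target, source):
    """Return HiveQL that aggregates staged rows into a managed table.

    The target table is assumed to exist already with matching columns.
    """
    return (
        f"INSERT OVERWRITE TABLE {target}\n"
        "SELECT region, SUM(quantity * unit_price) AS total_sales\n"
        f"FROM {source}\n"
        "GROUP BY region"
    )


# Assumed schema for the raw sales files.
SALES_COLUMNS = [
    ("order_id", "STRING"),
    ("region", "STRING"),
    ("quantity", "INT"),
    ("unit_price", "DOUBLE"),
]

ddl = staging_table_ddl(
    "sales_staging", SALES_COLUMNS, "s3://example-bucket/raw/sales/"
)
etl = load_managed_table_sql("sales_by_region", "sales_staging")

print(ddl)
print(etl)
```

Because the staging table is external, dropping it in Hive leaves the raw files in S3 untouched, while the managed table's data is owned by Hive.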

Project Description

In this Big Data project, a senior Big Data Architect demonstrates how to implement a big data pipeline on AWS at scale. You will use a sales dataset and analyse it with a big data stack of Amazon S3, EMR, and Tableau to derive metrics from the existing data. The pipeline is built on AWS to serve batch ingestion of data for various consumers according to their needs. The design is highly scalable and has been implemented in a very large-scale organisational setup.

Similar Projects

In this project, we will evaluate and demonstrate how to handle unstructured data using Spark.

In this Spark project, we bring processing to the speed layer of the lambda architecture, which opens up capabilities to monitor application performance in real time, measure real-time comfort with applications, and raise real-time alerts in case of security

Hive Project - Understand the various types of SCDs and implement these slowly changing dimensions in Hadoop Hive and Spark.

Curriculum For This Mini Project

Introduction and Architecture of the AWS project
Exploration of the dataset
Introduction to Amazon S3 and its features
Introduction to Amazon EMR and its features
SaaS - Software as a Service
PaaS vs IaaS (Platform vs Infrastructure as a Service)
Why choose EMR (Elastic MapReduce)
Hive vs Impala
How to connect Tableau to EMR
Creating EMR Cluster
Log in to the EMR Hive project
Upload data into Amazon S3
Using Hive as ETL tool
Hive final insertion
Connect Tableau to Amazon EMR Hive
Plot Charts
Plot dual combination charts
More complex dual combination charts in Tableau
How to assemble different charts and build a dashboard in Tableau
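
For the "Creating EMR Cluster" step in the curriculum above, a cluster request might look something like the sketch below. It only builds the keyword arguments in the shape that boto3's EMR `run_job_flow` call accepts and does not contact AWS. The cluster name, log bucket, instance types, node count, and release label are assumptions for illustration, and the course itself may create the cluster through the AWS console instead.

```python
def emr_cluster_request(name, log_uri, core_nodes=2):
    """Build keyword arguments in the shape boto3's EMR run_job_flow accepts."""
    return {
        "Name": name,
        "LogUri": log_uri,
        # Assumed release label; choose one available in your region.
        "ReleaseLabel": "emr-6.15.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": core_nodes,
                },
            ],
            # Keep the cluster running so Hive stays reachable from Tableau.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default EMR role names; your account may use different ones.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical cluster name and log bucket.
request = emr_cluster_request("sales-etl", "s3://example-bucket/emr-logs/")
print(request["Name"], len(request["Instances"]["InstanceGroups"]))

# With AWS credentials configured, the request could be submitted with:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as plain data makes it easy to review or version before anything is launched, since a long-running cluster incurs cost until it is terminated.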