AWS Project - Build an ETL Data Pipeline on AWS EMR Cluster

Build a fully working, scalable, reliable, and secure data pipeline on AWS EMR from scratch, covering every data stage from collection to analysis and visualization.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love

Camille St. Omer

Artificial Intelligence Researcher, Quora 'Most Viewed Writer' in 'Data Mining'

I came to the platform with no experience and now I am knowledgeable in Machine Learning with Python. No easy thing I must say, the sessions are challenging and go to the depths. I looked at graduate...

Mike Vogt

Information Architect at Bank of America

I have had a very positive experience. The platform is very rich in resources, and the expert was thoroughly knowledgeable on the subject matter - real world hands-on experience. I wish I had this...

What you will learn

End-to-end implementation of a Big Data pipeline on AWS using the SaaS (Software as a Service) pattern
A scalable, reliable, and secure data architecture as followed by leading Big Data practitioners
Detailed explanation of IaaS vs. SaaS vs. PaaS in Big Data scenarios
Extract the raw sales data into Amazon S3
Spin up an EMR cluster on AWS with the required configuration
Create an external Hive table on top of S3 to serve as a staging table
Perform various ETL steps in Hive and store the processed data in a Hive managed table
Connect EMR Hive to Tableau Desktop
Visualize various KPIs in the sales data using Tableau
Orchestration of the pipeline, with a discussion of extract refreshes in Tableau Server
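
The staging and ETL steps in the list above can be sketched roughly as follows. This is a minimal illustration, not the course's actual code: the table names (`sales_staging`, `sales_processed`), the column list, and the bucket path `s3://example-sales-bucket/raw/sales/` are all hypothetical. The Python helpers only build HiveQL strings, which you would then run in the Hive shell on the EMR cluster.

```python
# Sketch only: table names, columns, and the S3 path are hypothetical placeholders.


def _column_list(columns):
    """Render (name, hive_type) pairs as the body of a Hive column definition."""
    return ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)


def external_table_ddl(table, columns, s3_location, delimiter=","):
    """Build HiveQL for an external staging table over delimited files in S3."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {_column_list(columns)}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'"
    )


def managed_table_ddl(table, columns):
    """Build HiveQL for a Hive managed table that holds the processed data."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {_column_list(columns)}\n)\n"
        "STORED AS ORC"
    )


def etl_insert(target, staging, select_exprs, where=None):
    """Build HiveQL that transforms staging rows and overwrites the target table."""
    sql = (
        f"INSERT OVERWRITE TABLE {target}\n"
        f"SELECT {', '.join(select_exprs)}\n"
        f"FROM {staging}"
    )
    if where:
        sql += f"\nWHERE {where}"
    return sql


# Raw files are read as strings first; types are applied during the ETL step.
staging_ddl = external_table_ddl(
    "sales_staging",
    [("order_id", "STRING"), ("order_date", "STRING"),
     ("region", "STRING"), ("amount", "STRING")],
    "s3://example-sales-bucket/raw/sales/",
)

processed_ddl = managed_table_ddl(
    "sales_processed",
    [("order_id", "STRING"), ("order_date", "DATE"),
     ("region", "STRING"), ("amount", "DOUBLE")],
)

etl_sql = etl_insert(
    "sales_processed",
    "sales_staging",
    ["order_id", "CAST(order_date AS DATE)", "UPPER(region)", "CAST(amount AS DOUBLE)"],
    where="order_id IS NOT NULL",
)

for statement in (staging_ddl, processed_ddl, etl_sql):
    print(statement + ";\n")
```

One reason to keep the staging table external is that dropping it leaves the raw files in S3 untouched, whereas dropping a managed table also deletes its data.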

Project Description

In this Big Data project, a senior Big Data Architect will demonstrate how to implement a Big Data pipeline on AWS at scale. You will use a sales dataset and analyse it with a Big Data stack of Amazon S3, EMR, and Tableau to derive metrics from the existing data. The pipeline is built on AWS to serve batch ingestion of data for various consumers according to their needs. The design is highly scalable and reflects a setup implemented in a very large organisation.
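
As a rough sketch of the EMR part of this stack, the snippet below assembles a cluster request with Hive installed. It assumes the cluster is created programmatically rather than through the AWS console, which the project itself may well use instead. The cluster name, log bucket, node count, instance type, and release label are illustrative assumptions; the dictionary keys follow the parameters of boto3's `run_job_flow` call, and the role names are the AWS default EMR roles.

```python
# Sketch only: the name, log URI, sizes, and release label are hypothetical choices.


def emr_cluster_request(name, log_uri, core_nodes=2,
                        instance_type="m5.xlarge", release_label="emr-6.15.0"):
    """Build keyword arguments for a long-running EMR cluster with Hive installed."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # Keep the cluster running so Hive stays reachable from Tableau.
            "KeepJobFlowAliveWhenNoSteps": True,
            "TerminationProtected": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EC2 instance profile for EMR
        "ServiceRole": "EMR_DefaultRole",      # default EMR service role
        "VisibleToAllUsers": True,
    }


request = emr_cluster_request(
    name="sales-etl-cluster",
    log_uri="s3://example-sales-bucket/emr-logs/",
)
print(request["Name"], request["ReleaseLabel"])
```

With boto3 installed and AWS credentials configured, the request could be submitted with `boto3.client("emr").run_job_flow(**request)`. Note that a cluster kept alive this way accrues cost until it is terminated.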

Similar Projects

In this project, we will evaluate and demonstrate how to handle unstructured data using Spark.

In this Spark project, we bring processing to the speed layer of the lambda architecture, which opens up capabilities to monitor application performance in real time, measure real-time comfort with applications, and raise real-time alerts in case of security

Hive Project - Understand the various types of SCDs and implement these slowly changing dimensions in Hadoop Hive and Spark.

Curriculum For This Mini Project

Introduction and Architecture of the AWS project
02m
Exploration of the dataset
02m
Introduction to Amazon S3 and its features
05m
Introduction to Amazon EMR and its features
06m
SaaS - Software as a Service
04m
PaaS vs IaaS (Platform vs Infrastructure as a Service)
04m
Why choose EMR (Elastic MapReduce)
06m
Hive vs Impala
02m
How to connect Tableau to EMR
03m
Creating EMR Cluster
04m
Log in to the EMR Hive project
04m
Upload data into Amazon S3
02m
Using Hive as ETL tool
05m
Hive final insertion
02m
Connect Tableau to Amazon EMR Hive
02m
Plot Charts
09m
Plot dual combination charts
09m
More complex dual combination charts in Tableau
09m
How to assemble different charts and build a dashboard in Tableau
13m