Create a data pipeline based on messaging using Spark and Hive

In this Spark project, we will simulate a simple real-world batch data pipeline built around messaging, using Spark and Hive.

What will you learn

  • Designing a data pipeline based on messaging
  • Loading data from a remote URL
  • Applying Spark transformations
  • Launching a Spark application from your own application
  • Exploring Hive as a backend for structured data access
  • Discussing pipeline automation with Oozie or Airflow as an alternative
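One of the topics above, launching a Spark application from your own application, can be done by shelling out to `spark-submit`. Below is a minimal sketch; the class name, jar path, and master URL are placeholders, and only the `--class` and `--master` flags of the real `spark-submit` CLI are used.

```scala
// Build the spark-submit command line for a packaged Spark application.
// mainClass, jar, and master are caller-supplied placeholders.
def sparkSubmitCommand(mainClass: String,
                       jar: String,
                       master: String,
                       args: Seq[String]): Seq[String] =
  Seq("spark-submit", "--class", mainClass, "--master", master, jar) ++ args

// Launch the command as a child process without blocking the caller.
def launch(cmd: Seq[String]): scala.sys.process.Process = {
  import scala.sys.process._
  cmd.run() // use cmd.! instead to block and get the exit code
}
```

In the project we will look at when launching a child process like this is appropriate, versus scheduling the job through an orchestrator such as Oozie or Airflow.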

Project Description

A data pipeline is a set of actions performed from the time data becomes available for ingestion until value is derived from that data. These actions typically include Extraction (pulling the fields of value out of the dataset), Transformation, and Loading (putting the transformed data into a form that is useful for downstream use).
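The three stages can be sketched over in-memory records. This is a toy illustration of the Extract/Transform/Load shape only; `RawEvent` and its fields are hypothetical stand-ins for the GitHub event JSON the real pipeline processes with Spark.

```scala
// Hypothetical raw record; the real dataset is GitHub Archive JSON events.
case class RawEvent(id: String, eventType: String, payload: String)

// Extract: keep only the fields of value from each record.
def extract(events: Seq[RawEvent]): Seq[(String, String)] =
  events.map(e => (e.eventType, e.payload))

// Transform: aggregate into a count of events per event type.
def transform(pairs: Seq[(String, String)]): Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => (k, vs.size) }

// Load: hand the result to storage; here we just return the table
// that would be written to a Hive table in the real pipeline.
def load(table: Map[String, Int]): Map[String, Int] = table
```

In the project, the same three stages run at scale: Spark performs the extraction and transformation, and Hive serves as the storage layer for the load step.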

In this big data project, we will simulate a simple batch data pipeline. Our dataset of interest comes from https://www.githubarchive.org/, which records more than 20 kinds of GitHub events.
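GitHub Archive publishes one gzipped JSON file per hour. Assuming the documented naming scheme (`YYYY-MM-DD-H.json.gz`, with an unpadded hour) and the `data.gharchive.org` host, the download URL for a given hour can be built like this:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Build the URL of the hourly archive file for a given timestamp.
// Assumes GitHub Archive's YYYY-MM-DD-H.json.gz naming (hour not zero-padded).
def archiveUrl(ts: LocalDateTime): String = {
  val day = ts.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"))
  s"https://data.gharchive.org/$day-${ts.getHour}.json.gz"
}
```

A download utility in the pipeline would iterate over the hours of interest, fetch each URL, and hand the file to the next stage.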

The objective of this Spark project is to create a small but real-world pipeline that downloads the archive files as they become available, initiates the various transformations, and loads the results into storage for further use.
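The "as they become available" part is what makes this a messaging-based pipeline: a producer announces each newly available file on a topic, and a consumer reacts by running the transform-and-load step. In the project the topic is a Kafka topic; in this sketch an in-memory queue stands in for it, and the file names are made up.

```scala
import scala.collection.mutable

// Stand-in for a Kafka topic: messages are archive file names.
val topic = mutable.Queue.empty[String]

// Producer: publish the name of each newly available archive file.
def producer(files: Seq[String]): Unit =
  files.foreach(f => topic.enqueue(f))

// Consumer: drain the topic, running the given processing step on each
// message; returns how many messages were handled.
def consumer(process: String => Unit): Int = {
  var handled = 0
  while (topic.nonEmpty) {
    process(topic.dequeue())
    handled += 1
  }
  handled
}
```

The curriculum contrasts this messaging-driven design with a DAG-based one ("DAG versus Messaging based pipeline"), where a scheduler such as Oozie or Airflow triggers each stage instead.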

This Spark project involves some software development; along the way, we will explore Spark and Hive and discuss several design decisions.

Curriculum For This Mini Project

 
  • Introduction to Data Pipeline (05m)
  • Accessing the data sources (15m)
  • Introduction to Events (11m)
  • How to write an Event (04m)
  • Designing a data pipeline using messaging (05m)
  • DAG versus Messaging based pipeline (11m)
  • Building a file pipeline utility (11m)
  • Executing the file pipeline utility (08m)
  • Create Kafka topics (06m)
  • Testing the data pipeline (04m)
  • Troubleshooting the data pipeline (04m)
  • Spark Transformation (10m)
  • Spark Scala Producer-Consumer (03m)
  • Launch spark application with spark-submit (10m)
  • Scala bin tool installation (01m)
  • Troubleshooting the spark application (03m)
  • Spark Transformation (03m)
  • Spark Transformation - Create Event (09m)
  • Spark Transformation - Create Event - Table Structure (09m)
  • Spark Transformation - Create generic table (12m)
  • Create the specific tables (16m)
  • Automated run to move data to Hive table (06m)
  • Troubleshooting (14m)
  • Troubleshooting - import hive context (07m)
  • Output of the Spark jobs (11m)
  • Conclusion (04m)