Create a data pipeline based on messaging using Spark and Hive

In this project, we will simulate a simple real-world batch data pipeline based on messaging using Spark and Hive.

What will you learn

  • Designing a data pipeline based on messaging
  • Load data from a remote URL
  • Spark transformation
  • Launching a Spark application from your application
  • Explore Hive as a backend for structured data access
  • Discuss pipeline automation with Oozie or Airflow as another option

What will you get

  • Access to recording of the complete project
  • Access to all project materials, such as data files and solution files

Prerequisites

  • Cloudera Quickstart VM
  • A fair understanding of Spark and Hive and how to use them

Project Description

A data pipeline is the set of actions performed from the time data becomes available for ingestion until value is derived from it. These actions are Extraction (pulling the fields of value out of the dataset), Transformation, and Loading (putting the valuable data into a form that is useful for downstream use).

In this project, we will simulate a simple batch data pipeline. Our dataset comes from https://www.githubarchive.org/, which records over 20 kinds of events.
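
As a rough illustration of what reading this dataset can look like, the sketch below assumes the archive publishes one gzipped file of newline-delimited JSON events per hour under `https://data.gharchive.org/`, with an hour that is not zero-padded, and that every event carries a `type` field. That URL pattern and the helper names `archive_url` and `count_event_types` are assumptions for illustration and are not stated on this page.

```python
import gzip
import io
import json
from collections import Counter
from datetime import datetime

# Assumption: base URL and hourly file naming of the public archive.
BASE_URL = "https://data.gharchive.org"


def archive_url(hour: datetime) -> str:
    """Build the (assumed) URL of the archive file for one hour."""
    # The hour is appended without zero-padding, e.g. 2015-01-01-5.json.gz
    return f"{BASE_URL}/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


def count_event_types(gz_bytes: bytes) -> Counter:
    """Count events per type in one gzipped, newline-delimited JSON file."""
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():  # skip blank lines
                event = json.loads(line)
                counts[event.get("type", "Unknown")] += 1
    return counts
```

The bytes passed to `count_event_types` could come from any HTTP client. Keeping the download separate from the parsing makes the parsing step easy to check against a small local sample.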

The objective of this Spark project is to create a small but real-world pipeline that downloads the dataset files as they become available, initiates the various transformations, and loads the results into storage suited to further use.
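
One way to picture a pipeline driven by messages rather than a fixed schedule is sketched below. It is a minimal, in-memory stand-in: Python's `queue.Queue` plays the role of a message topic, and the stage names, message fields, and `.transformed` output suffix are all hypothetical. In the actual project the messages would travel through a broker and the transformation would be a Spark job.

```python
import json
import queue

# In-memory stand-ins for two message topics (hypothetical names).
downloaded_topic = queue.Queue()
transformed_topic = queue.Queue()


def announce_download(path: str) -> None:
    """Download stage: publish a message once a file is available."""
    downloaded_topic.put(json.dumps({"event": "file_downloaded", "path": path}))


def transform_step() -> None:
    """Transform stage: consume one download message, publish the result."""
    message = json.loads(downloaded_topic.get_nowait())
    # Placeholder for the real transformation (a Spark job in the project).
    output_path = message["path"].replace(".json.gz", ".transformed")
    transformed_topic.put(
        json.dumps({"event": "file_transformed", "path": output_path})
    )


def load_step(table: list) -> None:
    """Load stage: consume one transform message and record it in a table."""
    message = json.loads(transformed_topic.get_nowait())
    table.append(message["path"])  # a plain list stands in for a Hive table
```

Each stage only knows about the messages it reads and writes, so a stage can be replaced or rerun without changing the others.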

This hackerday involves some software development, but we will also explore Spark and Hive and discuss some design decisions along the way.

Instructors

 
Michael

Big Data & Enterprise Software Engineer

I am passionate about software development, databases, data analysis and the Android platform. My native language is Java, but no one has stopped me so far from learning and using Angular and Node.js. Data and data analysis are thrilling, and so are my experiences with SQL on Oracle, Microsoft SQL Server, Postgres and MyS...

Curriculum For This Mini Project

 
  • Introduction to Data Pipeline (00:05:55)
  • Accessing the data sources (00:15:45)
  • Introduction to Events (00:11:21)
  • How to write an Event (00:04:56)
  • Designing a data pipeline using messaging (00:05:12)
  • DAG versus Messaging based pipeline (00:11:50)
  • Building a file pipeline utility (00:11:46)
  • Executing the file pipeline utility (00:08:06)
  • Create Kafka topics (00:06:50)
  • Testing the data pipeline (00:04:53)
  • Troubleshooting the data pipeline (00:04:22)
  • Spark Transformation (00:10:54)
  • Spark Scala Producer-Consumer (00:03:02)
  • Launch Spark application with spark-submit (00:10:48)
  • Scala bin tool installation (00:01:39)
  • Troubleshooting the Spark application (00:03:44)
  • Spark Transformation (00:03:58)
  • Spark Transformation - Create Event (00:09:28)
  • Spark Transformation - Create Event - Table Structure (00:09:49)
  • Spark Transformation - Create generic table (00:12:49)
  • Create the specific tables (00:16:51)
  • Automated run to move data to Hive table (00:06:32)
  • Troubleshooting (00:14:10)
  • Troubleshooting - import Hive context (00:07:50)
  • Output of the Spark jobs (00:11:54)
  • Conclusion (00:04:16)