Yelp Data Processing using Spark and Hive Part 2

In this Spark project, we will continue building the data warehouse from the previous project, Yelp Data Processing Using Spark And Hive Part 1, and will carry out further data processing to develop diverse data products.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Understanding the project agenda and how to implement Hive and Spark
Designing the data structures
Importing the necessary libraries
Creating the environment using the Cloudera VM on VMware
Data normalization and denormalization
Mapping different common attributes
Writing queries for creating and manipulating tables in the Hue editor
Introducing the time dimension
Coalescing the dataset into partitions using Scala (see the sketch after this list)
Handling snapshots and incremental data loads
Performing basic data preprocessing
Analyzing correlations among different columns using Hue and Impala
Building executables and importers in Scala
Taking queries from users via Excel
Packaging our Spark application as an executable using SBT and running it
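
To give a feel for the coalescing step, here is a minimal Scala sketch. The table name yelp.reviews, the partition column review_year, and the coalesce factor of 8 are illustrative assumptions, not the project's actual values.

    import org.apache.spark.sql.SparkSession

    object CoalesceSketch {
      def main(args: Array[String]): Unit = {
        // Hive-enabled session so saveAsTable targets the Hive metastore.
        val spark = SparkSession.builder()
          .appName("YelpCoalesceSketch")
          .enableHiveSupport()
          .getOrCreate()

        // Hypothetical table produced in Part 1 of the project.
        val reviews = spark.table("yelp.reviews")

        // coalesce() shrinks the number of partitions without a full shuffle,
        // so the table is written as a few larger files instead of many small ones.
        reviews
          .coalesce(8)
          .write
          .mode("overwrite")
          .partitionBy("review_year") // hypothetical column from the time dimension
          .saveAsTable("yelp.reviews_by_year")

        spark.stop()
      }
    }

Once packaged with sbt package, an object like this can be launched with spark-submit, which is the executable workflow the last bullet above refers to.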

Project Description

In the previous Spark project on the same subject, Yelp Data Processing Using Spark And Hive Part 1, we began shaping the Yelp dataset into domains that can easily be understood and consumed. Among other things, we covered:

  • Various ways to ingest data using Spark
  • Data transformation using Spark
  • Various ways of integrating Spark and Hive
  • Denormalizing the dataset into Hive tables, thereby creating multiple datasets (a sketch of this pattern follows the list)
  • Handling snapshots and incremental data loads
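
As a reminder of that workflow, the following is a minimal Scala sketch of the ingest-and-denormalize pattern. The file paths, the selected business columns, and the table name yelp.review_denormalized are assumptions for illustration.

    import org.apache.spark.sql.SparkSession

    object IngestAndDenormalizeSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("YelpIngestSketch")
          .enableHiveSupport()
          .getOrCreate()

        // The Yelp dataset ships as JSON-lines files; these paths are placeholders.
        val business = spark.read.json("/data/yelp/business.json")
        val review = spark.read.json("/data/yelp/review.json")

        // Denormalize: attach business attributes to each review so consumers
        // can query one wide Hive table instead of joining at read time.
        val denormalized = review
          .withColumnRenamed("stars", "review_stars") // avoid clashing with business columns
          .join(business.select("business_id", "name", "city", "state"), Seq("business_id"))

        denormalized.write.mode("overwrite").saveAsTable("yelp.review_denormalized")

        spark.stop()
      }
    }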

In this Spark project, we are going to continue building the data warehouse. The purpose of this big data project is to carry out further data processing to deliver different kinds of data products.

Similar Projects

Learn to write a Hadoop Hive Program for real-time querying.

Learn to design Hadoop Architecture and understand how to store data using data acquisition tools in Hadoop.

The goal of this IoT project is to build an argument for a generalized streaming architecture for reactive data ingestion based on a microservice architecture.

Curriculum For This Mini Project

Summary of the project agenda (05m)
Summary of data structure design (06m)
Initial and subsequent import (19m)
Dealing with sparse attributes (08m)
Normalization of data (18m)
Create a Map of category to id (15m)
Create Businesses and caching (16m)
Query the tables (18m)
Time Dimension (16m)
Building data products (07m)
Time Dimension Analysis (20m)
Coalesce into partitions (06m)
Designing data products (10m)
Analysis on user reviews (12m)
Analysis on correlation between tips and reviews (18m)
Building executables (14m)
Building the importer (17m)
Finishing the executable (08m)
Make it easy for users to query data (07m)
Q&A - Spark RDDs (05m)
Lazy evaluation (06m; a short illustration follows this list)
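
For the lazy evaluation lesson, here is a tiny Scala sketch of the idea: transformations such as filter and map only record lineage, and computation happens only when an action like count is invoked. The local[*] master is just for a standalone demonstration.

    import org.apache.spark.sql.SparkSession

    object LazyEvalSketch {
      def main(args: Array[String]): Unit = {
        // local[*] master is only for running this demo outside a cluster.
        val spark = SparkSession.builder()
          .appName("LazyEvalSketch")
          .master("local[*]")
          .getOrCreate()

        val numbers = spark.sparkContext.parallelize(1 to 1000000)

        // Transformations only build the lineage graph; nothing runs yet.
        val evens = numbers.filter(_ % 2 == 0)
        val doubled = evens.map(_ * 2)

        // The action triggers the whole pipeline in one pass.
        println(doubled.count()) // 500000

        spark.stop()
      }
    }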