Yelp Data Processing using Spark and Hive Part 2

In this Spark project, we will continue building the data warehouse from the previous project, Yelp Data Processing Using Spark And Hive Part 1, and do further data processing to develop diverse data products.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love


Dhiraj Tandon

Solution Architect-Cyber Security at ColorTokens

My interaction was very short but left a positive impression. I enrolled and asked for a refund since I could not find the time. What happened next: they initiated the refund immediately. Their...

Swati Patra

Systems Advisor , IBM

I have 11 years of experience and work with IBM. My domain is Travel, Hospitality and Banking - all sectors that process lots of data. The way the projects were set up and the mentors' explanations were...

What will you learn

Understanding the project agenda and how to implement Hive and Spark
Designing the data structures
Importing the necessary libraries
Setting up the environment using the Cloudera VMware image
Data normalization and denormalization
Mapping different common attributes
Writing queries for creating and manipulating tables in the Hue editor
Introducing the time dimension
Coalescing the dataset into partitions using Scala
Handling snapshots and incremental data loads
Performing basic data preprocessing
Analyzing correlations among different columns using Impala in Hue
Building executables and importers in Scala
Taking queries from users using Excel
Packaging the Spark application as an executable using SBT and running it
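To illustrate the coalescing step listed above, here is a minimal Spark/Scala sketch. The table name and output path are hypothetical, and a Hive-enabled SparkSession is assumed; this is a sketch of the technique, not the project's actual code:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: reduce the number of output files by coalescing
// a DataFrame into a few partitions before writing.
// The table name and output path below are hypothetical.
val spark = SparkSession.builder()
  .appName("YelpCoalesceExample")
  .enableHiveSupport()
  .getOrCreate()

val reviews = spark.table("yelp.reviews")   // hypothetical Hive table

reviews
  .coalesce(4)            // merge into 4 partitions without a full shuffle
  .write
  .mode("overwrite")
  .parquet("/warehouse/yelp/reviews_coalesced")
```

Note that `coalesce` avoids a shuffle by only merging existing partitions; to *increase* the partition count or rebalance skewed data, `repartition` would be used instead.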

Project Description

In the previous Spark project on the same subject, Yelp Data Processing Using Spark And Hive Part 1, we began developing the Yelp dataset into domains that can be easily understood and consumed. Amongst other things, we:

  • Explored various ways to ingest data using Spark
  • Transformed data using Spark
  • Covered various ways of integrating Spark and Hive
  • Denormalized the dataset into Hive tables, thereby creating multiple datasets
  • Discussed how to handle snapshots and incremental data loads
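As a refresher on the incremental-load point recapped above, a minimal Scala sketch of a high-water-mark append might look like the following. The table names, staging path, and `checkin_date` column are hypothetical, and a Hive-enabled SparkSession is assumed:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

// Sketch of an incremental load: append only records newer than the
// latest date already present in the warehouse table.
// All names below are hypothetical.
val spark = SparkSession.builder()
  .appName("YelpIncrementalLoadExample")
  .enableHiveSupport()
  .getOrCreate()

val existing = spark.table("yelp.checkins")
val incoming = spark.read.json("/staging/yelp/checkins_new.json")

// Find the high-water mark of the data already loaded.
val lastDate = existing.agg(max("checkin_date")).head.getString(0)

// Append only the new slice, keeping the table consistent.
incoming.filter(incoming("checkin_date") > lastDate)
  .write
  .mode("append")
  .saveAsTable("yelp.checkins")
```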

In this Spark project, we continue building that data warehouse. The purpose of this big data project is to do further data processing so that we can deliver different kinds of data products.
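One building block of such a warehouse is persisting a denormalized join as a Hive table that downstream tools such as Hue and Impala can query. Below is a minimal sketch of this pattern, assuming hypothetical `yelp.business` and `yelp.tip` Hive tables and a Hive-enabled SparkSession; it is not the project's exact code:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of Spark-Hive integration: join two source tables and
// register the denormalized result as a new Hive table.
// Table names are hypothetical.
val spark = SparkSession.builder()
  .appName("YelpWarehouseExample")
  .enableHiveSupport()
  .getOrCreate()

val business = spark.table("yelp.business")
val tip      = spark.table("yelp.tip")

// Join tips onto businesses and persist the result in the metastore,
// so it becomes queryable from Hue/Impala.
business.join(tip, Seq("business_id"))
  .write
  .mode("overwrite")
  .saveAsTable("yelp.business_tips")
```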

Similar Projects

Use the dataset on aviation for analytics to simulate a complex real-world big data pipeline based on messaging with AWS Quicksight, Druid, NiFi, Kafka, and Hive.

In this NoSQL project, we will use two NoSQL databases(HBase and MongoDB) to store Yelp business attributes and learn how to retrieve this data for processing or query.

Learn to write a Hadoop Hive Program for real-time querying.

Curriculum For This Mini Project

Summary of the project agenda
Summary of data structure design
Initial and subsequent import
Dealing with sparse attributes
Normalization of data
Create a Map of category to id
Create Businesses and caching
Query the tables
Time Dimension
Building data products
Time Dimension Analysis
Coalesce into partitions
Designing data products
Analysis on user reviews
Analysis on correlation between tips and reviews
Building executables
Building the importer
Finishing the executable
Make it easy for users to query data
Q&A - Spark RDDs
Lazy evaluation
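The lazy-evaluation topic above can be demonstrated with a small Scala sketch: RDD transformations only build a lineage graph, and nothing executes until an action is called. A local-mode SparkSession is assumed:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of lazy evaluation: transformations (filter, map) only record
// a lineage; computation happens when an action (count, collect) runs.
val spark = SparkSession.builder()
  .appName("LazyEvalExample")
  .master("local[*]")
  .getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 1000)

// No computation happens here -- these are lazy transformations.
val evens = rdd.filter(_ % 2 == 0).map(_ * 10)

// The action below triggers execution of the whole lineage.
println(evens.count())   // 500 even numbers in 1..1000
```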