Yelp Data Processing using Spark and Hive Part 2


In this Spark project, we will continue building the data warehouse from the previous project, Yelp Data Processing Using Spark And Hive Part 1, and will do further data processing to develop diverse data products.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love


Ray Han

Tech Leader | Stanford / Yale University

I think that they are fantastic. I attended Yale and Stanford and have worked at Honeywell, Oracle, and Arthur Andersen (Accenture) in the US. I have taken Big Data and Hadoop, NoSQL, Spark, Hadoop...

Prasanna Lakshmi T

Advisory System Analyst at IBM

Initially, I was unaware of how this would cater to my career needs. But when I stumbled upon the reviews on the website, I went through many of them and found them all positive. I would...

What will you learn

Understanding the project agenda and how to implement Hive and Spark
Data structure design
Importing the necessary libraries
Creating the environment using the Cloudera VMware image
Data normalization and denormalization
Mapping different common attributes
Writing queries for creating and manipulating tables in the Hue editor
Introducing the time dimension
Coalescing the dataset columns into partitions using Scala
Handling snapshots and incremental data loads
Performing basic data preprocessing
Analyzing correlation among different columns using Hue and Impala
Building executables and importers in Scala
Taking queries from users via Excel
Packaging our Spark application as an executable using SBT and running it
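
The partition-coalescing step listed above can be sketched in Scala as follows. This is a minimal sketch, not the project's actual code; the table name `yelp_business`, the partition column `state`, and the partition count are illustrative assumptions.

```scala
// Minimal sketch: coalesce a DataFrame into fewer partitions before
// writing it out as a partitioned Hive table. Table and column names
// (yelp_business, state) are illustrative, not from the project.
import org.apache.spark.sql.SparkSession

object CoalesceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("coalesce-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val business = spark.table("yelp_business")

    // coalesce() reduces the number of partitions without a full shuffle,
    // keeping the number of output files manageable.
    business.coalesce(8)
      .write
      .mode("overwrite")
      .partitionBy("state")
      .saveAsTable("yelp_business_by_state")

    spark.stop()
  }
}
```

`coalesce` is preferred over `repartition` here because it avoids a full shuffle when only reducing the partition count.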

Project Description

In the previous Spark project on the same subject, Yelp Data Processing Using Spark And Hive Part 1, we began developing the Yelp dataset into domains that can easily be understood and consumed. Among other things, we:

  • Explored various ways to ingest data using Spark
  • Transformed data using Spark
  • Integrated Spark and Hive in various ways
  • Denormalized the dataset into Hive tables, thereby creating multiple datasets
  • Discussed how to handle snapshots and incremental data loads
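
The ingest-and-denormalize flow above can be sketched in Scala roughly as follows. This is a hedged sketch, not the project's actual code; the file paths, join key, and table name are illustrative assumptions.

```scala
// Minimal sketch: read raw Yelp JSON with Spark, denormalize by joining
// two entities, and persist the result as a Hive table. Paths and table
// names are illustrative.
import org.apache.spark.sql.SparkSession

object IngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ingest-sketch")
      .enableHiveSupport() // lets saveAsTable write to the Hive metastore
      .getOrCreate()

    val reviews    = spark.read.json("/data/yelp/review.json")
    val businesses = spark.read.json("/data/yelp/business.json")

    // Denormalize: attach business attributes to each review.
    val denormalized = reviews.join(businesses, Seq("business_id"))

    denormalized.write.mode("overwrite").saveAsTable("yelp_review_denorm")

    spark.stop()
  }
}
```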

In this Spark project, we are going to continue building the data warehouse. The purpose of this big data project is to do further data processing to deliver different kinds of data products.

Similar Projects

In this project, we will examine two database platforms, MongoDB and Cassandra, explore the philosophical differences in how they work, and perform analytical queries.

In this hive project, you will design a data warehouse for e-commerce environments.

In this big data project, we will embark on real-time data collection and aggregation from a simulated real-time system using Spark Streaming.

Curriculum For This Mini Project

Summary of the project agenda
Summary of data structure design
Initial and subsequent import
Dealing with sparse attributes
Normalization of data
Create a Map of category to id
Create Businesses and caching
Query the tables
Time Dimension
Building data products
Time Dimension Analysis
Coalesce into partitions
Designing data products
Analysis of user reviews
Analysis of the correlation between tips and reviews
Building executables
Building the importer
Finishing the executable
Make it easy for users to query data
Q&A - Spark RDDs
Lazy evaluation
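
The lazy-evaluation topic in the Q&A above can be illustrated with a short Scala sketch. This is a generic example, not taken from the project; it shows that RDD transformations only build a lineage graph, and nothing executes until an action is called.

```scala
// Minimal sketch of Spark's lazy evaluation: filter and map are lazy
// transformations that only record lineage; the count() action at the
// end is what actually triggers computation.
import org.apache.spark.sql.SparkSession

object LazySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-sketch")
      .master("local[*]")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 1000)

    // Nothing is computed here - these are lazy transformations.
    val evens = rdd.filter(_ % 2 == 0).map(_ * 10)

    // The action below triggers evaluation of the whole pipeline.
    println(evens.count()) // 500 even numbers in 1..1000

    spark.stop()
  }
}
```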