Yelp Data Processing using Spark and Hive Part 2

In this Spark project, we continue building the data warehouse from the previous project, Yelp Data Processing Using Spark And Hive Part 1, and perform further data processing to develop diverse data products.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Understanding the project agenda and how to implement Hive and Spark
Designing the data structure
Importing the necessary libraries
Creating the environment using the Cloudera VMware image
Data normalization and denormalization
Mapping different common attributes
Writing queries for creating and manipulating tables in the Hue editor
Introducing the time dimension
Coalescing dataset columns into partitions using Scala
Handling snapshots and incremental data loads
Performing basic data preprocessing
Analyzing correlation among different columns using Hue and Impala
Building executables and importers in Scala
Taking queries from users using Excel
Packaging our Spark application as an executable using SBT and running it
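The category-to-id mapping mentioned above can be sketched in plain Scala. The category names below are hypothetical stand-ins for the Yelp business categories; in the project the same idea is applied at scale with Spark's `distinct` and `zipWithIndex`:

```scala
// Hypothetical sample of Yelp business categories; in the project these
// are extracted from the business dataset.
val categories = Seq("Restaurants", "Bars", "Coffee & Tea", "Restaurants")

// Assign a stable surrogate id to each distinct category.
// In Spark the equivalent would be categoriesRDD.distinct().zipWithIndex().
val categoryToId: Map[String, Long] =
  categories.distinct.zipWithIndex.map { case (c, i) => c -> i.toLong }.toMap
```

A dimension table like this lets the fact tables store compact ids instead of repeated category strings.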

Project Description

In the previous Spark project on the same subject, Yelp Data Processing Using Spark And Hive Part 1, we began organizing the Yelp dataset into domains that can easily be understood and consumed. Among other things, we covered:

  • Various ways to ingest data using Spark
  • Data transformation using Spark
  • Various ways of integrating Spark and Hive
  • Denormalizing the dataset into Hive tables, thereby creating multiple datasets
  • How to handle snapshots and incremental data loads
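The snapshot and incremental-load point can be illustrated with a small plain-Scala sketch. The `Review` case class and its fields are hypothetical; in the project the same merge is expressed over Hive-backed tables, keeping the most recent version of each record per key:

```scala
// Hypothetical record shape: id identifies the row, ts is a load timestamp.
case class Review(id: String, stars: Int, ts: Long)

// Merge an existing snapshot with an incremental delta, keeping the
// latest version of each record. In Spark this would typically be a
// union followed by a groupBy (or window) over the key.
def applyIncremental(snapshot: Seq[Review], delta: Seq[Review]): Map[String, Review] =
  (snapshot ++ delta).groupBy(_.id).map { case (id, rs) => id -> rs.maxBy(_.ts) }

val snapshot = Seq(Review("r1", 3, ts = 1L), Review("r2", 4, ts = 1L))
val delta    = Seq(Review("r1", 5, ts = 2L), Review("r3", 2, ts = 2L))
val merged   = applyIncremental(snapshot, delta)
```

The same last-write-wins rule works whether the delta is a daily file drop or a streaming micro-batch.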

In this Spark project, we are going to continue building the data warehouse. The purpose of this big data project is to do further data processing to deliver different kinds of data products.

Similar Projects

Analyze clickstream data of a website using Hadoop Hive to increase sales by optimizing every aspect of the customer experience on the website from the first mouse click to the last.

In this big data project, we will discover songs for those artists that are associated with the different cultures across the globe.

In this big data project, we will look at how to mine and make sense of connections in a simple way by building a Spark GraphX Algorithm and a Network Crawler.

Curriculum For This Mini Project

Summary of the project agenda
Summary of data structure design
Initial and subsequent import
Dealing with sparse attributes
Normalization of data
Create a Map of category to id
Create Businesses and caching
Query the tables
Time Dimension
Building data products
Time Dimension Analysis
Coalesce into partitions
Designing data products
Analysis of user reviews
Analysis of the correlation between tips and reviews
Building executables
Building the importer
Finishing the executable
Make it easy for users to query data
Q&A - Spark RDDs
Lazy evaluation
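The lazy-evaluation topic in the curriculum can be demonstrated without a cluster using Scala collection views, which defer work the same way RDD transformations do until an action forces them. This is a plain-Scala analogy, not the Spark API itself:

```scala
var evaluated = 0

// Like RDD transformations, view-based map/filter only build a recipe;
// no element is touched yet.
val pipeline = (1 to 10).view
  .map { x => evaluated += 1; x * 2 }
  .filter(_ > 10)

val beforeAction = evaluated   // still 0: nothing has run

// Like an RDD action (e.g. collect), toList forces the whole pipeline.
val result = pipeline.toList

val afterAction = evaluated    // now 10: each element was mapped exactly once
```

Spark's RDDs behave the same way: `map` and `filter` record lineage, and computation happens only when an action such as `collect` or `count` is called.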