Yelp Data Processing using Spark and Hive Part 2

In this spark project, we will continue building the data warehouse from the previous project Yelp Data Processing Using Spark And Hive Part 1 and will do further data processing to develop diverse data products.

What will you learn

  • Data normalization and denormalization
  • Handling snapshots and incremental data loads
  • Introducing the time dimension
  • Analysis of data products
  • Packaging our Spark application as an executable using sbt and running it (see the build sketch after this list)
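
To give a rough idea of that packaging step, here is a minimal build sketch. The project name, Scala and Spark versions, and main class are illustrative assumptions, not the course's actual build definition.

    // build.sbt -- a minimal sketch of packaging a Spark application with sbt.
    // Project name, versions, and the main class are illustrative assumptions.
    name := "yelp-dwh"
    version := "0.1.0"
    scalaVersion := "2.12.18"

    libraryDependencies ++= Seq(
      // "provided" keeps Spark out of the fat jar; the cluster supplies it at runtime
      "org.apache.spark" %% "spark-sql"  % "3.4.1" % "provided",
      "org.apache.spark" %% "spark-hive" % "3.4.1" % "provided"
    )

    // Requires the sbt-assembly plugin declared in project/plugins.sbt
    assembly / mainClass := Some("com.example.yelp.Importer")

With a build like this, running sbt assembly produces a single runnable jar that can be handed to spark-submit along with the main class name.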

What will you get

  • Access to the recording of the complete project
  • Access to all project material, such as data files and solution files

Project Description

In the previous Spark project on the same subject, Yelp Data Processing Using Spark And Hive Part 1, we began shaping the Yelp dataset into domains that can easily be understood and consumed. Among other things, we covered:

  • Various ways to ingest data using Spark
  • Data transformation using Spark
  • Various ways of integrating Spark and Hive
  • Denormalizing the dataset into Hive tables, thereby creating multiple datasets (a minimal sketch of this kind of step follows the list)
  • Handling snapshots and incremental data loads
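
As a rough illustration of what such an ingest-and-denormalize step can look like, here is a minimal sketch. The file paths, database and table names, and selected columns are assumptions for illustration, not the project's actual schema.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object YelpIngestSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("yelp-ingest-sketch")
          .enableHiveSupport()                 // read/write Hive tables from Spark
          .getOrCreate()

        // Ingest the raw Yelp dumps (line-delimited JSON files)
        val business = spark.read.json("/data/yelp/business.json")
        val review   = spark.read.json("/data/yelp/review.json")

        // Denormalize: attach a few business attributes to each review so that
        // downstream queries do not need a join
        val reviewWide = review.join(
          business.select(col("business_id"), col("name").as("business_name"),
                          col("city"), col("state")),
          Seq("business_id")
        )

        // Persist the denormalized dataset as a Hive table
        reviewWide.write.mode("overwrite").saveAsTable("yelp.review_denorm")

        spark.stop()
      }
    }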

In this Spark project, we continue building the data warehouse. The purpose of this big data project is to do further data processing and deliver different kinds of data products.
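
One way to picture such a data product: derive a time dimension from the review date and aggregate review volume and average rating per business per month. This is only a sketch; the table and column names are assumptions carried over from the ingest sketch above, not the project's actual design.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object ReviewTrendSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("review-trend-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Start from the denormalized reviews table built earlier (assumed name)
        val reviews = spark.table("yelp.review_denorm")

        // Introduce a time dimension derived from the review date
        val withTime = reviews
          .withColumn("review_date",  to_date(col("date")))
          .withColumn("review_year",  year(col("review_date")))
          .withColumn("review_month", month(col("review_date")))

        // Aggregate into a small, query-friendly data product
        val trend = withTime
          .groupBy("business_id", "review_year", "review_month")
          .agg(count("*").as("review_count"), avg("stars").as("avg_stars"))

        // Coalesce to a handful of partitions before writing; the result is small
        trend.coalesce(4)
          .write.mode("overwrite")
          .saveAsTable("yelp.review_trend_monthly")

        spark.stop()
      }
    }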

Curriculum For This Mini Project

 
  • Summary of the project agenda (05m)
  • Summary of data structure design (06m)
  • Initial and subsequent import (19m)
  • Dealing with sparse attributes (08m)
  • Normalization of data (18m)
  • Create a Map of category to id (15m)
  • Create Businesses and caching (16m)
  • Query the tables (18m)
  • Time Dimension (16m)
  • Building data products (07m)
  • Time Dimension Analysis (20m)
  • Coalesce into partitions (06m)
  • Designing data products (10m)
  • Analysis on user reviews (12m)
  • Analysis on correlation between tips and reviews (18m)
  • Building executables (14m)
  • Building the importer (17m)
  • Finishing the executable (08m)
  • Make it easy for users to query data (07m)
  • Q&A - Spark RDDs (05m)
  • Lazy evaluation (06m)