Yelp Data Processing Using Spark And Hive Part 1

In this big data project, we continue from a previous Hive project, "Data engineering on Yelp Datasets using Hadoop tools", and do the entire data processing using Spark.

What will you learn

  • Doing data processing using Spark
  • Normalizing and denormalizing datasets into Hive tables
  • Various ways of integrating Hive and Spark
  • Working with various complex data structures in Hive through Spark
  • Exporting some of the processed datasets to an RDBMS

What will you get

  • Access to recording of the complete project
  • Access to all project materials, such as data files and solution files

Project Description

Data engineering is the practice of acquiring, aggregating, processing, and storing data, either in batch or in real time, and of providing a variety of means of serving that data to other users, who may include data scientists. It involves applying software engineering practices to big data.

In this big data project for beginners, we continue from a previous Hive project, "Data engineering on Yelp Datasets using Hadoop tools", where we applied some data engineering principles to the Yelp dataset in the areas of processing, storage, and retrieval. As in that session, we will not cover data ingestion, since the data is downloaded directly from the Yelp challenge website. Unlike that session, we will do the entire data processing using Spark.
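To give a feel for what normalizing and denormalizing mean here, the following is a minimal sketch in plain Python rather than Spark or Hive. The record shape and the field names (`user_id`, `name`, `friends`) are assumptions made for illustration and are not taken from the project material. In Spark, the same split would typically be done by exploding the array column into one row per element.

```python
# Hypothetical records, loosely shaped like nested user JSON.
# Field names are assumptions for illustration only.
users_nested = [
    {"user_id": "u1", "name": "Ann", "friends": ["u2", "u3"]},
    {"user_id": "u2", "name": "Bob", "friends": ["u1"]},
    {"user_id": "u3", "name": "Cy", "friends": []},
]


def normalize(records):
    """Split nested records into a flat users table and a friendships table."""
    users = [{"user_id": r["user_id"], "name": r["name"]} for r in records]
    friendships = [
        {"user_id": r["user_id"], "friend_id": friend}
        for r in records
        for friend in r["friends"]  # one row per array element
    ]
    return users, friendships


def denormalize(users, friendships):
    """Join the two flat tables back into one nested record per user."""
    friends_by_user = {u["user_id"]: [] for u in users}
    for row in friendships:
        friends_by_user[row["user_id"]].append(row["friend_id"])
    return [
        {
            "user_id": u["user_id"],
            "name": u["name"],
            "friends": friends_by_user[u["user_id"]],
        }
        for u in users
    ]


users, friendships = normalize(users_nested)
rebuilt = denormalize(users, friendships)
```

The normalized form keeps each fact in one place, at the cost of a join when the nested view is needed again. The denormalized form is the reverse trade-off.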

Curriculum For This Mini Project

 
  • Introduction to the Yelp dataset (02m)
  • Objectives of this project (03m)
  • Introduction to the JSON schema (09m)
  • Agenda (00m)
  • Read the data and transform to Hive parquet table (06m)
  • Ingest JSON data using Spark (11m)
  • Write to HDFS (09m)
  • Integrate Hive with Spark (33m)
  • Understanding Normalizing and Denormalizing (16m)
  • Normalizing and Denormalizing datasets into Hive tables (39m)
  • Transform the table and write in a single line (08m)
  • Query to find users with more followers than their friends (05m)
  • Error troubleshooting (01m)
  • Initial import of data (16m)
  • Exploring various data structures (19m)
  • Exploring arrays (17m)
  • Designing the analysis (17m)