Yelp Data Processing Using Spark And Hive Part 1

In this big data project, we will continue from the previous Hive project "Data engineering on Yelp Datasets using Hadoop tools" and do the entire data processing using Spark.



What will you learn

  • Introduction to the dataset and the objectives of this project

  • What JSON files are and how to work with a JSON data schema

  • Reading the data and transforming it into a Hive table

  • What data ingestion is and how to ingest data using Spark

  • How to create a data storage and distribution center

  • Creating customized HDFS directories and saving data

  • Various ways of integrating Hive with Spark

  • Saving a file and building a Hive table on the output

  • What normalization and denormalization are and when to use them

  • Normalizing and denormalizing the dataset into Hive tables

  • Joining different datasets

  • Working with complex data structures in Hive through Spark

  • Transforming tables and managing their size

  • Writing customized Hive queries, including self-joins among tables

  • Understanding arrays and designing the final analysis

Project Description

Data engineering is the science of acquiring, aggregating, processing, and storing data, either in batch or in real time, as well as providing a variety of means of serving that data to other users, such as data scientists. It involves applying software engineering practices to big data.

In this big data project for beginners, we will continue from the previous Hive project, "Data engineering on Yelp Datasets using Hadoop tools", where we applied data engineering principles to the Yelp dataset in the areas of processing, storage, and retrieval. As in that session, we will not cover data ingestion, since the data is downloaded directly from the Yelp challenge website. Unlike that session, however, we will do the entire data processing using Spark.


Curriculum For This Mini Project

  Introduction to the Yelp dataset
  Objectives of this project
  Introduction to the JSON schema
  Read the data and transform it into a Hive Parquet table
  Ingest JSON data using Spark
  Write to HDFS
  Integrate Hive with Spark
  Understanding normalization and denormalization
  Normalizing and denormalizing datasets into Hive tables
  Transform the table and write it in a single line
  Query to find users with more followers than their friends
  Error troubleshooting
  Initial import of data
  Exploring various data structures
  Exploring arrays
  Designing the analysis