Yelp Data Processing Using Spark And Hive Part 1


In this big data project, we will continue from a previous Hive project, "Data engineering on Yelp Datasets using Hadoop tools," and do the entire data processing using Spark.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.


Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.


Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love


Mike Vogt

Information Architect at Bank of America

I have had a very positive experience. The platform is very rich in resources, and the expert was thoroughly knowledgeable on the subject matter - real world hands-on experience. I wish I had this...


Arvind Sodhi

VP - Data Architect, CDO at Deutsche Bank

I have extensive experience in data management and data processing. Over the past few years I saw the data management technology transition into the Big Data ecosystem and I needed to follow suit. I...

What will you learn

Introduction to the dataset and objectives of this project
What the JSON file format is and the data schema of a JSON file
Reading the data and transforming it into a Hive table
What data ingestion is and how to ingest data using Spark
How to create a data storage and distribution layer
Creating customized HDFS directories and saving data
Various ways of integrating Hive with Spark
Saving a file and building a Hive table on the output
What normalization and denormalization are and when to use each
Normalizing and denormalizing datasets into Hive tables
Joining different datasets
Various complex data structures in Hive through Spark
Transforming the size of a table
Writing customized Hive queries that perform self-joins among tables
Understanding arrays and designing the final analysis

Project Description

Data engineering is the discipline of acquiring, aggregating, processing, and storing data, either in batch or in real time, as well as serving that data to other users, such as data scientists. It applies software engineering practices to big data.

In this big data project for beginners, we will continue from a previous Hive project, "Data engineering on Yelp Datasets using Hadoop tools," where we applied data engineering principles to the Yelp dataset in the areas of processing, storage, and retrieval. As in that session, we will not cover data ingestion, since we are downloading the data directly from the Yelp challenge website. Unlike that session, however, we will do the entire data processing using Spark.

Similar Projects

In this big data project, we will be performing an OLAP cube design using AdventureWorks database. The deliverable for this session will be to design a cube, build and implement it using Kylin, query the cube and even connect familiar tools (like Excel) with our new cube.

In this Spark project, we are going to bring processing to the speed layer of the lambda architecture, which opens up capabilities to monitor application performance in real time, measure real-time comfort with applications, and raise real-time alerts in case of security incidents.

In this project, we will walk through all the various classes of NoSQL database and try to establish where they are the best fit.

Curriculum For This Mini Project

Introduction to the Yelp dataset
02m
Objectives of this project
03m
Introduction to the JSON schema
09m
Agenda
00m
Read the data and transform to Hive parquet table
06m
Ingest Json data using Spark
11m
Write to HDFS
09m
Integrate Hive with Spark
33m
Understanding Normalizing and Denormalizing
16m
Normalizing and Denormalizing datasets into Hive tables
39m
Transform the table and write in a single line
08m
Query to find users with more followers than their friends
05m
Error troubleshooting
01m
Initial import of data
16m
Exploring various data structures
19m
Exploring arrays
17m
Designing the analysis
17m