Hadoop Project-Analysis of Yelp Dataset using Hadoop Hive

The goal of this Hadoop project is to apply data engineering principles to the Yelp Dataset in the areas of processing, storage, and retrieval.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love

Swati Patra

Systems Advisor , IBM

I have 11 years of experience and work with IBM. My domain is Travel, Hospitality and Banking - both sectors process lots of data. The way the projects were set up and the mentors' explanation was...

Camille St. Omer

Artificial Intelligence Researcher, Quora 'Most Viewed Writer' in 'Data Mining'

I came to the platform with no experience and now I am knowledgeable in Machine Learning with Python. No easy thing I must say, the sessions are challenging and go to the depths. I looked at graduate...

What will you learn

Understanding Data Engineering, different roles, and tools used
Understanding the Yelp dataset
What a dataset schema is and how to create your own schema
Tools to be used during data processing
What the Hadoop small file problem is and how to solve it
Understanding the Hadoop small file problem using an example
How to use an online system instead of a batch system
Serving layer vs Batch layer (Neo4j vs HDFS)
Data Sampling and Understanding
Understanding database tables and creating them in HDFS
Provisioning access to data using Hive/Impala
Selecting Parquet or Avro for creating schemas for your data
Performing Data analysis and Data modeling on the dataset
Solving Complex Cases in Hadoop
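
As a rough illustration of the Hadoop small file problem listed above, the sketch below estimates how much NameNode memory a set of files consumes. It assumes the commonly quoted rule of thumb of about 150 bytes per NameNode object (one object per file plus one per block) and a 128 MB block size; both figures are approximations, and the function names are hypothetical rather than taken from the project.

```python
import math

# Assumptions (rules of thumb, not exact figures for any given cluster):
NAMENODE_BYTES_PER_OBJECT = 150      # approximate heap cost per file or block object
BLOCK_SIZE = 128 * 1024 * 1024       # a common HDFS block size


def namenode_objects(num_files, file_size_bytes, block_size=BLOCK_SIZE):
    """Count the file and block objects the NameNode must track."""
    blocks_per_file = max(1, math.ceil(file_size_bytes / block_size))
    return num_files * (1 + blocks_per_file)


def namenode_memory_bytes(num_files, file_size_bytes, block_size=BLOCK_SIZE):
    """Estimate NameNode memory used by files of a uniform size."""
    objects = namenode_objects(num_files, file_size_bytes, block_size)
    return objects * NAMENODE_BYTES_PER_OBJECT


# One million 100 KB files versus the same data compacted into block-sized files.
small_file_count = 1_000_000
small_file_size = 100 * 1024
total_bytes = small_file_count * small_file_size

compacted_file_count = math.ceil(total_bytes / BLOCK_SIZE)

small_cost = namenode_memory_bytes(small_file_count, small_file_size)
compacted_cost = namenode_memory_bytes(compacted_file_count, BLOCK_SIZE)

print(f"small files:     {small_file_count} files, ~{small_cost} bytes of NameNode memory")
print(f"compacted files: {compacted_file_count} files, ~{compacted_cost} bytes of NameNode memory")
```

The data volume is identical in both cases; only the number of objects the NameNode tracks changes, which is why compacting small files into larger ones is a common remedy.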

Project Description

Data engineering is the practice of acquiring, aggregating, processing, and storing data, either in batch or in real time, and of providing a variety of ways to serve that data to other users, such as data scientists. It applies software engineering practices to big data.

The goal of this big data project is to apply data engineering principles to the Yelp Dataset in the areas of processing, storage, and retrieval. We will not cover data ingestion, since we download the data directly from the Yelp Challenge website.
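
Before any tables are created, it usually helps to sample the raw data and see which fields and types it contains. The following is a minimal sketch of that step, assuming the downloaded files are newline-delimited JSON (one record per line). The `infer_schema` helper and the sample records, including field names such as `business_id` and `stars`, are hypothetical and used only for illustration.

```python
import io
import json
from collections import defaultdict


def infer_schema(lines, sample_size=1000):
    """Map each field name to the sorted list of Python type names seen.

    Reads at most `sample_size` lines and skips blank ones.
    """
    seen = defaultdict(set)
    for index, line in enumerate(lines):
        if index >= sample_size:
            break
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


# Hypothetical sample records standing in for a real data file.
sample = io.StringIO(
    '{"business_id": "b1", "stars": 4.5, "review_count": 12}\n'
    '{"business_id": "b2", "stars": 3, "review_count": 7, "city": null}\n'
)

schema = infer_schema(sample)
for field, types in schema.items():
    print(field, types)
```

Fields that show more than one type, or that are sometimes missing or null, are the ones to watch when writing the table schema.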

Similar Projects

In this Hive project, you will design a data warehouse for e-commerce environments.

In this project, we will evaluate and demonstrate how to handle unstructured data using Spark.

In this big data project, we will embark on real-time data collection and aggregation from a simulated real-time system using Spark Streaming.

Curriculum For This Mini Project

What is Data Engineering?
The Yelp Dataset
Dataset schema and Job roles
Data format and storage
Data processing tools
Hadoop small file problem
Example - Hadoop small file problem
Data provisioning
Data sampling and understanding - 1
Create database tables
Parquet versus Avro
Data Analysis
Data Modelling
Complex cases