Hadoop Project - Analysis of Yelp Dataset using Hadoop Hive

The goal of this Hadoop project is to apply data engineering principles to the Yelp Dataset in the areas of processing, storage, and retrieval.
Videos
Each project comes with 2-5 hours of micro-videos explaining the solution.
Code & Dataset
Get access to 50+ solved projects with IPython notebooks and datasets.
Project Experience
Add project experience to your LinkedIn/GitHub profiles.

What will you learn

  • Making decisions on data storage and access
  • Processing and storing a variety of data formats
  • Avoiding Hadoop's small file problem
  • Processing and storing binary content
  • Provisioning access to data using Hive/Impala
  • Serving layer vs batch layer (Neo4j vs HDFS)
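
To make the small file item above more concrete: HDFS keeps metadata for every file in NameNode memory, so thousands of tiny files are more costly to manage than a few large ones. One common mitigation is to consolidate small files before loading them. The following is only a minimal local sketch of that idea, not the project's actual solution. The function name merge_small_json_files and the one-JSON-object-per-file input layout are assumptions made for illustration.

```python
import json
from pathlib import Path


def merge_small_json_files(input_dir, output_path):
    """Concatenate every *.json file in input_dir into one JSON Lines file.

    Hypothetical helper: assumes each input file holds a single JSON value.
    Returns the number of records written.
    """
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        # Sort the paths so the merged file has a deterministic order.
        for path in sorted(Path(input_dir).glob("*.json")):
            with open(path, encoding="utf-8") as src:
                record = json.load(src)
            # Write one record per line (JSON Lines).
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

The single merged file could then be uploaded to HDFS, or converted to a container format such as Avro or Parquet, instead of uploading each small file separately.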

Project Description

Data engineering is the science of acquiring, aggregating or collecting, processing, and storing data, either in batch or in real time, and of providing a variety of ways to serve that data to other users, such as data scientists. It applies software engineering practices to big data.

The goal of this big data project is to apply data engineering principles to the Yelp Dataset in the areas of processing, storage, and retrieval. We will not cover data ingestion, since the data is downloaded directly from the Yelp challenge website.

Curriculum For This Mini Project

 
  • Overview (05m)
  • What is Data Engineering? (08m)
  • The Yelp Dataset (03m)
  • Dataset schema and job roles (13m)
  • Data format and storage (07m)
  • Data processing tools (10m)
  • Hadoop small file problem (15m)
  • Example - Hadoop small file problem (08m)
  • Data provisioning (07m)
  • Data sampling and understanding - 1 (14m)
  • Create database tables (05m)
  • Parquet versus Avro (19m)
  • Data Analysis (36m)
  • Data Modelling (29m)
  • Complex cases (16m)