Hadoop Project: Analysis of the Yelp Dataset Using Hadoop Hive

The goal of this Hadoop project is to apply data engineering principles to the Yelp Dataset in the areas of processing, storage, and retrieval.

What will you learn

  • Making decisions on data storage and access
  • How to process and store a variety of data formats
  • Avoiding Hadoop's small-files problem
  • Processing and storing binary content
  • Provisioning access to data using Hive/Impala
  • Serving layer vs. batch layer (Neo4j vs. HDFS)
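
As an illustration of the small-files topic above: HDFS keeps metadata for every file in NameNode memory, so a large number of tiny files is costly. One common remedy is to merge small files into fewer, larger ones. The sketch below only plans such a merge by grouping files into batches close to a target size. The function name `plan_compaction`, the 128 MiB target, and the sample file names are assumptions made for this example; they are not taken from the project material.

```python
from typing import Iterable, List, Tuple

# Assumed target size: 128 MiB, a common default HDFS block size.
TARGET_BYTES = 128 * 1024 * 1024


def plan_compaction(
    files: Iterable[Tuple[str, int]], target_bytes: int = TARGET_BYTES
) -> List[List[str]]:
    """Group (name, size_in_bytes) pairs into batches of roughly target_bytes.

    Each batch is meant to be merged into one larger file, so the NameNode
    has far fewer objects to track than with the original small files.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical file listing: (name, size in bytes).
    sample = [("part-0001.json", 40), ("part-0002.json", 50), ("part-0003.json", 30)]
    for batch in plan_compaction(sample, target_bytes=100):
        print(batch)
```

The planner is greedy and keeps files in input order, which is simple but not optimal. A file larger than the target ends up in a batch of its own.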

What will you get

  • Access to recordings of the complete project
  • Access to all material related to the project, such as data files and solution files


Project Description

Data engineering is the science of acquiring, aggregating or collecting, processing, and storing data, either in batch or in real time, as well as providing a variety of means of serving that data to other users, who could include data scientists. It involves applying software engineering practices to big data.

The goal of this big data project is to apply data engineering principles to the Yelp Dataset in the areas of processing, storage, and retrieval. We will not cover data ingestion, since we are already downloading the data from the Yelp Challenge website.



Big Data & Enterprise Software Engineer

I am passionate about software development, databases, data analysis, and the Android platform. My native language is Java, but no one has stopped me so far from learning and using Angular and Node.js. Data and data analysis are thrilling, and so are my experiences with SQL on Oracle, Microsoft SQL Server, Postgres and MyS...