Introduction to the dataset and objectives of this project
What a JSON file is and the schema of JSON data
Reading the data and transforming it into a Hive table
What ingestion is and how to ingest data using Spark
How to create a data storage and distribution center
Creating a customized HDFS location and saving the data
Various ways of integrating Hive with Spark
Saving a file and building a Hive table on the output
What normalization and denormalization are and when to use them
Normalizing and denormalizing the dataset into Hive tables
Joining different datasets
Various complex data structures in Hive through Spark
Transforming the size of a table
Writing customized queries in Hive, performing self joins among tables
Understanding arrays and designing the final analysis
Data engineering is the discipline of acquiring, aggregating, processing, and storing data, either in batch or in real time, as well as providing a variety of ways to serve that data to other users, such as data scientists. It applies software engineering practices to big data.
In this big data project for beginners, we continue from a previous Hive project, "Data engineering on Yelp Datasets using Hadoop tools", in which we applied data engineering principles to the Yelp dataset in the areas of processing, storage, and retrieval. As in that session, we do not cover data ingestion, since the data is downloaded directly from the Yelp challenge website. Unlike that session, however, we perform the entire data processing in Spark.