Tough engineering choices with large datasets in Hive Part - 1

Explore how to use Hive efficiently in this Hadoop Hive project by working with file formats such as JSON, CSV, ORC, and Avro and comparing their relative performance.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your Linkedin/Github profiles.

Customer Love

Dhiraj Tandon

Solution Architect-Cyber Security at ColorTokens

My Interaction was very short but left a positive impression. I enrolled and asked for a refund since I could not find the time. What happened next: They initiated Refund immediately. Their...

Mike Vogt

Information Architect at Bank of America

I have had a very positive experience. The platform is very rich in resources, and the expert was thoroughly knowledgeable on the subject matter - real world hands-on experience. I wish I had this...

What you will learn

Understanding the road map of the project
Setting up a virtual environment in the Cloudera QuickStart VM (VMware)
Downloading the Airline On-Time Performance data
Understanding the use of Hive as a transformation layer
Various uses of Hive (partitioning, clustering, integration, etc.)
Creating a Star Schema for the Dataset
Creating Database and tables in HQL
Performing Statistical Data Analysis and Visualizing the data
How to use and interpret Hive's EXPLAIN command
File formats and their relative performance (Text, JSON, SequenceFile, Avro, ORC, and Parquet)
Comparing Apache Hive, Apache Pig, Apache Spark, and Hadoop MapReduce
Understanding Distributed Computing via MapReduce
Spark and Hive for transformation
Improving query performance on the dataset using partitioning
Using HCatalog to prevent information loss during partitioning
Improving query times, including sampling and map-side joins, by clustering
Execution engines and performance
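
As a rough illustration of the partitioning, clustering, sampling, and EXPLAIN topics listed above, the sketch below builds the kind of HiveQL statements involved. It is not taken from the project materials: the table name `flights_part`, the column names, and the bucket count are hypothetical placeholders, and the Python wrapper only assembles the statement text so it can be pasted into a Hive session.

```python
from typing import Dict

# Hypothetical columns, loosely modelled on an airline on-time dataset.
COLUMNS: Dict[str, str] = {
    "flight_num": "INT",
    "origin": "STRING",
    "dest": "STRING",
    "dep_delay": "INT",
}

# Hypothetical partition columns; each distinct pair becomes its own directory.
PARTITION_COLUMNS: Dict[str, str] = {"flight_year": "INT", "flight_month": "INT"}


def _column_list(cols: Dict[str, str]) -> str:
    """Render a name-to-type mapping as 'name TYPE, name TYPE, ...'."""
    return ", ".join(f"{name} {hive_type}" for name, hive_type in cols.items())


def create_table_ddl(table: str, bucket_column: str, buckets: int) -> str:
    """DDL for a table that is both partitioned and clustered (bucketed)."""
    return (
        f"CREATE TABLE {table} ({_column_list(COLUMNS)}) "
        f"PARTITIONED BY ({_column_list(PARTITION_COLUMNS)}) "
        f"CLUSTERED BY ({bucket_column}) INTO {buckets} BUCKETS "
        f"STORED AS ORC;"
    )


def explain_query(table: str, year: int, month: int) -> str:
    """EXPLAIN for a query that filters on the partition columns."""
    return (
        f"EXPLAIN SELECT origin, AVG(dep_delay) FROM {table} "
        f"WHERE flight_year = {year} AND flight_month = {month} "
        f"GROUP BY origin;"
    )


def sample_query(table: str, bucket: int, buckets: int, bucket_column: str) -> str:
    """Bucket-based sampling over the clustered column."""
    return (
        f"SELECT * FROM {table} "
        f"TABLESAMPLE(BUCKET {bucket} OUT OF {buckets} ON {bucket_column}) s;"
    )


if __name__ == "__main__":
    print(create_table_ddl("flights_part", "origin", 8))
    print(explain_query("flights_part", 2008, 1))
    print(sample_query("flights_part", 1, 8, "origin"))
```

Running the EXPLAIN statement against such a table is one way to check whether a filter on the partition columns actually limits the data Hive plans to scan.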

Project Description

Hive and the Hive metastore are so ubiquitous in big data engineering that using the tool efficiently is a factor in the success of many big data projects. Whether Hive is integrated with Spark or used as an ETL tool, many big data projects fail or succeed as they grow in scale and complexity because of decisions made early in the lifecycle of the analytics project.

In this Hive project, we will explore how to use Hive efficiently. The format is exploratory rather than project-building: the goal of these sessions is to explore Hive in uncommon ways on the way to mastery.

We will use different sample datasets across this series of Hive real-time projects and explore Hadoop file formats such as text, CSV, JSON, ORC, Parquet, Avro, and SequenceFile. We will also look at compression and different codecs, and at how each format performs when integrated with either Spark or Impala. The idea of this Hadoop Hive project is to explore enough that we can make a reasonable argument about what to do, or not do, in any given scenario.
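
One way to set up such a comparison, sketched here under assumptions, is to copy the same source table into one table per storage format and then run identical queries against each copy. The table names (`flights_raw`, `flights_orc`, and so on) are hypothetical and not from the project. The Python code only generates `CREATE TABLE ... AS SELECT` statements using Hive's `STORED AS` clause, and it runs nothing against a cluster.

```python
from typing import List

# Hypothetical names; the project description does not specify any tables.
SOURCE_TABLE = "flights_raw"
TARGET_PREFIX = "flights"

# Storage formats that Hive accepts in a STORED AS clause.
FORMATS = ["TEXTFILE", "SEQUENCEFILE", "AVRO", "ORC", "PARQUET"]


def ctas_statement(source: str, prefix: str, fmt: str) -> str:
    """Build a CTAS statement that copies `source` into a table stored as `fmt`."""
    target = f"{prefix}_{fmt.lower()}"
    return f"CREATE TABLE {target} STORED AS {fmt} AS SELECT * FROM {source};"


def build_statements(source: str = SOURCE_TABLE,
                     prefix: str = TARGET_PREFIX) -> List[str]:
    """One CTAS statement per storage format, in the order of FORMATS."""
    return [ctas_statement(source, prefix, fmt) for fmt in FORMATS]


if __name__ == "__main__":
    for statement in build_statements():
        print(statement)
```

With the copies in place, on-disk size and query time can be measured per format under whatever compression codec is configured. JSON is left out of this sketch because it typically needs a SerDe declaration rather than a bare `STORED AS` keyword.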

Similar Projects

In this PySpark project, you will simulate a complex real-world data pipeline based on messaging. This project is deployed using the following tech stack - NiFi, PySpark, Hive, HDFS, Kafka, Airflow, Tableau and AWS QuickSight.

In this Hive project, you will denormalize JSON data and create Hive scripts that use the ORC file format.

In this big data project, we will perform an OLAP cube design using the AdventureWorks database. The deliverable for this session is to design a cube, build and implement it using Kylin, query the cube, and connect familiar tools (like Excel) to the new cube.

Curriculum For This Mini Project

Overview of the Project
Datasets used for the Project
Downloading IBM Analytics DemoCloud
Logging in to IBM Analytics DemoCloud
Downloading Airline On-Time Performance Dataset
Introduction to Hive
General Discussion on the Purpose of the Project
Agenda for the Project
Star Schema
Run Scripts to Create Database
Data Exploration
Data Analysis
Why Is Hive Still the Swiss Army Knife of Big Data?
Data Analysis Continuation
Quick Recap of the Previous Session
Use Hive Integration to Read Data - Hive Metastore
Partitioning using HCatalog
Partitioning - Alter, Drop, Move Partitions (Notes)
Explain and Statistics
Different Types of Explain