Tough engineering choices with large datasets in Hive Part - 2


This project continues the previous Hive project, "Tough engineering choices with large datasets in Hive Part - 1", in which we work on processing big datasets using Hive.


Each project comes with 2-5 hours of micro-videos explaining the solution.



What will you learn

Understanding the RoadMap of the project
Common misuses and abuses of Hive
Ordering, Clustering and Distributing dataset on different attributes
Understanding different types of Sampling Methods
Differences between random and bucket sampling, and their implementations
Using different big data tools via the Hortonworks Sandbox
Installing Apache Ambari
How to use and interpret Hive's explain command
Understanding Record-Level Sampling and Block-Level Sampling for computing clusters
Understanding different types of Hive execution engines (Tez, MapReduce, and Spark)
Integrating Hadoop applications natively with Apache Hadoop YARN using Tez
Different types of Joins like Skew Join, Bucket Map Join etc.
File formats and their relative performance (Text, JSON, SequenceFile, Avro, ORC, and Parquet)
Compression and Compression Codec
Using Spark and Hive for transformations
Understanding bitmap indexes in the context of databases
Hive and Impala - making choices
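Several of the ordering topics above can be previewed in a few lines of HiveQL. The sketch below uses a hypothetical table `clicks(user_id, ts)` purely for illustration:

```sql
-- Hypothetical table clicks(user_id INT, ts TIMESTAMP), for illustration only.

-- ORDER BY: a total ordering, funneled through a single reducer (slow at scale).
SELECT user_id, ts FROM clicks ORDER BY ts;

-- SORT BY: each reducer's output is sorted, but there is no global order.
SELECT user_id, ts FROM clicks SORT BY ts;

-- DISTRIBUTE BY: rows with the same user_id are routed to the same reducer.
SELECT user_id, ts FROM clicks DISTRIBUTE BY user_id SORT BY user_id, ts;

-- CLUSTER BY: shorthand for DISTRIBUTE BY x SORT BY x on the same column.
SELECT user_id, ts FROM clicks CLUSTER BY user_id;
```

Prefixing any of these statements with `EXPLAIN` prints the query plan, which is how the course's discussion of interpreting Hive's explain output would apply in practice.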

Project Description

The use of Hive, or of the Hive metastore, is so ubiquitous in big data engineering that using the tool efficiently is a factor in the success of many projects. Whether integrating with Spark or using Hive as an ETL tool, many projects fail or succeed as they grow in scale and complexity because of decisions made early in the project.

In this big data project on Hive, we will explore how to use Hive efficiently. This Hive project follows an exploratory pattern rather than a project-building pattern; the goal is to explore Hive in uncommon ways on the path to mastery.

We will use different datasets in these sessions, exploring Hadoop file formats such as text, CSV, JSON, ORC, Parquet, Avro, and SequenceFile. We will also look at compression and different codecs, and compare the performance of each format when integrating with either Spark or Impala.
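As a hedged sketch of how the file-format and compression comparisons look in HiveQL (the table and column names here are hypothetical, and the exact property names can vary by Hive version):

```sql
-- Hypothetical events table stored as ORC with Snappy compression.
CREATE TABLE events_orc (event_id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- The same data rewritten as Parquet; parquet.compression is assumed to be
-- honored by this Hive version (it is a common session-level setting).
SET parquet.compression=GZIP;
CREATE TABLE events_parquet STORED AS PARQUET AS
SELECT event_id, payload FROM events_orc;
```

Timing identical queries against each table, and reading them from Spark or Impala, gives the kind of relative-performance evidence the project builds its arguments on.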

The idea is to explore enough that we can make a reasonable argument about what to do, and what to avoid, in any given big data scenario.

Similar Projects

In this Databricks Azure tutorial project, you will use Spark SQL to analyse the MovieLens dataset to provide movie recommendations. As part of this you will deploy Azure Data Factory, build data pipelines and visualise the analysis.

In this PySpark project, you will simulate a complex real-world data pipeline based on messaging. This project is deployed using the following tech stack - NiFi, PySpark, Hive, HDFS, Kafka, Airflow, Tableau and AWS QuickSight.

Use the aviation dataset to simulate a complex real-world big data analytics pipeline based on messaging, with AWS QuickSight, Druid, NiFi, Kafka, and Hive.

Curriculum For This Mini Project

Agenda for the Session
What is ordering?
Order By and Group By
Order By, Sort By, Distribute By and Cluster By
Sampling - Random and Bucket Sampling
Installing Hortonworks Sandbox and Ambari Overview
Block Sampling
Hive Execution Engines - Spark, Tez and MapReduce
Cross Join
Q & A
Recap of the Previous Session
Map Join and Reduce Join
Left Semi Join
Bucket Map Join
Sort Merge Bucket Map Join
Skew Join
Bitmap Indexes
File Types - Sequence, Avro, ORC, Parquet
Compression and Compression Codec
Interaction with Spark
Q & A
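The sampling and bucket-join sessions above can be sketched in HiveQL as follows. This is a minimal illustration with a hypothetical `orders_bucketed` table, not the course's exact code:

```sql
-- Hypothetical bucketed table; bucketing enables both bucket sampling
-- and bucket map joins on the clustering key.
CREATE TABLE orders_bucketed (order_id BIGINT, customer_id INT)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

-- Bucket sampling: read only 1 of the 32 buckets.
SELECT * FROM orders_bucketed
TABLESAMPLE (BUCKET 1 OUT OF 32 ON customer_id);

-- Block-level sampling: scan roughly 5 percent of the input data.
SELECT * FROM orders_bucketed TABLESAMPLE (5 PERCENT);

-- Bucket map join: applicable when both join sides are bucketed on the
-- join key and the bucket counts are compatible.
SET hive.optimize.bucketmapjoin = true;
```

Bucket sampling reads a deterministic subset of the data based on the hash of the bucketing column, whereas block-level sampling prunes input splits, which is why the two can give quite different statistical properties.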