Tough Engineering Choices with Large Datasets in Hive - Part 1

Explore efficient Hive usage in this Hadoop Hive project by working with various file formats such as JSON, CSV, ORC, and Avro, and comparing their relative performance


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

  • Common misuses and abuses of Hive

  • How to use and interpret Hive's EXPLAIN command

  • File formats and their relative performance (Text, JSON, SequenceFile, Avro, ORC, and Parquet)

  • Compression and codecs

  • Spark and Hive for transformations

  • Choosing between Hive and Impala

  • Execution engines and performance
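As a taste of the EXPLAIN topic above, here is a minimal sketch; the `flights` table and its columns are hypothetical stand-ins for the airline dataset used later in the project.

```sql
-- Show the plan Hive builds for a simple aggregation.
-- `flights` is a hypothetical table name used for illustration.
EXPLAIN
SELECT carrier, COUNT(*) AS num_flights
FROM flights
GROUP BY carrier;

-- EXTENDED adds lower-level detail such as file paths and SerDes.
EXPLAIN EXTENDED
SELECT carrier, COUNT(*) AS num_flights
FROM flights
GROUP BY carrier;
```

The output lists the stages the query compiles into and their dependencies; reading it is the quickest way to see, for example, whether a query will trigger a full table scan.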

Project Description

The use of Hive or the Hive metastore is so ubiquitous in big data engineering that using the tool efficiently is a factor in the success of many big data projects. Whether integrating with Spark or using Hive as an ETL tool, many big data projects fail or succeed as they grow in scale and complexity because of decisions made early in the lifecycle of the analytics project.

In this Hive project, we will explore how to use Hive efficiently. This big data project follows an exploratory pattern rather than a project-building pattern: the goal of these sessions is to explore Hive in uncommon ways on the path to mastery.

Across this series of real-time Hive projects, we will use different sample datasets and explore different Hadoop file formats such as text, CSV, JSON, ORC, Parquet, Avro, and SequenceFile. We will look at compression and the various codecs, and examine the performance of each format when integrating with either Spark or Impala. The idea of this Hadoop Hive project is to explore enough that we can make a reasonable argument about what to do, or not to do, in any given scenario.
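To make the file-format comparison concrete, here is a sketch of the kind of DDL involved. The table and column names are hypothetical, and `CREATE TABLE ... AS SELECT` copies the same rows into each format so like-for-like size and timing comparisons are possible.

```sql
-- A raw staging table in delimited text (hypothetical schema).
CREATE TABLE flights_text (
  flight_date STRING,
  carrier     STRING,
  origin      STRING,
  dest        STRING,
  dep_delay   INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Copy the same data into ORC with Snappy compression...
CREATE TABLE flights_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY')
AS SELECT * FROM flights_text;

-- ...and into Parquet, for a like-for-like comparison.
CREATE TABLE flights_parquet
STORED AS PARQUET
AS SELECT * FROM flights_text;
```

With the same rows stored three ways, `DESCRIBE FORMATTED` and the warehouse directory sizes reveal the compression gains, and running the same query against each table exposes the scan-time differences.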

Similar Projects

Big Data Hadoop Project - Visualize Daily Wikipedia Trends
In this big data project, we'll work with Apache Airflow and write a scheduled workflow that downloads data from the Wikipedia archives, uploads it to S3, processes it in Hive, and finally analyzes it on Zeppelin notebooks.
Spark Project - Real-Time Data Collection and Spark Streaming Aggregation
In this big data project, we will embark on real-time data collection and aggregation from a simulated real-time system using Spark Streaming.
Hive Project - Visualizing Website Clickstream Data with Apache Hadoop
Analyze the clickstream data of a website using Hadoop Hive to increase sales by optimizing every aspect of the customer experience, from the first mouse click to the last.
Design a Network Crawler by Mining GitHub Social Profiles
In this big data project, we will look at how to mine and make sense of connections in a simple way by building a Spark GraphX algorithm and a network crawler.

Curriculum For This Mini Project

  Overview of the Project
  Datasets used for the Project
  Downloading the IBM Analytics DemoCloud
  Logging in to the IBM Analytics DemoCloud
  Downloading the Airline On-Time Performance Dataset
  Introduction to Hive
  General Discussion on the Purpose of the Project
  Agenda for the Project
  Star Schema
  Run Scripts to Create Database
  Data Exploration
  Data Analysis
  Why is Hive still the Swiss Army Knife of Big Data?
  Data Analysis Continuation
  Quick Recap of the Previous Session
  Using Hive Integration to Read Data - Hive Metastore
  Partitioning Using HCatalog
  Partitioning - Altering, Dropping, and Moving Partitions (Notes)
  Explain and Statistics
  Different Types of Explain
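The partitioning sessions in the curriculum revolve around DDL of roughly this shape. The `flights_part` schema is again a hypothetical stand-in, and the HDFS path in `SET LOCATION` is purely illustrative.

```sql
-- A partitioned ORC table (hypothetical schema).
CREATE TABLE flights_part (
  carrier   STRING,
  origin    STRING,
  dep_delay INT
)
PARTITIONED BY (year INT, month INT)
STORED AS ORC;

-- Add partitions explicitly and inspect them.
ALTER TABLE flights_part ADD PARTITION (year = 2008, month = 1);
ALTER TABLE flights_part ADD PARTITION (year = 2008, month = 2);
SHOW PARTITIONS flights_part;

-- Point an existing partition at a new directory (illustrative path).
ALTER TABLE flights_part PARTITION (year = 2008, month = 2)
SET LOCATION 'hdfs:///data/flights/2008/02';

-- Drop a partition (removes metadata and, for managed tables, the data).
ALTER TABLE flights_part DROP PARTITION (year = 2008, month = 1);
```

Pruning on the `year`/`month` partition columns is what lets Hive skip whole directories at query time, which is the payoff the partitioning sessions build toward.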