How to perform Joins in hive ?

This recipe helps you perform Joins in hive

Recipe Objective

In most of the big data scenarios , Hive is used for the processing of the different types of structured and semi-structured data. Joins are used to combine records from two or more tables in the hive database. A JOIN is a means for combining fields from two tables by using values common to each. Joins are the most important feature in data engineering SQL based analytics

ETL Orchestration on AWS using Glue and Step Functions

System requirements :

Step1 : Prepare the dataset

The dataset used here is authors and books related datasets for creating the hive tables in local to perform the different types of joins.

Use linux command to check data as follows:

head -5 /home/bigdata/Downloads/list_authors.txt head -5 /home/bigdata/Downloads/nameofbooks.txt

Data of Output looks as follows:

Before create a table open the hive shell , we need to create a database as follows Open the hive shell

To create database :

Create database dezyre_db; use dezyre_db;

Step 2 : Create a Hive Tables and Load the data into the tables and verify the data

Here we are going create a hive tables as follows

CREATE TABLE authors( id int, author_name string , country string ) ROW FORMAT DELIMITED fields terminated by ',' TBLPROPERTIES ("skip.header.line.count"="1");

Using below query we are going to load data into table as follow:

load data local inpath '/home/bigdata/Downloads/list_authors.txt' into table authors;

To verify the data :

select * from authors limit 3;

Create another table, load and verify the data

CREATE TABLE books( id int, book_title string ) ROW FORMAT DELIMITED fields terminated by ',' TBLPROPERTIES ("skip.header.line.count"="1");

Using below query we are going to load data into table as follow:

load data local inpath '/home/bigdata/Downloads/namesofbooks.txt' into table books;

To verify the data use below query:

select * from books limit 3;

Step 3 : To perform different joins on the tables

INNER JOIN : The inner join will give the all the matched records from left and right tables if there is a match between the columns

To perform the inner join operation by using the following below command :

select a1.author_name, b1.book_title from authors a1 join books b1 on a1.id= b1.id;

Query runs as follows and the run below output:

LEFT OUTER JOIN : Left outer join will give you the all the values which are matched from the right table and left table all values

To perform the left outer join operation by using the following command: -

select a1.author_name, b1.book_title from authors a1 left outer join books b1 on a1.id= b1.id;

Query runs as follows and the run below output:

RIGHT OUTER JOIN : The right outer join will give you all the values which are matched from the left table and all values from the right table.

To perform the right outer join operation by using the following command: -

select a1.author_name, b1.book_title from authors a1 right outer join books b1 on a1.id= b1.id;

Query runs as follows and the run below output:

FULL OUTER JOIN : The full outer join will give all records from when there is a match in left or right.

To perform the full outer join operation by using the following command: -

select a1.author_name, b1.book_title from authors a1 full outer join books b1 on a1.id= b1.id;

Query runs as follows and the run below output:

What Users are saying..

profile image

Jingwei Li

Graduate Research assistance at Stony Brook University
linkedin profile url

ProjectPro is an awesome platform that helps me learn much hands-on industrial experience with a step-by-step walkthrough of projects. There are two primary paths to learn: Data Science and Big Data.... Read More

Relevant Projects

Machine Learning project for Retail Price Optimization
In this machine learning pricing project, we implement a retail price optimization algorithm using regression trees. This is one of the first steps to building a dynamic pricing model.

Isolation Forest Model and LOF for Anomaly Detection in Python
Credit Card Fraud Detection Project - Build an Isolation Forest Model and Local Outlier Factor (LOF) in Python to identify fraudulent credit card transactions.

NLP Project on LDA Topic Modelling Python using RACE Dataset
Use the RACE dataset to extract a dominant topic from each document and perform LDA topic modeling in python.

Deploying Machine Learning Models with Flask for Beginners
In this MLOps on GCP project you will learn to deploy a sales forecasting ML Model using Flask.

NLP Project to Build a Resume Parser in Python using Spacy
Use the popular Spacy NLP python library for OCR and text classification to build a Resume Parser in Python.

Build a Text Generator Model using Amazon SageMaker
In this Deep Learning Project, you will train a Text Generator Model on Amazon Reviews Dataset using LSTM Algorithm in PyTorch and deploy it on Amazon SageMaker.

Azure Text Analytics for Medical Search Engine Deployment
Microsoft Azure Project - Use Azure text analytics cognitive service to deploy a machine learning model into Azure Databricks

Deploy Transformer BART Model for Text summarization on GCP
Learn to Deploy a Machine Learning Model for the Abstractive Text Summarization on Google Cloud Platform (GCP)

Learn How to Build PyTorch Neural Networks from Scratch
In this deep learning project, you will learn how to build PyTorch neural networks from scratch.

Build an Image Classifier for Plant Species Identification
In this machine learning project, we will use binary leaf images and extracted features, including shape, margin, and texture to accurately identify plant species using different benchmark classification techniques.