How to perform Joins in hive ?
BIG DATA RECIPES DATA CLEANING PYTHON DATA MUNGING MACHINE LEARNING RECIPES PANDAS CHEATSHEET     ALL TAGS

How to perform Joins in hive ?

How to perform Joins in hive ?

This recipe helps you perform Joins in hive

0

Recipe Objective

In most of the big data scenarios , Hive is used for the processing of the different types of structured and semi-structured data. Joins are used to combine records from two or more tables in the hive database. A JOIN is a means for combining fields from two tables by using values common to each. Joins are the most important feature in data engineering SQL based analytics

System requirements :

Step1 : Prepare the dataset

The dataset used here is authors and books related datasets for creating the hive tables in local to perform the different types of joins.

Use linux command to check data as follows:

head -5 /home/bigdata/Downloads/list_authors.txt head -5 /home/bigdata/Downloads/nameofbooks.txt

Data of Output looks as follows:

Before create a table open the hive shell , we need to create a database as follows Open the hive shell

To create database :

Create database dezyre_db; use dezyre_db;

Step 2 : Create a Hive Tables and Load the data into the tables and verify the data

Here we are going create a hive tables as follows

CREATE TABLE authors( id int, author_name string , country string ) ROW FORMAT DELIMITED fields terminated by ',' TBLPROPERTIES ("skip.header.line.count"="1");

Using below query we are going to load data into table as follow:

load data local inpath '/home/bigdata/Downloads/list_authors.txt' into table authors;

To verify the data :

select * from authors limit 3;

Create another table, load and verify the data

CREATE TABLE books( id int, book_title string ) ROW FORMAT DELIMITED fields terminated by ',' TBLPROPERTIES ("skip.header.line.count"="1");

Using below query we are going to load data into table as follow:

load data local inpath '/home/bigdata/Downloads/namesofbooks.txt' into table books;

To verify the data use below query:

select * from books limit 3;

Step 3 : To perform different joins on the tables

INNER JOIN : The inner join will give the all the matched records from left and right tables if there is a match between the columns

To perform the inner join operation by using the following below command :

select a1.author_name, b1.book_title from authors a1 join books b1 on a1.id= b1.id;

Query runs as follows and the run below output:

LEFT OUTER JOIN : Left outer join will give you the all the values which are matched from the right table and left table all values

To perform the left outer join operation by using the following command: -

select a1.author_name, b1.book_title from authors a1 left outer join books b1 on a1.id= b1.id;

Query runs as follows and the run below output:

RIGHT OUTER JOIN : The right outer join will give you all the values which are matched from the left table and all values from the right table.

To perform the right outer join operation by using the following command: -

select a1.author_name, b1.book_title from authors a1 right outer join books b1 on a1.id= b1.id;

Query runs as follows and the run below output:

FULL OUTER JOIN : The full outer join will give all records from when there is a match in left or right.

To perform the full outer join operation by using the following command: -

select a1.author_name, b1.book_title from authors a1 full outer join books b1 on a1.id= b1.id;

Query runs as follows and the run below output:

Relevant Projects

Web Server Log Processing using Hadoop
In this hadoop project, you will be using a sample application log file from an application server to a demonstrated scaled-down server log processing pipeline.

Finding Unique URL's using Hadoop Hive
Hive Project -Learn to write a Hive program to find the first unique URL, given 'n' number of URL's.

AWS Project - Build an ETL Data Pipeline on AWS EMR Cluster
Build a fully working scalable, reliable and secure AWS EMR complex data pipeline from scratch that provides support for all data stages from data collection to data analysis and visualization.

Explore features of Spark SQL in practice on Spark 2.0
The goal of this spark project for students is to explore the features of Spark SQL in practice on the latest version of Spark i.e. Spark 2.0.

Implementing Slow Changing Dimensions in a Data Warehouse using Hive and Spark
Hive Project- Understand the various types of SCDs and implement these slowly changing dimesnsion in Hadoop Hive and Spark.

Online Hadoop Projects -Solving small file problem in Hadoop
In this hadoop project, we are going to be continuing the series on data engineering by discussing and implementing various ways to solve the hadoop small file problem.

Yelp Data Processing Using Spark And Hive Part 1
In this big data project, we will continue from a previous hive project "Data engineering on Yelp Datasets using Hadoop tools" and do the entire data processing using spark.

Analysing Big Data with Twitter Sentiments using Spark Streaming
In this big data spark project, we will do Twitter sentiment analysis using spark streaming on the incoming streaming data.

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project-Get a handle on using Python with Spark through this hands-on data processing spark python tutorial.

Tough engineering choices with large datasets in Hive Part - 1
Explore hive usage efficiently in this hadoop hive project using various file formats such as JSON, CSV, ORC, AVRO and compare their relative performances