How to perform a subquery, UNION, and UNION ALL in Hive, explained with examples

This recipe explains, with examples, how to perform a subquery, UNION, and UNION ALL in Hive.


Recipe Objective

In most big data scenarios, Hive is used as an ETL and data warehouse tool on top of the Hadoop ecosystem. It processes different types of structured and semi-structured data and provides data summarization, querying, and analysis. Queries are written in a language called HiveQL (Hive Query Language). Once you define a table's structure, you can use HiveQL to query the data.

System requirements:

Step 1: Prepare the dataset

Here we use salesman- and orders-related datasets to create Hive tables locally and perform subqueries on them.

Use the following Linux commands to inspect the data:

head -5 /home/bigdata/Downloads/salesman.txt
head -5 /home/bigdata/Downloads/orders.txt

The first rows of each file look as follows:

Before creating a table, open the Hive shell and create a database as follows.

To create the database and switch to it:

create database dezyre_db;
use dezyre_db;

Step 2: Create the Hive tables, load the data, and verify it

First, create the salesman table:

CREATE TABLE salesman(
    salesman_id int,
    name string,
    city string,
    commission float
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");

Using the query below, we load the data into the table:

load data local inpath '/home/bigdata/Downloads/salesman.txt' into table salesman;

To verify the data:

select * from salesman limit 3;

Create another table, then load and verify its data:

CREATE TABLE orders(
    ord_no int,
    purch_amt string,
    ord_date string,
    customer_id int,
    salesman_id int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");

Using the query below, we load the data into the table:

load data local inpath '/home/bigdata/Downloads/orders.txt' into table orders;

To verify the data, use the query below:

select * from orders limit 3;

Step 3: Perform a subquery

Hive supports subqueries in the FROM clause and, through the IN and EXISTS operators, in the WHERE clause. A Hive subquery is a SELECT expression enclosed in parentheses. In this scenario we display all the orders from the orders table that were issued by a particular salesman:

SELECT * FROM orders
WHERE salesman_id IN (SELECT salesman_id FROM salesman WHERE name = 'Paul Adam');
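As a rough sketch of what this subquery does, here is the same filter expressed in plain Python. The in-memory rows, IDs, and order numbers are hypothetical stand-ins for the two Hive tables, not values from the actual dataset:

```python
# Hypothetical rows standing in for the salesman and orders tables.
salesman = [
    {"salesman_id": 5001, "name": "James Hoog"},
    {"salesman_id": 5003, "name": "Paul Adam"},
]
orders = [
    {"ord_no": 70001, "salesman_id": 5001},
    {"ord_no": 70002, "salesman_id": 5003},
    {"ord_no": 70003, "salesman_id": 5003},
]

# Inner query: collect the IDs of salesmen named 'Paul Adam'.
ids = {s["salesman_id"] for s in salesman if s["name"] == "Paul Adam"}

# Outer query: keep only the orders whose salesman_id is in that set.
result = [o for o in orders if o["salesman_id"] in ids]
print(result)  # only the orders issued by Paul Adam
```

The inner SELECT is evaluated first to produce a set of IDs; the outer SELECT then keeps the rows whose salesman_id falls in that set.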

The above query runs a MapReduce job to produce the result:

Output of the above query:

Step 4: Perform a UNION on two tables

The UNION operator combines the result sets of two or more SELECT statements. UNION eliminates duplicate rows from the combined result, whereas UNION ALL returns all rows, including duplicates.

In this scenario we apply the UNION operator to the two Hive tables, retrieving the salesman IDs from each:

SELECT salesman_id FROM salesman
UNION
SELECT salesman_id FROM orders
ORDER BY salesman_id;
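The deduplicating behaviour of UNION can be sketched in plain Python. The two ID lists are hypothetical stand-ins for the salesman_id columns of the two tables:

```python
# Hypothetical salesman_id columns from the two tables.
salesman_ids = [5001, 5002, 5003]
order_ids = [5001, 5001, 5003, 5005]

# UNION: combine the two column sets, drop duplicates,
# then sort (mirroring the ORDER BY clause).
union_result = sorted(set(salesman_ids) | set(order_ids))
print(union_result)  # each distinct ID appears exactly once
```

Note that 5001 and 5003 appear in both inputs (and 5001 twice in orders), yet each ID occurs only once in the output.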

The above UNION query runs as below:

Output of the above query:

Step 5: Perform a UNION ALL on two tables

UNION ALL is typically faster than UNION: to remove duplicates, UNION must do extra work (effectively sorting or hashing the combined result set), and that cost grows with the volume of data. UNION ALL simply appends the result sets without removing duplicates, so its output can contain repeated rows; any ordering of the final result should be imposed explicitly with ORDER BY.

In this scenario we apply the UNION ALL operator to the two Hive tables, retrieving the salesman IDs from each:

SELECT salesman_id FROM salesman
UNION ALL
SELECT salesman_id FROM orders
ORDER BY salesman_id;
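For contrast with UNION, here is the UNION ALL behaviour sketched in plain Python on the same hypothetical ID lists:

```python
# Hypothetical salesman_id columns from the two tables.
salesman_ids = [5001, 5002, 5003]
order_ids = [5001, 5001, 5003, 5005]

# UNION ALL: plain concatenation -- duplicates are kept.
# The sort mirrors the ORDER BY clause.
union_all_result = sorted(salesman_ids + order_ids)
print(union_all_result)  # 7 rows: 5001 three times, 5003 twice
```

The output has one row per input row (7 in total), so IDs that occur in both tables, such as 5001 and 5003, are repeated.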

The above UNION ALL query runs as below:

Output of the above query:
