Explain with examples how to perform subqueries, UNION, and UNION ALL in Hive

This recipe explains, with examples, how to perform subqueries, UNION, and UNION ALL in Hive.

Recipe Objective

In most big data scenarios, Hive is used as an ETL and data warehouse tool on top of the Hadoop ecosystem. It processes different types of structured and semi-structured data and provides data summarization, querying, and analysis. To query Hive you use a query language called HiveQL (Hive Query Language). Hive lets you store structured and semi-structured data; after you define the structure, you can use HiveQL to query the data.

System requirements: a machine with Hadoop and Hive installed.

Step 1: Prepare the dataset

Here we use the salesman and orders datasets, stored locally, to create Hive tables and perform subqueries on them.

Use the Linux head command to preview the data:

head -5 /home/bigdata/Downloads/salesman.txt
head -5 /home/bigdata/Downloads/orders.txt

The output shows the first few lines of each file.
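The original recipe shows this output as a screenshot. As a rough sketch only (illustrative values, not the actual dataset), each file is assumed to be comma-separated with a header row, which is why the CREATE TABLE statements below set skip.header.line.count:

salesman_id,name,city,commission
5001,Paul Adam,Rome,0.13

ord_no,purch_amt,ord_date,customer_id,salesman_id
70001,150.50,2012-10-05,3005,5001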

Before creating any tables, open the Hive shell and create a database as follows.

To create the database and switch to it:

create database dezyre_db;
use dezyre_db;

Step 2: Create the Hive tables, load the data, and verify it

Here we create the Hive tables as follows:

CREATE TABLE salesman (
    salesman_id int,
    name string,
    city string,
    commission float
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");

Use the query below to load data into the table:

load data local inpath '/home/bigdata/Downloads/salesman.txt' into table salesman;

To verify the data:

select * from salesman limit 3;

Create another table, then load and verify the data:

CREATE TABLE orders (
    ord_no int,
    purch_amt string,
    ord_date string,
    customer_id int,
    salesman_id int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");

Use the query below to load data into the table:

load data local inpath '/home/bigdata/Downloads/orders.txt' into table orders;

To verify the data, use the query below:

select * from orders limit 3;
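Note that purch_amt is declared as a string in this recipe. If you later need to treat it numerically, one option (an addition to the recipe, not part of the original) is to cast it at query time:

-- added sketch: total purchase amount per salesman, casting purch_amt from string
SELECT salesman_id, SUM(CAST(purch_amt AS DECIMAL(10,2))) AS total_purch_amt
FROM orders
GROUP BY salesman_id;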

Step 3: Perform a subquery

Hive supports subqueries in the FROM clause and in the WHERE clause of SQL statements. A Hive subquery is a SELECT expression enclosed in parentheses. In this scenario we display all the orders from the orders table that were issued by a particular salesman, as below.

SELECT * FROM orders
WHERE salesman_id = (SELECT salesman_id FROM salesman WHERE name = 'Paul Adam');

The above query launches a MapReduce job, as shown below:

Output of the above query:
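One caveat: the = comparison assumes the subquery returns exactly one row. If more than one salesman could match the name, a safer variant (a sketch added here, not part of the original recipe) is the IN operator, which Hive also supports in WHERE-clause subqueries:

-- added sketch: IN tolerates a subquery that returns multiple salesman_ids
SELECT * FROM orders
WHERE salesman_id IN (SELECT salesman_id FROM salesman WHERE name = 'Paul Adam');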

Step 4: Perform a UNION on two tables

The UNION operator combines the result sets of two or more SELECT statements. UNION eliminates duplicate rows from the combined result set, whereas UNION ALL returns all rows, duplicates included.

In this scenario we apply the UNION operator to the two Hive tables to get the salesman ids from both:

SELECT salesman_id FROM salesman
UNION
SELECT salesman_id FROM orders
ORDER BY salesman_id;

The above UNION set operation runs as below:

Output of the above query:
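Because UNION deduplicates, wrapping it in a subquery gives the number of distinct salesman ids across both tables (a sketch added here, assuming the tables from Step 2; not part of the original recipe):

-- added sketch: count of distinct salesman ids across both tables
SELECT COUNT(*) AS distinct_salesman_ids
FROM (
    SELECT salesman_id FROM salesman
    UNION
    SELECT salesman_id FROM orders
) ids;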

Step 5: Perform a UNION ALL on two tables

UNION ALL is faster than UNION because UNION must remove duplicates, which typically requires an extra sort or shuffle of the combined result set; the cost depends on the volume of data involved. Note that any ordering of the final result here comes from the ORDER BY clause, not from the operators themselves: UNION simply removes duplicates, while UNION ALL keeps every row.

In this scenario we apply the UNION ALL operator to the two Hive tables to get the salesman ids from both:

SELECT salesman_id FROM salesman
UNION ALL
SELECT salesman_id FROM orders
ORDER BY salesman_id;

The above UNION ALL set operation runs as below:

Output of the above query:
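To quantify how many duplicates UNION removed, count the rows UNION ALL produces and compare with the distinct count from Step 4 (again a sketch, assuming the tables from Step 2; not part of the original recipe):

-- added sketch: total salesman ids, duplicates included
SELECT COUNT(*) AS total_salesman_ids
FROM (
    SELECT salesman_id FROM salesman
    UNION ALL
    SELECT salesman_id FROM orders
) all_ids;

The difference between this count and the distinct count from Step 4 is the number of duplicate rows that UNION eliminated.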

This is a typical Big Data ETL visualization project implemented in AWS cloud using cloud native tools like Glue which is used to Spark jobs without maintaining cluster infrastructure, Step Functions which is used to schedule jobs based on dependency ,Redshift which is the ultimate petabyte scale data warehouse solution in AWS and Quicksight which is AWS managed Visualization tool to create business reports