Explain PySpark Union() and UnionAll() Functions

This recipe will cover the essentials of PySpark Union() and UnionAll() functions, mastering these powerful tools for data consolidation and analysis. | ProjectPro
Last Updated: 06 Feb 2024

Get access to Big Data projects View all Big Data projects

PYSPARK PROJECTS DATA CLEANING PYTHON DATA MUNGING MACHINE LEARNING RECIPES PANDAS CHEATSHEET ALL TAGS

Recipe Objective - Explain PySpark Union() and UnionAll() functions

In PySpark, the Dataframe union() function/method of the DataFrame is widely used and is defined as the function used to merge two DataFrames of the same structure or the schema. In the union() function, if the schemas are not the same then it returns an error. The DataFrame unionAll() function or the method of the data frame is widely used and is deprecated since the Spark ``2.0.0” version and is further replaced with union(). The PySpark union() and unionAll() transformations are being used to merge the two or more DataFrame’s of the same schema or the structure. The union() function eliminates the duplicates but unionAll() function merges the /two datasets including the duplicate records in other SQL languages. The Apache PySpark Resilient Distributed Dataset(RDD) Transformations are defined as the spark operations that when executed on the Resilient Distributed Datasets(RDD), further results in the single or the multiple new defined RDD’s. As the RDD mostly are immutable, the transformations always create the new RDD without updating an existing RDD, which results in the creation of an RDD lineage. RDD Lineage is defined as the RDD operator graph or the RDD dependency graph. RDD Transformations are also defined as lazy operations in which none of the transformations get executed until an action is called from the user.

Data Ingestion with SQL using Google Cloud Dataflow

System Requirements

Python (3.0 version)
Apache Spark (3.1.1 version)

This recipe explains what the union() and unionAll() functions and explains their usage in PySpark.

PySpark Union() Function

The union() function in PySpark is used to combine the rows of two DataFrames with the same schema. Unlike unionAll(),union() performs a distinct operation on the DataFrames, removing any duplicate rows. This function is particularly useful when you want to merge datasets while ensuring unique records.

Syntax of PySpark Union()

DataFrame.union(other)

DataFrame: The DataFrame on which the union operation is performed.

other: The DataFrame to be combined with the first one.

PySpark UnionAll() Function

The unionAll()function in PySpark combines the rows of two DataFrames without eliminating duplicates. It's a straightforward concatenation of rows from both DataFrames.

Syntax of PySpark UnionAll()

DataFrame.unionAll(other)

DataFrame: The DataFrame on which the union operation is performed.

other: The DataFrame to be combined with the first one.

PySpark Project to Learn Advanced DataFrame Concepts

How to Implement the PySpark union() and unionAll() functions in Databricks?

Let’s now understand the implementation and usage of union and union all in PySpark with the following practical example -

# Importing packages

import pyspark

from pyspark.sql import SparkSession

Importing packages

The Sparksession is imported in the environment so as to use union() and unionAll() functions in the PySpark .

# Implementing the union() and unionAll() functions in Databricks in PySpark

spark = SparkSession.builder.appName('Select Column PySpark').getOrCreate()

sample_Data = [("Ram","Sales","DL",100000,44,20000), \

("Shyam","Sales","DL",96000,46,30000), \

("Amit","Sales","RJ",91000,40,33000), \

("Ghanshyam","Finance","KA",80000,34,33000) \

]

sample_columns= ["employee_name","department","state","salary","age","bonus"]

dataframe = spark.createDataFrame(data = sample_Data, schema = sample_columns)

dataframe.printSchema()

dataframe.show(truncate=False)

sample_Data2 = [("Ram","Sales","DL",110000,44,20000), \
("Ghanshyam","Finance","KA",80000,34,43000), \
("Pooja","Finance","DL",89000,43,25000), \

("Gauri","Marketing","KA",90000,35,28000), \

("Payal","Marketing","DL",81000,40,11000) \

]

sample_columns2= ["employee_name","department","state","salary","age","bonus"]

dataframe2 = spark.createDataFrame(data = sample_Data2, schema = sample_columns2)

dataframe2.printSchema()

dataframe2.show(truncate=False)

# Using union() function

union_DF = dataframe.union(dataframe2)

union_DF.show(truncate=False)

dis_DF = dataframe.union(dataframe2).distinct()

dis_DF.show(truncate=False)

# Using unionAll() function

unionAll_DF = dataframe.unionAll(dataframe2)

unionAll_DF.show(truncate=False)

Importing packages

How to Implement PySpark Union and Union all

Output Image

Output Image of PySpark Union

Output Image

The Spark Session is defined. The "sample_Data" and "sample_columns'' are defined. Further, the DataFrame ``data frame" is defined using data and columns. The "sample_data2" and "sample_columns2" are defined. Further, the "dataframe2" is defined using data and columns. The dataframe1 and dataframe2 are merged using the union() function. The distinct keyword is used to return just one record when a duplicate exists using the distinct() function.

Use Cases: Union and Union All in PySpark

Union(): Use this when you want to merge datasets and ensure unique records.

UnionAll(): Use this when you need a straightforward concatenation of rows, including duplicates.

Master the Practical Implementation of PySpark Functions with ProjectPro!

PySpark union() and unionAll() functions open up a world of possibilities for seamless data manipulation. Through this exploration, we've gained a comprehensive understanding of when to deploy these functions based on distinct use cases. However, transitioning from theoretical knowledge to practical expertise is where true mastery is achieved. This is where ProjectPro, a one-stop platform for data science and big data projects, comes into play. With a repository boasting over 270+ projects, ProjectPro provides the hands-on experience necessary to solidify your understanding of PySpark and its diverse functionalities.

Download Materials

Databricks_1

Databricks_2

Databricks_3

Databricks_4

Databricks_5

Databricks_6

Databricks_7

What Users are saying..

Ed Godalle

Director Data Analytics at EY / EY Tech

I am the Director of Data Analytics with over 10+ years of IT experience. I have a background in SQL, Python, and Big Data working with Accenture, IBM, and Infosys. I am looking to enhance my skills... Read More

Relevant Projects

Machine Learning Projects

Data Science Projects

Python Projects for Data Science

Data Science Projects in R

Machine Learning Projects for Beginners

Deep Learning Projects

Neural Network Projects

Tensorflow Projects

NLP Projects

Kaggle Projects

IoT Projects

Big Data Projects

Hadoop Real-Time Projects Examples

Spark Projects

Data Analytics Projects for Students

Relevant Projects

Yelp Data Processing using Spark and Hive Part 2

In this spark project, we will continue building the data warehouse from the previous project Yelp Data Processing Using Spark And Hive Part 1 and will do further data processing to develop diverse data products.

View Project Details

Build a Data Pipeline in AWS using NiFi, Spark, and ELK Stack

In this AWS Project, you will learn how to build a data pipeline Apache NiFi, Apache Spark, AWS S3, Amazon EMR cluster, Amazon OpenSearch, Logstash and Kibana.

View Project Details

Hadoop Project-Analysis of Yelp Dataset using Hadoop Hive

The goal of this hadoop project is to apply some data engineering principles to Yelp Dataset in the areas of processing, storage, and retrieval.

View Project Details

Talend Real-Time Project for ETL Process Automation

In this Talend Project, you will learn how to build an ETL pipeline in Talend Open Studio to automate the process of File Loading and Processing.

View Project Details

Build a Real-Time Spark Streaming Pipeline on AWS using Scala

In this Spark Streaming project, you will build a real-time spark streaming pipeline on AWS using Scala and Python.

View Project Details

dbt Snowflake Project to Master dbt Fundamentals in Snowflake

DBT Snowflake Project to Master the Fundamentals of DBT and learn how it can be used to build efficient and robust data pipelines with Snowflake.

View Project Details

Build an ETL Pipeline with Talend for Export of Data from Cloud

In this Talend ETL Project, you will build an ETL pipeline using Talend to export employee data from the Snowflake database and investor data from the Azure database, combine them using a Loop-in mechanism, filter the data for each sales representative, and export the result as a CSV file.

View Project Details

Explain PySpark Union() and UnionAll() Functions

Recipe Objective - Explain PySpark Union() and UnionAll() functions

System Requirements

PySpark Union() Function

Syntax of PySpark Union()

PySpark UnionAll() Function

Syntax of PySpark UnionAll()

How to Implement the PySpark union() and unionAll() functions in Databricks?

Use Cases: Union and Union All in PySpark

Master the Practical Implementation of PySpark Functions with ProjectPro!

Ed Godalle

Relevant Projects

You might also like

Relevant Projects