Explain the unionByName function in PySpark in Databricks

This recipe explains what the unionByName() function does in PySpark in Databricks.

Recipe Objective - Explain the unionByName() function in PySpark in Databricks

In PySpark, the unionByName() function is widely used as a transformation to merge or union two DataFrames with different numbers of columns (different schemas) by passing allowMissingColumns=True. The important difference between the unionByName() function and the union() function is that unionByName() resolves columns by name rather than by position. In other words, unionByName() merges two DataFrames by their column names instead of by their positions.

Apache PySpark Resilient Distributed Dataset (RDD) transformations are Spark operations that, when executed on an RDD, produce one or more new RDDs. Because RDDs are immutable, a transformation always creates a new RDD rather than updating an existing one, which results in the creation of an RDD lineage. RDD lineage is the RDD operator graph, or RDD dependency graph. RDD transformations are also lazy operations: none of the transformations are executed until an action is called by the user.
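For example, here is a minimal sketch (assuming Spark 3.1 or later, where the allowMissingColumns parameter is available) showing that unionByName() matches columns by name, not by position, and fills columns missing from either side with nulls:

# Minimal sketch: unionByName() resolves columns by name, and
# allowMissingColumns=True (Spark 3.1+) fills absent columns with nulls
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('unionByName sketch').getOrCreate()
df1 = spark.createDataFrame([(1, 2)], ["a", "b"])
df2 = spark.createDataFrame([(3, 4)], ["b", "c"])
df1.unionByName(df2, allowMissingColumns=True).show()
# Expected output has columns a, b, c, with nulls where a column was missing:
# |   1|  2|null|
# |null|  3|   4|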



This recipe explains what the unionByName() function is and demonstrates its usage in PySpark.


Implementing the unionByName() function in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, lit

SparkSession, Row, col, and lit are imported into the environment to use the unionByName() function in PySpark.

# Implementing the unionByName() function in Databricks in PySpark
spark = SparkSession.builder.appName('unionByName() PySpark').getOrCreate()
# Creating dataframe1
sample_data = [("Ram","Sales",44), ("Shyam","Sales",46),
("Amit","Sales",40), ("Rahul","Finance",34) ]
sample_columns= ["name","dept","age"]
dataframe1 = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe1.printSchema()
# Creating dataframe2
sample_data2=[("Ram","Sales","RJ",8000),("Shyam","Finance","DL",8000),
("Amit","Finance","RJ",8900),("Rahul","Marketing","DL",9000)]
sample_columns2= ["name","dept","state","salary"]
dataframe2 = spark.createDataFrame(data = sample_data2, schema = sample_columns2)
dataframe2.printSchema()
# Adding missing columns 'state' & 'salary' to dataframe1
for column in [column for column in dataframe2.columns if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))
# Adding missing column 'age' to dataframe2
for column in [column for column in dataframe1.columns if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))
# Merging the two DataFrames, dataframe1 & dataframe2, by column name
merged_dataframe = dataframe1.unionByName(dataframe2)
merged_dataframe.show()

The Spark session is created and "sample_data" is defined. The DataFrame "dataframe1" is then created from it and is missing the "state" and "salary" columns. Similarly, "sample_data2" is defined and used to create "dataframe2", which is missing the "age" column. The schemas of dataframe1 and dataframe2 are printed using the printSchema() function. The missing columns, "state" and "salary" in dataframe1 and "age" in dataframe2, are then added to the respective DataFrames as null values using lit(None). Finally, dataframe1 and dataframe2 are merged by column name using the unionByName() function.
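Note that on Spark 3.1 and later the lit(None) loops above are not strictly necessary: passing allowMissingColumns=True lets unionByName() fill the missing columns with nulls itself. A minimal sketch, assuming dataframe1 and dataframe2 still have their original three-column and four-column schemas:

# Alternative sketch (Spark 3.1+): unionByName() adds the missing
# 'state', 'salary' and 'age' columns as nulls automatically
merged_dataframe = dataframe1.unionByName(dataframe2, allowMissingColumns=True)
merged_dataframe.show()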


Relevant Projects

Log Analytics Project with Spark Streaming and Kafka
In this spark project, you will use the real-world production logs from NASA Kennedy Space Center WWW server in Florida to perform scalable log analytics with Apache Spark, Python, and Kafka.

Getting Started with Azure Purview for Data Governance
In this Microsoft Azure Purview Project, you will learn how to consume the ingested data and perform analysis to find insights.

Airline Dataset Analysis using Hadoop, Hive, Pig and Athena
In this Hadoop project, you will perform basic big data analysis on an airline dataset using big data tools: Pig, Hive, and Athena.

Big Data Project for Solving Small File Problem in Hadoop Spark
This big data project focuses on solving the small file problem to optimize data processing efficiency by leveraging Apache Hadoop and Spark within AWS EMR by implementing and demonstrating effective techniques for handling large numbers of small files.

PySpark ETL Project for Real-Time Data Processing
In this PySpark ETL project, you will learn to build a data pipeline and perform ETL operations for real-time data processing.

Build an ETL Pipeline on EMR using AWS CDK and Power BI
In this ETL project, you will learn to build an ETL pipeline on Amazon EMR with AWS CDK and Apache Hive. You'll deploy the pipeline using S3, Cloud9, and EMR, and then use Power BI to create dynamic visualizations of your transformed data.

GCP Project-Build Pipeline using Dataflow Apache Beam Python
In this GCP Project, you will learn to build a data pipeline using Apache Beam Python on Google Dataflow.

PySpark Project-Build a Data Pipeline using Hive and Cassandra
In this PySpark ETL project, you will learn to build a data pipeline and perform ETL operations by integrating PySpark with Hive and Cassandra.

Learn Efficient Multi-Source Data Processing with Talend ETL
In this Talend ETL project, you will create a multi-source ETL pipeline to load data from multiple sources such as MySQL Database, Azure Database, and API to Snowflake cloud using Talend Jobs.

Analyse Yelp Dataset with Spark & Parquet Format on Azure Databricks
In this Databricks Azure project, you will use Spark & Parquet file formats to analyse the Yelp reviews dataset. As part of this, you will deploy Azure Data Factory and data pipelines, and visualise the analysis.