Explain the unionByName function in PySpark in Databricks

This recipe explains what the unionByName() function does in PySpark in Databricks.

Recipe Objective - Explain the unionByName() function in PySpark in Databricks

In PySpark, the unionByName() function is widely used as a transformation to merge or union two DataFrames with different numbers of columns (different schemas), which is done by passing the allowMissingColumns parameter with the value True. The important difference between the unionByName() function and the union() function is that unionByName() resolves columns by name, not by position. In other words, unionByName() merges two DataFrames by their column names instead of by column order.

Apache PySpark Resilient Distributed Dataset (RDD) transformations are Spark operations that, when executed on an RDD, result in one or more new RDDs. Because RDDs are immutable, a transformation never updates an existing RDD; it always creates a new one, and this is what builds up an RDD lineage. RDD lineage is also known as the RDD operator graph or RDD dependency graph. RDD transformations are lazy operations, meaning that none of the transformations are executed until an action is called by the user.
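To see the difference in isolation, here is a minimal sketch (a hypothetical example, separate from the recipe below) that unions two DataFrames sharing the same columns in different orders: union() pairs columns by position and scrambles the data, while unionByName() matches them by name.

# Minimal sketch: union() resolves columns by position, unionByName() by name
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('union vs unionByName').getOrCreate()
df_a = spark.createDataFrame([("Ram", "Sales")], ["name", "dept"])
df_b = spark.createDataFrame([("Finance", "Shyam")], ["dept", "name"])
# union() pairs by position: "Finance" wrongly lands in the name column
df_a.union(df_b).show()
# unionByName() pairs by name: the rows line up correctly
df_a.unionByName(df_b).show()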



This recipe explains what the unionByName() function is and demonstrates its usage in PySpark.


Implementing the unionByName() function in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, lit

SparkSession, Row, col, and lit are imported into the environment to use the unionByName() function in PySpark.

# Implementing the unionByName() function in Databricks in PySpark
spark = SparkSession.builder.appName('unionByName() PySpark').getOrCreate()

# Creating dataframe1
sample_data = [("Ram", "Sales", 44), ("Shyam", "Sales", 46),
               ("Amit", "Sales", 40), ("Rahul", "Finance", 34)]
sample_columns = ["name", "dept", "age"]
dataframe1 = spark.createDataFrame(data=sample_data, schema=sample_columns)
dataframe1.printSchema()

# Creating dataframe2
sample_data2 = [("Ram", "Sales", "RJ", 8000), ("Shyam", "Finance", "DL", 8000),
                ("Amit", "Finance", "RJ", 8900), ("Rahul", "Marketing", "DL", 9000)]
sample_columns2 = ["name", "dept", "state", "salary"]
dataframe2 = spark.createDataFrame(data=sample_data2, schema=sample_columns2)
dataframe2.printSchema()

# Adding the missing columns 'state' & 'salary' to dataframe1 as nulls
for column in [column for column in dataframe2.columns if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))

# Adding the missing column 'age' to dataframe2 as nulls
for column in [column for column in dataframe1.columns if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))

# Merging the two DataFrames dataframe1 & dataframe2 by column name
merged_dataframe = dataframe1.unionByName(dataframe2)
merged_dataframe.show()
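For reference, the merged DataFrame contains all five columns, with nulls filling in wherever a column was missing from the source DataFrame. The show() output should look roughly as follows (exact null rendering varies across Spark versions):

+-----+---------+----+-----+------+
| name|     dept| age|state|salary|
+-----+---------+----+-----+------+
|  Ram|    Sales|  44| null|  null|
|Shyam|    Sales|  46| null|  null|
| Amit|    Sales|  40| null|  null|
|Rahul|  Finance|  34| null|  null|
|  Ram|    Sales|null|   RJ|  8000|
|Shyam|  Finance|null|   DL|  8000|
| Amit|  Finance|null|   RJ|  8900|
|Rahul|Marketing|null|   DL|  9000|
+-----+---------+----+-----+------+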

The Spark Session is defined first. The "sample_data" is defined, and "dataframe1" is created from it; this DataFrame lacks the "state" and "salary" columns. The "sample_data2" is defined, and "dataframe2" is created from it; this DataFrame lacks the "age" column. The schemas of dataframe1 and dataframe2 are printed using the printSchema() function. The missing columns are then added to each DataFrame as null columns: "state" and "salary" to dataframe1, and "age" to dataframe2. Finally, the unionByName() function merges dataframe1 and dataframe2 by column name.
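On Spark 3.1 and later, the two lit(None) loops above can be skipped entirely by passing allowMissingColumns=True to unionByName(), which fills in the missing columns with nulls automatically. A minimal sketch, assuming dataframe1 and dataframe2 as originally created (before the null columns were added):

# Requires Spark 3.1+: missing columns on either side are filled with nulls
merged_dataframe = dataframe1.unionByName(dataframe2, allowMissingColumns=True)
merged_dataframe.show()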

