Explain the unionByName function in Spark in Databricks

This recipe explains the unionByName() function in Spark in Databricks.

Recipe Objective - Explain the unionByName() function in Spark in Databricks

In Spark, unionByName() is a widely used transformation for merging or unioning two DataFrames that have different numbers of columns (different schemas), by passing allowMissingColumns = true. The important difference between unionByName() and union() is that unionByName() resolves columns by name rather than by position. In other words, unionByName() merges two DataFrames by matching column names instead of column order.

Apache Spark Resilient Distributed Dataset (RDD) transformations are operations that, when executed on an RDD, produce one or more new RDDs. Because RDDs are immutable, a transformation always creates a new RDD instead of updating an existing one, which builds up an RDD lineage. RDD lineage is also known as the RDD operator graph or RDD dependency graph. Transformations are lazy operations: none of them executes until the user calls an action.
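To illustrate the name-based resolution, here is a minimal sketch (the object name and sample values are illustrative, assuming a local Spark 3.x session):

```scala
import org.apache.spark.sql.SparkSession

object NameVsPosition extends App {
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("unionByName vs union")
    .getOrCreate()
  import spark.implicits._

  // Two DataFrames with the same columns, but in a different order
  val df1 = Seq(("Ram", 44)).toDF("name", "age")
  val df2 = Seq((46, "Shyam")).toDF("age", "name")

  // union() matches columns by position, so name and age would be mixed up;
  // unionByName() matches by column name and keeps each value in the right column
  df1.unionByName(df2).show()

  spark.stop()
}
```

With union() the second row would put 46 under name; unionByName() keeps both rows consistent.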

System Requirements

 

  • Scala (2.12 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the unionByName() function is and demonstrates its usage in Spark.

Implementing the unionByName() function in Databricks in Spark

// Importing packages
import org.apache.spark.sql._
import spark.implicits._
import org.apache.spark.sql.functions.{col,lit}

The SparkSession, the implicit conversions, and the col and lit functions are imported into the environment to use the unionByName() function in Spark.

// Implementing the unionByName() function in Databricks in Spark
object unionByName extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("Spark unionByName() function")
    .getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")
  import spark.implicits._
  // Merge two DataFrames with different columns
  // Create DataFrame dataframe1 with columns name, dept & age
  val sample_data = Seq(("Ram","Sales",44), ("Shyam","Sales",46),
    ("Amit","Sales",40), ("Rahul","Finance",34))
  val dataframe1 = sample_data.toDF("name","dept","age")
  dataframe1.printSchema()
  // Create DataFrame dataframe2 with columns name, dept, state & salary
  val sample_data2 = Seq(("Ram","Sales","RJ",8000), ("Shyam","Finance","DL",8000),
    ("Amit","Finance","RJ",8900), ("Rahul","Marketing","DL",9000))
  val dataframe2 = sample_data2.toDF("name","dept","state","salary")
  dataframe2.printSchema()
  // Build the union of both column sets
  val merged_columns = dataframe1.columns.toSet ++ dataframe2.columns.toSet
  // For each merged column, keep the existing column or add a null placeholder
  def Get_NewColumns(column: Set[String], merged_columns: Set[String]) = {
    merged_columns.toList.map {
      case x if column.contains(x) => col(x)
      case x => lit(null).as(x)
    }
  }
  val newdataframe1 = dataframe1.select(Get_NewColumns(dataframe1.columns.toSet, merged_columns):_*)
  val newdataframe2 = dataframe2.select(Get_NewColumns(dataframe2.columns.toSet, merged_columns):_*)
  // Using unionByName() function
  val merged_dataframe = newdataframe1.unionByName(newdataframe2)
  merged_dataframe.show()
}


The Spark session is defined. The "sample_data" is defined, and DataFrame "dataframe1" is created from it; it is missing the state and salary columns. The "sample_data2" is defined, and "dataframe2" is created from it; it is missing the age column. The schemas of dataframe1 and dataframe2 are printed using the printSchema() function. New DataFrames are then created from the existing ones by adding the missing columns, filled with null values, as constant columns using the lit() function. Finally, the two new DataFrames are merged by column name using unionByName().
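Assuming Spark 3.1 or later, the same result can be obtained without manually padding null columns, by passing allowMissingColumns = true to unionByName(). A minimal sketch with illustrative sample rows:

```scala
import org.apache.spark.sql.SparkSession

object UnionByNameShortcut extends App {
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("unionByName allowMissingColumns demo")
    .getOrCreate()
  import spark.implicits._

  val dataframe1 = Seq(("Ram", "Sales", 44)).toDF("name", "dept", "age")
  val dataframe2 = Seq(("Shyam", "Finance", "DL", 8000)).toDF("name", "dept", "state", "salary")

  // Spark 3.1+: columns missing on either side are filled with nulls automatically
  val merged = dataframe1.unionByName(dataframe2, allowMissingColumns = true)
  merged.show()

  spark.stop()
}
```

This replaces the manual column-set merge and lit(null) padding shown above in a single call.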

