Explain the withColumnRenamed function in PySpark in Databricks

This recipe explains what the withColumnRenamed() function does in PySpark in Databricks.

Recipe Objective - Explain the withColumnRenamed() function in PySpark in Databricks

In PySpark, the withColumnRenamed() function is widely used to rename one or more columns of a DataFrame. Because DataFrames are immutable collections, a column cannot be renamed or updated in place; instead, withColumnRenamed() returns a new DataFrame with the updated column names. Resilient Distributed Datasets (RDDs) are the fundamental data structure of Apache Spark, developed by the Apache Software Foundation. An RDD is an immutable distributed collection of objects in which each dataset is divided into logical partitions that may be computed on different nodes of the cluster. The RDD concept was introduced in 2011. The Dataset is a strongly typed data structure in Spark SQL that maps to a relational schema. It represents structured queries with encoders and is an extension of the DataFrame API, providing both type safety and an object-oriented programming interface. The Dataset concept was introduced in 2015.
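As a minimal sketch of this behavior (the DataFrame and column names here are hypothetical examples, not part of the recipe below), renaming a column leaves the original DataFrame untouched and only the returned DataFrame carries the new name:

# A minimal sketch, assuming a local SparkSession; the data and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('withColumnRenamedSketch').getOrCreate()
df = spark.createDataFrame([(1, "1994-06-02"), (2, "2002-07-21")], ["id", "dob"])

# withColumnRenamed() returns a new DataFrame; the original DataFrame is unchanged
renamed_df = df.withColumnRenamed("dob", "date_of_birth")

print(df.columns)          # ['id', 'dob'] -- original column names remain
print(renamed_df.columns)  # ['id', 'date_of_birth'] -- only the returned DataFrame is renamed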


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the withColumnRenamed() function is and demonstrates its usage in PySpark.

Implementing the withColumnRenamed() function in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

The SparkSession, StructType, StructField, StringType, IntegerType and all SQL functions are imported into the environment so that the withColumnRenamed() function can be used in PySpark.

# Implementing the withColumnRenamed() function in Databricks in PySpark
spark = SparkSession.builder.appName('withColumnRenamed() PySpark').getOrCreate()
sample_dataDataframe = [(('Ram', '', 'Aggarwal'), '1994-06-02', 'M', 4000),
                        (('Shyam', 'Gupta', ''), '2002-07-21', 'M', 5000),
                        (('Amit', '', 'Jain'), '1988-07-02', 'M', 5000),
                        (('Pooja', 'Rahul', 'Kumar'), '1977-09-02', 'F', 5000),
                        (('Sunita', 'Kumari', 'Kapoor'), '1990-04-18', 'F', -2)
                       ]
sample_schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('dob', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True)
])
dataframe = spark.createDataFrame(data = sample_dataDataframe, schema = sample_schema)
dataframe.printSchema()
# Using the withColumnRenamed() function on a single column
dataframe.withColumnRenamed("dob", "Date_Of_Birth").printSchema()
# Using the withColumnRenamed() function on multiple columns
dataframe2 = dataframe.withColumnRenamed("dob", "Date_Of_Birth") \
    .withColumnRenamed("salary", "salaryAmount")
dataframe2.printSchema()

The SparkSession is created, and "sample_dataDataframe" and "sample_schema" are defined. The DataFrame "dataframe" is built from sample_dataDataframe and sample_schema. The withColumnRenamed() function returns a new DataFrame and does not modify the current one; here it renames the column "dob" to "Date_Of_Birth" on the PySpark DataFrame. The DataFrame "dataframe2" is then created by applying withColumnRenamed() to both the "dob" and "salary" columns.
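When many columns need new names, one convenient pattern (a minimal sketch; the rename_map dictionary below is a hypothetical example built on the recipe's "dataframe", not part of the original code) is to loop over a mapping of old names to new names and apply withColumnRenamed() repeatedly. Note that withColumnRenamed() is a no-op when the given column name does not exist in the schema, so a stray key in the mapping does not raise an error.

# A minimal sketch: renaming several columns from a mapping of old names to new names.
# The mapping below is a hypothetical example; it reuses the recipe's "dataframe".
rename_map = {"dob": "Date_Of_Birth", "salary": "salaryAmount"}
renamed_dataframe = dataframe
for old_name, new_name in rename_map.items():
    renamed_dataframe = renamed_dataframe.withColumnRenamed(old_name, new_name)
renamed_dataframe.printSchema()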


Relevant Projects

Streaming Data Pipeline using Spark, HBase and Phoenix
Build a Real-Time Streaming Data Pipeline for an application that monitors oil wells using Apache Spark, HBase and Apache Phoenix.

Build a real-time Streaming Data Pipeline using Flink and Kinesis
In this big data project on AWS, you will learn how to run an Apache Flink Python application for a real-time streaming platform using Amazon Kinesis.

Python and MongoDB Project for Beginners with Source Code-Part 1
In this Python and MongoDB Project, you will learn to do data analysis using PyMongo on a MongoDB Atlas Cluster.

Deploy an Application to Kubernetes in Google Cloud using GKE
In this Kubernetes Big Data Project, you will automate and deploy an application using Docker, Google Kubernetes Engine (GKE), and Google Cloud Functions.

Build a Spark Streaming Pipeline with Synapse and CosmosDB
In this Spark Streaming project, you will learn to build a robust and scalable spark streaming pipeline using Azure Synapse Analytics and Azure Cosmos DB and also gain expertise in window functions, joins, and logic apps for comprehensive real-time data analysis and processing.

AWS Project for Batch Processing with PySpark on AWS EMR
In this AWS Project, you will learn how to perform batch processing on Wikipedia data with PySpark on AWS EMR.

Build a big data pipeline with AWS Quicksight, Druid, and Hive
Use the dataset on aviation for analytics to simulate a complex real-world big data pipeline based on messaging with AWS Quicksight, Druid, NiFi, Kafka, and Hive.

Hive Mini Project to Build a Data Warehouse for e-Commerce
In this Hive project, you will design a data warehouse for an e-commerce application to perform Hive analytics on Sales and Customer Demographics data using big data tools such as Sqoop, Spark, and HDFS.

Build Classification and Clustering Models with PySpark and MLlib
In this PySpark Project, you will learn to implement pyspark classification and clustering model examples using Spark MLlib.

Create a Data Pipeline based on Messaging Using PySpark and Hive
In this PySpark project, you will simulate a complex real-world data pipeline based on messaging. This project is deployed using the following tech stack - NiFi, PySpark, Hive, HDFS, Kafka, Airflow, Tableau and AWS QuickSight.