Explain the withColumnRenamed function in PySpark in Databricks

This recipe explains the withColumnRenamed() function in PySpark in Databricks.

Recipe Objective - Explain the withColumnRenamed() function in PySpark in Databricks

In PySpark, the withColumnRenamed() function is widely used to rename a single column or multiple columns of a PySpark DataFrame. Because DataFrames are immutable collections, a column cannot be renamed or updated in place; instead, withColumnRenamed() creates a new DataFrame with the updated column names.

Resilient Distributed Datasets (RDDs) are the fundamental data structure of Apache Spark, developed by The Apache Software Foundation. An RDD is an immutable distributed collection of objects in which each dataset is divided into logical partitions that may be computed on different nodes of the cluster. The RDD concept was introduced in 2011. The Dataset, introduced in 2015, is a strongly typed data structure in Spark SQL that maps to a relational schema. It represents structured queries with encoders, is an extension of the DataFrame API, and provides both type safety and an object-oriented programming interface.

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains the withColumnRenamed() function and demonstrates its usage in PySpark.

Implementing the withColumnRenamed() function in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

The SparkSession, StructType, StructField, StringType, IntegerType, and the SQL functions are imported into the environment so that the withColumnRenamed() function can be used in PySpark.

# Implementing the withColumnRenamed() function in Databricks in PySpark
spark = SparkSession.builder.appName('withColumnRenamed PySpark').getOrCreate()
sample_dataDataframe = [
    (('Ram', '', 'Aggarwal'), '1994-06-02', 'M', 4000),
    (('Shyam', 'Gupta', ''), '2002-07-21', 'M', 5000),
    (('Amit', '', 'Jain'), '1988-07-02', 'M', 5000),
    (('Pooja', 'Rahul', 'Kumar'), '1977-09-02', 'F', 5000),
    (('Sunita', 'Kumari', 'Kapoor'), '1990-04-18', 'F', -2)
]
sample_schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('dob', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True)
])
dataframe = spark.createDataFrame(data = sample_dataDataframe, schema = sample_schema)
dataframe.printSchema()

# Using the withColumnRenamed() function on a single column
dataframe.withColumnRenamed("dob", "Date_Of_Birth").printSchema()

# Using the withColumnRenamed() function on multiple columns
dataframe2 = dataframe.withColumnRenamed("dob", "Date_Of_Birth") \
    .withColumnRenamed("salary", "salaryAmount")
dataframe2.printSchema()

The Spark Session is defined, along with the "sample_dataDataframe" and "sample_schema". The DataFrame "dataframe" is created from sample_dataDataframe and sample_schema. The withColumnRenamed() function returns a new DataFrame and does not modify the current DataFrame; here it renames the column "dob" to "Date_Of_Birth" on the PySpark DataFrame. The DataFrame "dataframe2" is then created by applying withColumnRenamed() to both the "dob" and "salary" columns.

