Explain the withColumn function in PySpark in Databricks

This recipe explains the withColumn() function in PySpark in Databricks.

Recipe Objective - Explain the withColumn() function in PySpark in Databricks

In PySpark, withColumn() is a widely used transformation function of the DataFrame. It can change the value of an existing column, convert the data type of an existing column, or create a new column. Casting, i.e. changing the data type of a column, is done by combining withColumn() with the cast() function. To change the value of an existing column, the existing column name is passed as the first argument and the value to be assigned as the second argument; the second argument must be of the Column type. Passing a column name that does not yet exist in the DataFrame as the first argument creates a new column.

Resilient Distributed Datasets (RDDs) are the fundamental data structure of Apache PySpark, developed by The Apache Software Foundation. An RDD is an immutable distributed collection of objects; each dataset is divided into logical partitions, which may be computed on different nodes of the cluster. The RDD concept was launched in 2011. The Dataset is a data structure in Spark SQL that is strongly typed and maps to a relational schema. It represents structured queries with encoders and is an extension of the DataFrame API, providing both type safety and an object-oriented programming interface. The Dataset concept was launched in 2015.

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the withColumn() function is and demonstrates its usage in PySpark.


Implementing the withColumn() function in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

The SparkSession, StructType, StructField, StringType, IntegerType, col, and lit are imported into the environment to use the withColumn() function in PySpark.

# Implementing the withColumn() function in Databricks in PySpark
spark = SparkSession.builder.appName('withColumn() PySpark').getOrCreate()
sample_data = [('Ram','','Aggarwal','1981-06-02','M',4000),
('Shyam','Gupta','','2005-04-02','M',5000),
('Amit','','Jain','1988-07-02','M',5000),
('Pooja','Raju','Bansal','1977-08-03','F',5000),
('Mary','Yadav','Brown','1970-04-15','F',-2)
]
sample_columns = ["firstname","middlename","lastname","dob","gender","salary"]
dataframe = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
# Changing datatype using withColumn() function
dataframe2 = dataframe.withColumn("salary",col("salary").cast("Integer"))
dataframe2.printSchema()
dataframe2.show(truncate=False)
# Updating value of an existing column
dataframe3 = dataframe.withColumn("salary",col("salary")*100)
dataframe3.printSchema()
dataframe3.show(truncate=False)
# Creating new column using existing one
dataframe4 = dataframe.withColumn("Copied_Column",col("salary")* -1)
dataframe4.printSchema()
# Adding new column using withColumn() function
dataframe5 = dataframe.withColumn("Country", lit("USA"))
dataframe5.printSchema()
dataframe6 = dataframe.withColumn("Country", lit("USA")) \
.withColumn("anotherColumn",lit("anotherValue"))
dataframe6.printSchema()
# Renaming a column
dataframe.withColumnRenamed("gender","sex") \
.show(truncate=False)
# Dropping a column from PySpark DataFrame
dataframe4.drop("Copied_Column") \
.show(truncate=False)

The Spark Session is created, and "sample_data" and "sample_columns" are defined. The DataFrame "dataframe" is created from sample_data and sample_columns. Using the withColumn() function, the "salary" column is cast to the Integer data type, and its value is updated by multiplying it by 100. Further, the "Copied_Column" column is created from the "salary" column using the withColumn() function. The "gender" column is renamed to "sex" using the withColumnRenamed() function. Finally, the newly created "Copied_Column" column is dropped using the drop() function.

