Explain the overlay function in PySpark in Databricks

This recipe explains what the overlay function does in PySpark in Databricks.

Recipe Objective - Explain the overlay() function in PySpark in Databricks

The overlay() function in Apache PySpark replaces part of the input with the value of replace, starting at position pos and spanning len characters. It returns a column whose string value is built from the two input columns. Its signature is overlay(input, replace, pos[, len]), with four parameters: "input", a string or binary expression to be modified; "replace", an expression of the same type as "input"; "pos", an integer expression giving the 1-based start position; and "len", an optional integer expression giving the number of characters (or bytes) to replace, which defaults to the length of "replace". The result has the same type as "input".
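These semantics can be sketched in plain Python. The helper below is hypothetical (it is not part of PySpark); it only mirrors Spark SQL's 1-based positions and the default len behavior described above:

```python
def overlay_str(input_s: str, replace: str, pos: int, length: int = -1) -> str:
    """Plain-Python sketch of Spark SQL overlay semantics (hypothetical helper).

    pos is 1-based, as in Spark SQL; a negative length means "use len(replace)".
    """
    if length < 0:
        length = len(replace)
    # Keep everything before pos, splice in replace, then skip `length` characters.
    return input_s[:pos - 1] + replace + input_s[pos - 1 + length:]

print(overlay_str("SPARK_SQL", "CORE", 7))      # SPARK_CORE
print(overlay_str("SPARK_SQL", "ANSI ", 7, 0))  # SPARK_ANSI SQL (len=0 inserts)
```

Passing len=0 replaces nothing, so the replacement string is simply inserted at pos.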

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the overlay() function does and how to use it in PySpark.

Implementing the overlay() function in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import overlay

The SparkSession and overlay packages are imported into the environment to use the overlay() function in PySpark.

# Implementing the overlay() function in Databricks in PySpark
spark = SparkSession.builder.master("local[1]").appName("PySpark Overlay()").getOrCreate()
# Creating a sample DataFrame
Sample_address = [(1, "15861 Bhagat Singh", "RJ"),
                  (2, "45698 Ashoka Road", "DE"),
                  (3, "23654 Laxmi Nagar", "Bi")]
dataframe = spark.createDataFrame(Sample_address, ["id", "address", "state"])
dataframe.show()
# Using the overlay() function
dataframe = spark.createDataFrame([("FGHIJ_WSY", "HIJ")], ("col1", "col2"))
dataframe.select(overlay("col1", "col2", 8).alias("overlayed_column")).show()

The "Sample_address" list defines the sample data, which is loaded into a DataFrame and displayed. The overlay() call then replaces part of the "col1" value ("FGHIJ_WSY") with the "col2" value ("HIJ") starting at position 8, producing "FGHIJ_WHIJ". The resulting column is aliased as "overlayed_column" so it is easier to identify in the output.
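The value the overlay() call above produces can be verified with plain-Python slice arithmetic. This is only a sketch of the splice, not PySpark itself:

```python
# Plain-Python check of overlay("col1", "col2", 8) with the default len,
# which equals len(replace); positions are 1-based as in Spark SQL.
col1, col2, pos = "FGHIJ_WSY", "HIJ", 8
overlayed = col1[:pos - 1] + col2 + col1[pos - 1 + len(col2):]
print(overlayed)  # FGHIJ_WHIJ
```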


Relevant Projects

SQL Project for Data Analysis using Oracle Database-Part 1
In this SQL Project for Data Analysis, you will learn to efficiently leverage various analytical features and functions accessible through SQL in Oracle Database

Build Classification and Clustering Models with PySpark and MLlib
In this PySpark Project, you will learn to implement pyspark classification and clustering model examples using Spark MLlib.

SQL Project for Data Analysis using Oracle Database-Part 6
In this SQL project, you will learn the basics of data wrangling with SQL to perform operations on missing data, unwanted features and duplicated records.

Build a Real-Time Dashboard with Spark, Grafana, and InfluxDB
Use Spark , Grafana, and InfluxDB to build a real-time e-commerce users analytics dashboard by consuming different events such as user clicks, orders, demographics

Spark Project-Analysis and Visualization on Yelp Dataset
The goal of this Spark project is to analyze business reviews from Yelp dataset and ingest the final output of data processing in Elastic Search.Also, use the visualisation tool in the ELK stack to visualize various kinds of ad-hoc reports from the data.

Hands-On Real Time PySpark Project for Beginners
In this PySpark project, you will learn about fundamental Spark architectural concepts like Spark Sessions, Transformation, Actions, and Optimization Techniques using PySpark

Data Processing and Transformation in Hive using Azure VM
Hive Practice Example - Explore hive usage efficiently for data transformation and processing in this big data project using Azure VM.

Azure Stream Analytics for Real-Time Cab Service Monitoring
Build an end-to-end stream processing pipeline using Azure Stream Analytics for real time cab service monitoring

AWS Project-Website Monitoring using AWS Lambda and Aurora
In this AWS Project, you will learn the best practices for website monitoring using AWS services like Lambda, Aurora MySQL, Amazon Dynamo DB and Kinesis.

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project-Get a handle on using Python with Spark through this hands-on data processing spark python tutorial.