Explain the overlay function in PySpark in Databricks

This recipe explains what the overlay() function does in PySpark in Databricks.

Recipe Objective - Explain the overlay() function in PySpark in Databricks

The overlay() function in Apache PySpark replaces part of an input string with a replacement string, starting at position pos and spanning len characters, and returns the resulting column value. Its signature is overlay(input, replace, pos[, len]), with four parameters: "input", the string or binary expression to operate on; "replace", an expression of the same type as "input"; "pos", an integer expression giving the 1-based start position; and "len", an optional integer expression giving how many characters of "input" to overwrite (by default, the length of "replace"). The result has the same type as "input".
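A quick way to see what overlay() computes is to mirror its string logic in plain Python. The helper below is only an illustration of the semantics described above, not PySpark API code; the sample strings come from the Spark SQL documentation's overlay examples.

```python
def overlay_py(src: str, rep: str, pos: int, length: int = -1) -> str:
    """Mirror Spark's overlay(): replace `length` characters of `src`,
    starting at 1-based position `pos`, with `rep`. A negative `length`
    defaults to len(rep), matching Spark's default."""
    if length < 0:
        length = len(rep)
    return src[:pos - 1] + rep + src[pos - 1 + length:]

print(overlay_py("SPARK_SQL", "CORE", 7))      # SPARK_CORE
print(overlay_py("SPARK_SQL", "ANSI ", 7, 0))  # SPARK_ANSI SQL
```

With len omitted, exactly len(replace) characters are overwritten; with len=0, the replacement is inserted without consuming any characters of the input.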

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the overlay() function is and how to use it in PySpark.

Implementing the overlay() function in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import overlay

The SparkSession and overlay packages are imported into the environment to perform the overlay() function in PySpark.

# Implementing the overlay() function in Databricks in PySpark
spark = SparkSession.builder.master("local[1]").appName("PySpark Overlay()").getOrCreate()
Sample_address = [(1, "15861 Bhagat Singh", "RJ"),
                  (2, "45698 Ashoka Road", "DE"),
                  (3, "23654 Laxmi Nagar", "Bi")]
dataframe = spark.createDataFrame(Sample_address, ["id", "address", "state"])
dataframe.show()
# Using the overlay() function
dataframe = spark.createDataFrame([("FGHIJ_WSY", "HIJ")], ("col1", "col2"))
dataframe.select(overlay("col1", "col2", 8).alias("overlayed_column")).show()

The "Sample_address" list defines the data for the first DataFrame. The overlay() call then replaces part of the string in "col1" ("FGHIJ_WSY") with the value of "col2" ("HIJ"), starting at position 8, which produces "FGHIJ_WHIJ". The result column is aliased as "overlayed_column" so it is easier to identify.


Relevant Projects

Real-Time Streaming of Twitter Sentiments AWS EC2 NiFi
Learn to perform 1) Twitter Sentiment Analysis using Spark Streaming, NiFi and Kafka, and 2) Build an Interactive Data Visualization for the analysis using Python Plotly.

Orchestrate Redshift ETL using AWS Glue and Step Functions
ETL Orchestration on AWS - Use AWS Glue and Step Functions to fetch source data and glean faster analytical insights on Amazon Redshift Cluster

Build Serverless Pipeline using AWS CDK and Lambda in Python
In this AWS Data Engineering Project, you will learn to build a serverless pipeline using AWS CDK and other AWS serverless technologies like AWS Lambda and Glue.

Azure Stream Analytics for Real-Time Cab Service Monitoring
Build an end-to-end stream processing pipeline using Azure Stream Analytics for real time cab service monitoring

Getting Started with Azure Purview for Data Governance
In this Microsoft Azure Purview Project, you will learn how to consume the ingested data and perform analysis to find insights.

Build an Incremental ETL Pipeline with AWS CDK
Learn how to build an Incremental ETL Pipeline with AWS CDK using Cryptocurrency data

Build Classification and Clustering Models with PySpark and MLlib
In this PySpark Project, you will learn to implement pyspark classification and clustering model examples using Spark MLlib.

Building Data Pipelines in Azure with Azure Synapse Analytics
In this Microsoft Azure Data Engineering Project, you will learn how to build a data pipeline using Azure Synapse Analytics, Azure Storage and Azure Synapse SQL pool to perform data analysis on the 2021 Olympics dataset.

AWS Project-Website Monitoring using AWS Lambda and Aurora
In this AWS Project, you will learn the best practices for website monitoring using AWS services like Lambda, Aurora MySQL, Amazon Dynamo DB and Kinesis.

Build a Scalable Event Based GCP Data Pipeline using DataFlow
In this GCP project, you will learn to build and deploy a fully-managed(serverless) event-driven data pipeline on GCP using services like Cloud Composer, Google Cloud Storage (GCS), Pub-Sub, Cloud Functions, BigQuery, BigTable