Explain the translate() and substring() functions in PySpark in Databricks

This recipe explains what the translate() and substring() functions do in PySpark in Databricks.

Recipe Objective - Explain the translate() and substring() functions in PySpark in Databricks

The translate() function in Apache PySpark replaces, character by character, every character in a column value that appears in a user-defined matchString with the character at the same position in the given replaceString. Its signature, translate(Column, String, String), takes three parameters: the Column to which the function is applied, the matchString listing the characters to match, and the replaceString listing the characters to substitute for them. It returns a Column object.

The substring() function in Apache PySpark extracts a substring from a DataFrame string column, starting at a user-provided (1-based) position and running for a given length. It can be used with the select() and selectExpr() functions, or with the Column.substr() method, to split a column such as a date string into its year, month, and day parts.

Learn Spark SQL for Relational Big Data Processing

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains translate() and substring() functions and how to perform them in PySpark.

Implementing the translate() and substring() functions in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import translate, col, substring


The SparkSession, translate, col, and substring imports bring the required classes and functions into the environment to perform the translate() and substring() functions in PySpark.

# Implementing the translate() and substring() functions in Databricks in PySpark
spark = SparkSession.builder.master("local[1]").appName("PySpark Translate() Substring()").getOrCreate()
Sample_address = [(1, "15861 Bhagat Singh", "RJ"),
                  (2, "45698 Ashoka Road", "DE"),
                  (3, "23654 Laxmi Nagar", "Bi")]
dataframe = spark.createDataFrame(Sample_address, ["id", "address", "state"])
dataframe.show()
# Using the translate() function: 2 -> D, 3 -> E, 4 -> F
dataframe.withColumn('address', translate('address', '234', 'DEF')) \
    .show(truncate=False)
# Defining data for the substring() function
Sample_data = [(1, "30654128"), (2, "36985215")]
Sample_columns = ["id", "date"]
dataframe1 = spark.createDataFrame(Sample_data, Sample_columns)
# Using the substring() function with the select() function
dataframe2 = dataframe1.select('date', substring('date', 1, 4).alias('year'),
                               substring('date', 5, 2).alias('month'),
                               substring('date', 7, 2).alias('day'))
dataframe2.show()
# Using the substring() function with the selectExpr() function
dataframe3 = dataframe1.selectExpr('date', 'substring(date, 1, 4) as year',
                                   'substring(date, 5, 2) as month',
                                   'substring(date, 7, 2) as day')
dataframe3.show()
# Using the Column.substr() method
dataframe4 = dataframe1.withColumn('year', col('date').substr(1, 4)) \
    .withColumn('month', col('date').substr(5, 2)) \
    .withColumn('day', col('date').substr(7, 2))
dataframe4.show()


The "Sample_address" list defines the input data. The translate() function then replaces every 2 with D, every 3 with E, and every 4 with F in the address column of the dataframe. Next, "Sample_data" and "Sample_columns" are defined for the substring() function and used to build "dataframe1". "dataframe2" is built using the substring() function with the select() function, and "dataframe3" using substring() inside the selectExpr() function, each splitting the date column into year, month, and day. Finally, "dataframe4" is built using the Column.substr() method to produce the same result.


Relevant Projects

Build a big data pipeline with AWS Quicksight, Druid, and Hive
Use the dataset on aviation for analytics to simulate a complex real-world big data pipeline based on messaging with AWS Quicksight, Druid, NiFi, Kafka, and Hive.

PySpark Project-Build a Data Pipeline using Kafka and Redshift
In this PySpark ETL Project, you will learn to build a data pipeline and perform ETL operations by integrating PySpark with Apache Kafka and AWS Redshift.

Build Streaming Data Pipeline using Azure Stream Analytics
In this Azure Data Engineering Project, you will learn how to build a real-time streaming platform using Azure Stream Analytics, Azure Event Hub, and Azure SQL database.

Python and MongoDB Project for Beginners with Source Code-Part 1
In this Python and MongoDB Project, you learn to do data analysis using PyMongo on MongoDB Atlas Cluster.

AWS Project-Website Monitoring using AWS Lambda and Aurora
In this AWS Project, you will learn the best practices for website monitoring using AWS services like Lambda, Aurora MySQL, Amazon Dynamo DB and Kinesis.

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project-Get a handle on using Python with Spark through this hands-on data processing spark python tutorial.

Build a Real-Time Dashboard with Spark, Grafana, and InfluxDB
Use Spark, Grafana, and InfluxDB to build a real-time e-commerce user analytics dashboard by consuming different events such as user clicks, orders, and demographics.

Log Analytics Project with Spark Streaming and Kafka
In this spark project, you will use the real-world production logs from NASA Kennedy Space Center WWW server in Florida to perform scalable log analytics with Apache Spark, Python, and Kafka.

Build Classification and Clustering Models with PySpark and MLlib
In this PySpark Project, you will learn to implement pyspark classification and clustering model examples using Spark MLlib.

Project-Driven Approach to PySpark Partitioning Best Practices
In this Big Data Project, you will learn to implement PySpark Partitioning Best Practices.