Explain the translate() and substring() functions in PySpark in Databricks

This recipe explains what the translate() and substring() functions do in PySpark in Databricks.

Recipe Objective - Explain the translate() and substring() functions in PySpark in Databricks

The translate() function in Apache PySpark replaces, character by character, every character in a column value that appears in a user-defined matchString with the character at the same position in the given replaceString. Its signature, translate(Column, String, String), takes three parameters: the Column to which the function is applied, the matchString listing the characters to match, and the replaceString listing the characters to substitute for them. It returns a Column object.

The substring() function in Apache PySpark extracts a substring from a DataFrame string column, starting at a user-provided (1-based) position and running for a given length. It can be used with the select() and selectExpr() functions, or with the Column.substr() method, to split a column such as a date string into its year, month, and day parts.

Learn Spark SQL for Relational Big Data Processing

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains translate() and substring() functions and how to perform them in PySpark.

Implementing the translate() and substring() functions in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import translate, col, substring


The SparkSession, translate, col, and substring imports bring the required classes and functions into the environment to perform the translate() and substring() functions in PySpark.

# Implementing the translate() and substring() functions in Databricks in PySpark
spark = SparkSession.builder.master("local[1]").appName("PySpark Translate() Substring()").getOrCreate()
Sample_address = [(1, "15861 Bhagat Singh", "RJ"),
                  (2, "45698 Ashoka Road", "DE"),
                  (3, "23654 Laxmi Nagar", "Bi")]
dataframe = spark.createDataFrame(Sample_address, ["id", "address", "state"])
dataframe.show()
# Using the translate() function: 2 -> D, 3 -> E, 4 -> F
dataframe.withColumn('address', translate('address', '234', 'DEF')) \
    .show(truncate=False)
# Defining data for the substring() function
Sample_data = [(1, "30654128"), (2, "36985215")]
Sample_columns = ["id", "date"]
dataframe1 = spark.createDataFrame(Sample_data, Sample_columns)
# Using the substring() function with the select() function
dataframe2 = dataframe1.select('date', substring('date', 1, 4).alias('year'),
                               substring('date', 5, 2).alias('month'),
                               substring('date', 7, 2).alias('day'))
dataframe2.show()
# Using the substring() function with the selectExpr() function
dataframe3 = dataframe1.selectExpr('date', 'substring(date, 1, 4) as year',
                                   'substring(date, 5, 2) as month',
                                   'substring(date, 7, 2) as day')
dataframe3.show()
# Using the Column.substr() method
dataframe4 = dataframe1.withColumn('year', col('date').substr(1, 4)) \
    .withColumn('month', col('date').substr(5, 2)) \
    .withColumn('day', col('date').substr(7, 2))
dataframe4.show()


The "Sample_address" list defines the input data. The translate() function then replaces every 2 with D, every 3 with E, and every 4 with F in the address column of the dataframe. Next, "Sample_data" and "Sample_columns" are defined for the substring() function and used to build "dataframe1". "dataframe2" is built using the substring() function with the select() function, and "dataframe3" using substring() inside the selectExpr() function, each splitting the date column into year, month, and day. Finally, "dataframe4" is built using the Column.substr() method to produce the same result.


Relevant Projects

Build a big data pipeline with AWS Quicksight, Druid, and Hive
Use the dataset on aviation for analytics to simulate a complex real-world big data pipeline based on messaging with AWS Quicksight, Druid, NiFi, Kafka, and Hive.

PySpark Project-Build a Data Pipeline using Kafka and Redshift
In this PySpark ETL Project, you will learn to build a data pipeline and perform ETL operations by integrating PySpark with Apache Kafka and AWS Redshift.

Build Streaming Data Pipeline using Azure Stream Analytics
In this Azure Data Engineering Project, you will learn how to build a real-time streaming platform using Azure Stream Analytics, Azure Event Hub, and Azure SQL database.

Python and MongoDB Project for Beginners with Source Code-Part 1
In this Python and MongoDB Project, you learn to do data analysis using PyMongo on MongoDB Atlas Cluster.

AWS Project-Website Monitoring using AWS Lambda and Aurora
In this AWS Project, you will learn the best practices for website monitoring using AWS services like Lambda, Aurora MySQL, Amazon Dynamo DB and Kinesis.

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project-Get a handle on using Python with Spark through this hands-on data processing spark python tutorial.

Build a Real-Time Dashboard with Spark, Grafana, and InfluxDB
Use Spark, Grafana, and InfluxDB to build a real-time e-commerce user analytics dashboard by consuming different events such as user clicks, orders, and demographics.

Log Analytics Project with Spark Streaming and Kafka
In this spark project, you will use the real-world production logs from NASA Kennedy Space Center WWW server in Florida to perform scalable log analytics with Apache Spark, Python, and Kafka.

Build Classification and Clustering Models with PySpark and MLlib
In this PySpark Project, you will learn to implement pyspark classification and clustering model examples using Spark MLlib.

Project-Driven Approach to PySpark Partitioning Best Practices
In this Big Data Project, you will learn to implement PySpark Partitioning Best Practices.