Explain translate and substring function in PySpark in Databricks

This recipe explains what the translate and substring functions do in PySpark in Databricks.

Recipe Objective - Explain the translate() and substring() functions in PySpark in Databricks

The translate() function in Apache PySpark replaces every character in a column that appears in the user-supplied matchString with the character at the same position in replaceString. This makes it useful for character-by-character substitution across a DataFrame column. Its signature is translate(Column, String, String): the Column the mapping is applied to, the matchString listing the characters to match, and the replaceString listing their replacements. It returns a Column object.

The substring() function in Apache PySpark extracts a substring from a DataFrame string column, starting at a user-provided (1-based) position and running for a given length. The substring() function can be used with the select() and selectExpr() functions to split a column (such as a date) into year, month, and day parts.

Learn Spark SQL for Relational Big Data Processing

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains the translate() and substring() functions and demonstrates how to use them in PySpark.

Implementing the translate() and substring() functions in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, translate, substring


The SparkSession class and the translate, col, and substring functions are imported into the environment to perform the translate() and substring() functions in PySpark.

# Implementing the translate() and substring() functions in Databricks in PySpark
spark = SparkSession.builder.master("local[1]").appName("PySpark Translate() Substring()").getOrCreate()
Sample_address = [(1,"15861 Bhagat Singh","RJ"),
    (2,"45698 Ashoka Road","DE"),
    (3,"23654 Laxmi Nagar","Bi")]
dataframe = spark.createDataFrame(Sample_address, ["id","address","state"])
dataframe.show()
# Using the translate() function: '2' -> 'D', '3' -> 'E', '4' -> 'F'
dataframe.withColumn('address', translate('address', '234', 'DEF')) \
    .show(truncate=False)
# Defining data for the substring() function
Sample_data = [(1,"30654128"),(2,"36985215")]
Sample_columns = ["id","date"]
dataframe1 = spark.createDataFrame(Sample_data, Sample_columns)
# Using the substring() function with the select() function
dataframe2 = dataframe1.select('date', substring('date', 1, 4).alias('year'), \
    substring('date', 5, 2).alias('month'), \
    substring('date', 7, 2).alias('day'))
# Using the substring() function with the selectExpr() function
dataframe3 = dataframe1.selectExpr('date', 'substring(date, 1, 4) as year', \
    'substring(date, 5, 2) as month', \
    'substring(date, 7, 2) as day')
# Using the substr() method of the Column type
dataframe4 = dataframe1.withColumn('year', col('date').substr(1, 4)) \
    .withColumn('month', col('date').substr(5, 2)) \
    .withColumn('day', col('date').substr(7, 2))
dataframe4.show()


The "Sample_address" data is created and used to build the first DataFrame. The translate() function then maps every character '2' to 'D', '3' to 'E', and '4' to 'F' in the address column of the dataframe. Next, "Sample_data" and "Sample_columns" are defined for the substring() function and "dataframe1" is created from them. "dataframe2" is built using the substring() function with the select() function. "dataframe3" is built using the substring() function with the selectExpr() function, extracting the year, month, and day parts of the date column. Finally, "dataframe4" is built using the substr() method of the Column type.

