datediff and months_between functions in PySpark in Databricks

This recipe gives a detailed demonstration of how to use the datediff() and months_between() functions in PySpark on Databricks, along with a worked Python example.

Recipe Objective - Explain the datediff() and months_between() functions in PySpark in Databricks

The datediff() function in Apache PySpark is commonly used to get the number of days between two dates. Its syntax is datediff(end, start): the first argument is the end date and the second is the start date that is subtracted from it. Similarly, the months_between() function returns the number of months between two dates. Its syntax is months_between(end, start): again, the first argument is the end date and the second is the start date subtracted from it; the result can be fractional when the two dates fall on different days of the month.
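To build intuition for these semantics, here is a plain-Python sketch that mirrors how datediff() and months_between() behave, using only the standard datetime and calendar modules (no Spark required). The helper functions below are hypothetical stand-ins written for illustration, not part of the PySpark API, and the months_between logic is an assumption based on Spark's documented rule (whole months when the days match or both dates are month-ends, otherwise a /31 day remainder).

```python
import calendar
from datetime import date

def datediff(end, start):
    # Mirrors PySpark's datediff(end, start): whole days, end minus start
    return (end - start).days

def months_between(end, start):
    # Approximates Spark's months_between(end, start): an integer number of
    # months when the days of month match or both dates are month-ends,
    # otherwise the day difference divided by 31 is added as a fraction
    months = (end.year - start.year) * 12 + (end.month - start.month)
    end_is_last = end.day == calendar.monthrange(end.year, end.month)[1]
    start_is_last = start.day == calendar.monthrange(start.year, start.month)[1]
    if end.day == start.day or (end_is_last and start_is_last):
        return float(months)
    return months + (end.day - start.day) / 31.0

print(datediff(date(2021, 6, 7), date(2021, 4, 1)))               # 67
print(round(months_between(date(2021, 6, 7), date(2021, 4, 1)), 3))  # 2.194
```

Dividing the leftover days by 31 rather than by the actual month length is what makes Spark's fractional results look slightly "off" at first glance, so rounding the output (as the recipe does later with round(..., 3)) is common.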


System Requirements

  • Python (3.x version)
  • Apache Spark (3.1.1 version)

This recipe explains the datediff() and months_between() functions and how to use them in PySpark.

Implementing the datediff() and months_between() function in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date, datediff, months_between, round, to_date, lit

The SparkSession and the required functions are imported into the environment so that datediff() and months_between() can be used in PySpark.

# Implementing the datediff() and months_between() functions in Databricks in PySpark
spark = SparkSession.builder \
.appName('PySpark datediff() and months_between()') \
.getOrCreate()
data = [("1", "2021-04-01"), ("2", "2021-06-07"), ("3", "2021-08-09")]
dataframe = spark.createDataFrame(data=data, schema=["id", "date"])
# Using datediff() function
dataframe.select(
    col("date"),
    current_date().alias("Current_Date"),
    datediff(current_date(), col("date")).alias("datediff")
).show()
# Using months_between() function
dataframe.withColumn("Dates_Diff", datediff(current_date(),col("date"))) \
.withColumn("Months_Diff", months_between(current_date(),col("date"))) \
.withColumn("Months_Diff_round",round(months_between(current_date(),col("date")),3)) \
.withColumn("Years_Diff",months_between(current_date(),col("date"))/lit(12)) \
.withColumn("Years_Diff_round",round(months_between(current_date(),col("date"))/lit(12),3)) \
.show()
# Defining data in a non-default PySpark date format
data2 = [("1","01-04-2021"),("2","07-06-2021"),("3","09-08-2021")]
dataframe2 = spark.createDataFrame(data=data2,schema=["id","date"])
# Converting the data into default PySpark data format
dataframe2.select(
    to_date(col("date"), "MM-dd-yyyy").alias("date"),
    current_date().alias("endDate")
).show()

The "dataframe" value is created with the sample data. The datediff() function returns the number of days between each specified date and the current date; the alias "datediff" makes the output column easy to identify. The months_between() function returns the number of months between each specified date and the current date; the aliases "Dates_Diff", "Months_Diff", "Months_Diff_round", "Years_Diff", and "Years_Diff_round" label its outputs. The "dataframe2" value is created with dates that are not in the default "yyyy-MM-dd" format, so they are converted with the to_date() function; without this conversion, the date functions would return NULL.
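The NULL behavior mentioned above can also be mirrored in plain Python. The helper below is a hypothetical stand-in for illustration, not the PySpark API: it sketches what to_date(col("date"), "MM-dd-yyyy") does to each string, returning None (the Python analogue of Spark's NULL) when the string does not match the given format.

```python
from datetime import datetime

def to_date(text, fmt="%m-%d-%Y"):
    # Mirrors to_date(col, "MM-dd-yyyy"): parse the non-default string
    # format; return None (PySpark would produce NULL) when it does not match
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None

print(to_date("01-04-2021"))  # 2021-01-04, parsed as MM-dd-yyyy
print(to_date("2021-04-01"))  # None: a default-format string fails to parse
```

This is why the recipe converts dataframe2 with to_date() first: feeding "MM-dd-yyyy" strings straight into datediff() or months_between() would yield NULL rows rather than an error.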

