Read older versions of the data using time travel in Databricks

This recipe helps you read older versions of data using time travel in Databricks. The Delta Lake table, defined as the Delta table, is both a batch table and a streaming source and sink.

Recipe Objective - How to read older versions of data using time travel in Databricks?

The Delta Lake table, defined as the Delta table, is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all work out of the box. Delta Lake provides the ability to specify a schema and enforce it, which helps ensure that data types are correct and required columns are present, and prevents bad data from causing corruption in the Delta table. Delta can write batch and streaming data into the same table, allowing a simpler architecture and quicker ingestion from data to query result. Delta can also infer the schema of input data, which reduces the effort required to manage schema changes. Previous snapshots of a Delta table can be queried using time travel, so older versions of the data remain easily accessible. Time travel takes advantage of the Delta Lake transaction log to access data that is no longer in the current version of the table.
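
Besides selecting a snapshot by version number, Delta Lake also allows selecting one by timestamp. Below is a minimal sketch of that variant, assuming a Delta table already exists at "/delta/events"; the path and the timestamp value are only illustrations.

// Sketch: reading a past snapshot by timestamp instead of by version number
import org.apache.spark.sql.SparkSession

object DeltaTimeTravelByTimestamp extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("Delta Time Travel By Timestamp")
    .getOrCreate()
  // "timestampAsOf" returns the snapshot that was current at the given time
  val snapshot = spark.read.format("delta")
    .option("timestampAsOf", "2021-01-01 00:00:00") // illustrative timestamp
    .load("/delta/events")
  snapshot.show()
}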

System Requirements

  • Scala (2.12 version)
  • Apache Spark (3.1.1 version)

This recipe explains Delta Lake and how to read older versions of data using time travel in Spark.

Implementing reading of older versions of data in a Delta table

// Importing packages
import org.apache.spark.sql.{SaveMode, SparkSession}
import io.delta.implicits._

The Spark SQL SaveMode and SparkSession classes and the Delta implicits package are imported into the environment to read older versions of data using time travel in Databricks.

// Implementing reading of older versions of data in a Delta table
object DeltaTableOlderVersions extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("Spark Read Older Version Delta table")
    .getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")
  // Reading older versions of data of the Delta table
  val read_older = spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/delta/events")
  read_older.show()
}

The DeltaTableOlderVersions object is created, and a Spark session is initiated within it. The Delta table at the path "/delta/events" is loaded using spark.read.format("delta"), and the value "read_older" reads the table using time travel. The "versionAsOf" option selects which snapshot is queried: version 0 displays the first set of data written to the table, while version 1 would display the data as of the next commit. The table's commit history can be inspected to find the available versions, as sketched below.
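
The Delta Lake DeltaTable API exposes the table's commit history, which is useful for finding which versions exist before querying one. The following is a minimal sketch, assuming the same table at "/delta/events", the Spark session created above, and that the table has at least two commits.

// Listing the available versions of the Delta table
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath(spark, "/delta/events")
// Each history row carries the version number, commit timestamp and operation
deltaTable.history().select("version", "timestamp", "operation").show()

// Reading version 1 returns the table as it looked after the second commit
val read_v1 = spark.read.format("delta").option("versionAsOf", 1).load("/delta/events")
read_v1.show()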

