Merge in Delta table using the data deduplication technique in Databricks

This recipe helps you merge into a Delta table using the data deduplication technique in Databricks.

Recipe Objective - How to merge in a Delta table using the data deduplication technique?

The Delta Lake table, defined as the Delta table, is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all work out of the box. Delta Lake provides the ability to specify and enforce a schema, which helps ensure that data types are correct and required columns are present; this supports building Delta tables and prevents bad data from causing corruption in both the data lake and the Delta table. Delta can write batch and streaming data into the same table, allowing a simpler architecture and a quicker path from data ingestion to query results. Delta also provides the ability to infer the schema of the input data, which further reduces the effort required to manage schema changes. Sources can often generate duplicate log records, so downstream deduplication steps are needed to take care of them. A common ETL (Extract, Transform and Load) use case collects logs into a Delta table by appending them, and using merge, inserting duplicate records can be avoided. The merge can be optimized further by partitioning the table by date: when it is known that duplicates will only be generated for a few days, the merge condition can specify the matching date range of the target table, so that only those partitions are scanned, as sketched below.
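As an illustration of that last point, here is a minimal sketch of a date-constrained merge. It reuses the deltaTable and newDedupedLogs values defined later in this recipe and assumes the target table is partitioned by a "date" column; the seven-day window is illustrative, not part of this recipe's dataset.

# Restricting both the match and the insert to recent partitions
# so that older partitions are pruned from the scan
deltaTable.alias("logs").merge(
    newDedupedLogs.alias("newDedupedLogs"),
    "logs.id = newDedupedLogs.id AND logs.date > current_date() - INTERVAL 7 DAYS") \
  .whenNotMatchedInsertAll(condition = "newDedupedLogs.date > current_date() - INTERVAL 7 DAYS") \
  .execute()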


This recipe explains Delta Lake and how to merge into a Delta table using the data deduplication technique in Spark.

Implementing Merge in Delta table using Data Deduplication

# Importing packages
from delta.tables import *
from pyspark.sql.functions import *


The Delta tables and PySpark SQL functions are imported to perform an UPSERT (MERGE) in a Delta table in Databricks.

# Implementing Merge in Delta table using Data Deduplication
# Logs
deltaTable = DeltaTable.forPath(spark, "/data/events_old/")

# New Deduped Logs
newDedupedLogs = spark.read.format("delta").load("/data/events/")

# Executing merge function
# Using Data Deduplication
deltaTable.alias("logs").merge(
    newDedupedLogs.alias("newDedupedLogs"),
    "logs.id = newDedupedLogs.id") \
  .whenNotMatchedInsertAll() \
  .execute()


The existing logs are stored in a Delta table at the path "/data/events_old/" and are referenced through the "deltaTable" value, aliased as "logs". The "newDedupedLogs" value contains the deduplicated logs read from the Delta table stored at the path "/data/events/". The merge function is executed against the two Delta tables by matching "logs.id" with "newDedupedLogs.id"; because only the whenNotMatchedInsertAll clause is specified, records whose id already exists in the target table are skipped, so duplicate log records are never inserted.
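To sanity-check the result, the merged table can be read back and checked for duplicate ids. This verification step is a minimal sketch and is not part of the original recipe.

# Reading back the merged Delta table
mergedLogs = spark.read.format("delta").load("/data/events_old/")

# Ids that appear more than once; expected to return no rows after the deduplicated merge
mergedLogs.groupBy("id").count().filter("count > 1").show()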

