Merge in Delta table using the data deduplication technique in Databricks

This recipe helps you merge into a Delta table using the data deduplication technique in Databricks.

Recipe Objective - How to merge in a Delta table using the data deduplication technique?

The Delta Lake table, defined as the Delta table, is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all work out of the box. Delta Lake provides the ability to specify and enforce a schema, which helps ensure that data types are correct and required columns are present; this supports building Delta tables and prevents bad data from causing corruption in both the data lake and the Delta table. Delta can write batch and streaming data into the same table, allowing a simpler architecture and a quicker path from data ingestion to query results. Delta also provides the ability to infer the schema of the input data, which further reduces the effort required to manage schema changes. Sources can often generate duplicate log records, so downstream deduplication steps are needed to take care of them. A common ETL (Extract, Transform and Load) use case collects logs into a Delta table by appending them, and using merge, inserting duplicate records can be avoided. The merge can be optimized further by partitioning the table by date: when it is known that duplicates will only be generated for a few days, the merge condition can specify the matching date range of the target table, so that only those partitions are scanned, as sketched below.
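As an illustration of that last point, here is a minimal sketch of a date-constrained merge. It reuses the deltaTable and newDedupedLogs values defined later in this recipe and assumes the target table is partitioned by a "date" column; the seven-day window is illustrative, not part of this recipe's dataset.

# Restricting both the match and the insert to recent partitions
# so that older partitions are pruned from the scan
deltaTable.alias("logs").merge(
    newDedupedLogs.alias("newDedupedLogs"),
    "logs.id = newDedupedLogs.id AND logs.date > current_date() - INTERVAL 7 DAYS") \
  .whenNotMatchedInsertAll(condition = "newDedupedLogs.date > current_date() - INTERVAL 7 DAYS") \
  .execute()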


This recipe explains Delta Lake and how to merge into a Delta table using the data deduplication technique in Spark.

Implementing Merge in Delta table using Data Deduplication

# Importing packages
from delta.tables import *
from pyspark.sql.functions import *


The Delta tables and PySpark SQL functions are imported to perform an UPSERT (MERGE) in a Delta table in Databricks.

# Implementing Merge in Delta table using Data Deduplication
# Logs
deltaTable = DeltaTable.forPath(spark, "/data/events_old/")

# New Deduped Logs
newDedupedLogs = spark.read.format("delta").load("/data/events/")

# Executing merge function
# Using Data Deduplication
deltaTable.alias("logs").merge(
    newDedupedLogs.alias("newDedupedLogs"),
    "logs.id = newDedupedLogs.id") \
  .whenNotMatchedInsertAll() \
  .execute()


The existing logs are stored in a Delta table at the path "/data/events_old/" and are referenced through the "deltaTable" value, aliased as "logs". The "newDedupedLogs" value contains the deduplicated logs read from the Delta table stored at the path "/data/events/". The merge function is executed against the two Delta tables by matching "logs.id" with "newDedupedLogs.id"; because only the whenNotMatchedInsertAll clause is specified, records whose id already exists in the target table are skipped, so duplicate log records are never inserted.
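To sanity-check the result, the merged table can be read back and checked for duplicate ids. This verification step is a minimal sketch and is not part of the original recipe.

# Reading back the merged Delta table
mergedLogs = spark.read.format("delta").load("/data/events_old/")

# Ids that appear more than once; expected to return no rows after the deduplicated merge
mergedLogs.groupBy("id").count().filter("count > 1").show()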

