What is Delta table as a stream sink in Databricks


Recipe Objective - What is Delta table as a stream sink in Databricks?

A Delta Lake table, referred to as a Delta table, can serve as a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all work out of the box. Delta Lake lets you specify and enforce a schema, which ensures that data types are correct and that required columns are present, preventing bad data from causing corruption in the Delta table. Delta can write batch and streaming data into the same table, allowing a simpler architecture and quicker ingestion-to-query results. Delta can also infer the schema of the input data, which reduces the effort required to manage schema changes. Finally, Delta Lake integrates with Spark Structured Streaming through "readStream" and "writeStream", so data can be written into a Delta table using Structured Streaming.
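As a minimal sketch of this integration, the snippet below streams records from one Delta table into another in the default append mode. The paths "/delta/events" and "/delta/eventsCopy" are hypothetical placeholders for illustration:

```scala
// Sketch: streaming from one Delta table into another (append mode).
import org.apache.spark.sql.SparkSession
import io.delta.implicits._

object DeltaAppendSink extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("Append stream to Delta table")
    .getOrCreate()

  // Read a stream of events from an existing Delta table (source)
  // and append each new record to another Delta table (sink).
  spark.readStream
    .format("delta")
    .load("/delta/events")              // hypothetical source path
    .writeStream
    .format("delta")
    .outputMode("append")               // default mode: add new records only
    .option("checkpointLocation", "/delta/eventsCopy/_checkpoints")
    .start("/delta/eventsCopy")         // hypothetical sink path
}
```

The checkpoint location is what lets Structured Streaming track progress and recover after a restart without reprocessing records.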


Further, the transaction log enables Delta Lake to guarantee exactly-once processing, even when other streams or batch queries run concurrently against the table. By default, streams run in append mode, which adds new records to the table; alternatively, Structured Streaming can replace the entire table with every batch (complete mode). One use case for complete mode is computing a summary using aggregation.

System Requirements

  • Scala (2.12 version)
  • Apache Spark (3.1.1 version)

This recipe explains Delta Lake and how a Delta table is used as a stream sink in Spark.

Implementing Delta table as a stream sink

// Importing packages
import org.apache.spark.sql.{SaveMode, SparkSession}
import io.delta.implicits._


The Spark SQL SaveMode and SparkSession packages and the Delta implicits package are imported into the environment to use a Delta table as a stream sink in Databricks.

// Implementing Delta table as a sink
object DeltaTableSink extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("Spark Sink Delta table")
    .getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")

  // Using Complete mode for Delta table as a sink
  // Computing a summary using the aggregation
  spark.readStream
    .format("delta")
    .load("/delta/events")
    .groupBy("customerName")
    .count()
    .writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/delta/eventsByCustomer/_checkpoints/streaming-agg")
    .start("/delta/eventsByCustomer")
}


The DeltaTableSink object is created, in which a Spark session is initiated. Using Complete mode, the target table is continuously updated with the aggregate number of events per customer: "spark.readStream" reads the "/delta/events" Delta table as a stream, the events are grouped by "customerName" and counted, and the summary is written back in "delta" format.
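Because a Delta table is also a batch table, the continuously updated aggregation can be queried as an ordinary snapshot read while the stream runs. The sketch below assumes the "/delta/eventsByCustomer" sink path from the listing above:

```scala
// Sketch: batch-reading the Delta table that the stream is updating.
import org.apache.spark.sql.SparkSession

object QueryAggregates extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("Query aggregated Delta table")
    .getOrCreate()

  // Read the current snapshot of the aggregation and display it.
  spark.read
    .format("delta")
    .load("/delta/eventsByCustomer")   // sink path from the streaming job
    .show()
}
```

Each batch read sees a consistent snapshot of the table, thanks to the same transaction log that gives the stream its exactly-once guarantee.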

