What is Delta table as a stream sink in Databricks


Recipe Objective - What is Delta table as a stream sink in Databricks?

A Delta Lake table, referred to as a Delta table, can serve as a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all work out of the box. Delta Lake lets you specify and enforce a schema, which ensures that data types are correct and that required columns are present, preventing bad data from causing corruption in the Delta table. Delta can write batch and streaming data into the same table, allowing a simpler architecture and quicker ingestion-to-query results. Delta can also infer the schema of the input data, which reduces the effort required to manage schema changes. Finally, Delta Lake integrates with Spark Structured Streaming through "readStream" and "writeStream", so data can be written into a Delta table using Structured Streaming.
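As a minimal sketch of this integration, the snippet below streams records from one Delta table into another in the default append mode. The paths "/delta/events" and "/delta/eventsCopy" are hypothetical placeholders for illustration:

```scala
// Sketch: streaming from one Delta table into another (append mode).
import org.apache.spark.sql.SparkSession
import io.delta.implicits._

object DeltaAppendSink extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("Append stream to Delta table")
    .getOrCreate()

  // Read a stream of events from an existing Delta table (source)
  // and append each new record to another Delta table (sink).
  spark.readStream
    .format("delta")
    .load("/delta/events")              // hypothetical source path
    .writeStream
    .format("delta")
    .outputMode("append")               // default mode: add new records only
    .option("checkpointLocation", "/delta/eventsCopy/_checkpoints")
    .start("/delta/eventsCopy")         // hypothetical sink path
}
```

The checkpoint location is what lets Structured Streaming track progress and recover after a restart without reprocessing records.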


Further, the transaction log enables Delta Lake to guarantee exactly-once processing, even when other streams or batch queries run concurrently against the table. By default, streams run in append mode, which adds new records to the table; alternatively, Structured Streaming can replace the entire table with every batch (complete mode). One use case for complete mode is computing a summary using aggregation.

System Requirements

  • Scala (2.12 version)
  • Apache Spark (3.1.1 version)

This recipe explains Delta Lake and how a Delta table is used as a stream sink in Spark.

Implementing Delta table as a stream sink

// Importing packages
import org.apache.spark.sql.{SaveMode, SparkSession}
import io.delta.implicits._


The Spark SQL SaveMode and SparkSession packages and the Delta implicits package are imported into the environment to use a Delta table as a stream sink in Databricks.

// Implementing Delta table as a sink
object DeltaTableSink extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("Spark Sink Delta table")
    .getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")

  // Using Complete mode for Delta table as a sink
  // Computing a summary using the aggregation
  spark.readStream
    .format("delta")
    .load("/delta/events")
    .groupBy("customerName")
    .count()
    .writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/delta/eventsByCustomer/_checkpoints/streaming-agg")
    .start("/delta/eventsByCustomer")
}


The DeltaTableSink object is created, in which a Spark session is initiated. Using Complete mode, the target table is continuously updated with the aggregate number of events per customer: "spark.readStream" reads the "/delta/events" Delta table as a stream, the events are grouped by "customerName" and counted, and the summary is written back in "delta" format.
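Because a Delta table is also a batch table, the continuously updated aggregation can be queried as an ordinary snapshot read while the stream runs. The sketch below assumes the "/delta/eventsByCustomer" sink path from the listing above:

```scala
// Sketch: batch-reading the Delta table that the stream is updating.
import org.apache.spark.sql.SparkSession

object QueryAggregates extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("Query aggregated Delta table")
    .getOrCreate()

  // Read the current snapshot of the aggregation and display it.
  spark.read
    .format("delta")
    .load("/delta/eventsByCustomer")   // sink path from the streaming job
    .show()
}
```

Each batch read sees a consistent snapshot of the table, thanks to the same transaction log that gives the stream its exactly-once guarantee.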

