Create a shallow and deep clone from a Delta table in Databricks

This recipe helps you create a shallow clone and a deep clone from a Delta table in Databricks.

Recipe Objective: How to create a shallow clone and deep clone from a Delta table in Databricks?

Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

In this recipe, we create shallow clone and deep clone tables from a Delta table. A shallow clone copies only the metadata of the source table into the cloned table, while a deep clone copies both the metadata and the data.
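For reference, both operations use the CREATE TABLE ... CLONE syntax in Spark SQL. A minimal sketch with placeholder table names (the actual tables for this recipe are created in the steps below):

// shallow clone: copies only the Delta transaction log (metadata) of the source table
spark.sql("CREATE TABLE IF NOT EXISTS target_db.shallow_copy SHALLOW CLONE source_db.source_table")

// deep clone: copies the transaction log and the data files of the source table (DEEP is the default)
spark.sql("CREATE TABLE IF NOT EXISTS target_db.deep_copy DEEP CLONE source_db.source_table")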


Implementation Info:

  1. Databricks Community Edition
  2. Spark-Scala
  3. Storage - Databricks File System (DBFS)

Step 1: Uploading data to DBFS

Follow the below steps to upload data files from local to DBFS:

  1. Click Create in the Databricks menu.
  2. Click Table in the drop-down menu; it opens a Create New Table UI.
  3. In the UI, specify the folder name in which you want to save your files.
  4. Click Browse to upload files from your local machine.
  5. The uploaded file path will look like /FileStore/tables/your folder name/your file.

Refer to the image below for an example.

bigdata_01.PNG
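After the upload, you can verify that the file landed in DBFS from a notebook cell. A quick check, assuming the file was saved directly under /FileStore/tables/ as sample_emp_data.txt (the path used in the next step):

// list the contents of the upload folder to confirm the file exists
display(dbutils.fs.ls("/FileStore/tables/"))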

Step 2: Creation of Delta Table

Here we create a Delta table, "emp_data", by reading the source file uploaded to DBFS. We use the StructType() function to impose a custom schema on the dataframe. Once the dataframe is created, we write the data into a Delta table as below.

import org.apache.spark.sql.types._

// custom schema for the employee data file
val schema = new StructType()
  .add("Id", IntegerType)
  .add("Name", StringType)
  .add("Department", StringType)
  .add("Salary", DoubleType)
  .add("Doj", TimestampType)
  .add("Date_Updated", DateType)

// read the uploaded CSV file with the custom schema
val df = spark.read.schema(schema).csv("/FileStore/tables/sample_emp_data.txt")
df.show()

// write the dataframe as a managed Delta table
df.write.format("delta").mode("overwrite").saveAsTable("default.emp_data")

bigdata_02.PNG
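To confirm the table was created in Delta format, you can describe it. A quick check using standard Delta Lake SQL; the output includes the table's format, location, and file count:

// verify that default.emp_data is a Delta table and see where its files live
display(spark.sql("DESCRIBE DETAIL default.emp_data"))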

Step 3: Creation of shallow clone table

Here we create a delta_shallow_clone table using the shallow clone operation. A shallow clone copies only the metadata (the Delta transaction log) of the source table to the cloned table, here "delta_shallow_clone"; it keeps referencing the source table's data files instead of copying them. This makes the clone cheap to create, but if the referenced data files are removed from the source, for example by running VACUUM on it, queries against the shallow clone can no longer read that data.

spark.sql("""CREATE TABLE if not exists default.delta_shallow_clone SHALLOW CLONE default.emp_data""") spark.sql("select * from default.delta_shallow_clone").show(truncate = false)

bigdata_03.PNG

By using desc formatted db.tablename, we get the details of the table. Here I used it to find the storage location of the "delta_shallow_clone" table. After that, I used the Databricks file system utility (dbutils.fs.ls) to list the files in that folder. The folder contains only the _delta_log directory, that is, only the metadata of the source table.

//spark.sql("desc formatted default.delta_shallow_clone") display(dbutils.fs.ls("/user/hive/warehouse/delta_shallow_clone"))

bigdata_04.PNG
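You can also inspect the clone's own transaction log to see how it was created. A short check against the table created above; DESCRIBE HISTORY records a CLONE operation for the cloned table:

// the history of the shallow clone shows the CLONE operation it was created with
display(spark.sql("DESCRIBE HISTORY default.delta_shallow_clone"))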

Step 4: Creation of deep clone table

Here we create a delta_deep_clone table using the deep clone operation. A deep clone copies both the metadata and the data of the source table to the cloned table, here "delta_deep_clone." Because the data files are physically copied, the deep-cloned table keeps its data even if the source table is truncated.

spark.sql("""CREATE TABLE if not exists default.delta_deep_clone CLONE default.emp_data""") spark.sql("select * from default.delta_deep_clone").show(truncate = false)

bigdata_05.PNG

By using desc formatted db.tablename, we get the details of the table. Here I used it to find the storage location of the "delta_deep_clone" table. After that, I used the Databricks file system utility (dbutils.fs.ls) to list the files in that folder. This folder contains both the _delta_log directory and copies of the source table's data files.

//spark.sql("desc formatted default.delta_deep_clone") display(dbutils.fs.ls("/user/hive/warehouse/delta_deep_clone"))

bigdata_06.PNG
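To see the contrast with the shallow clone, you can compare this listing with the source table's folder. A quick comparison, assuming default.emp_data was created as a managed table under the default Hive warehouse path (adjust the path if your workspace stores it elsewhere):

// unlike the shallow clone folder from Step 3, the deep clone folder holds its own data files
display(dbutils.fs.ls("/user/hive/warehouse/emp_data"))
display(dbutils.fs.ls("/user/hive/warehouse/delta_deep_clone"))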

Conclusion

In this recipe, we learned about the two clone mechanisms, shallow clone and deep clone, and the difference between them: a shallow clone copies only the metadata of the source table, whereas a deep clone copies both the metadata and the data.

