Create a shallow and deep clone from a Delta table in Databricks

This recipe helps you create a shallow clone and a deep clone from a Delta table in Databricks.

Recipe Objective: How to create a shallow clone and deep clone from a Delta table in Databricks?

Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

In this recipe, we create shallow clone and deep clone tables from a Delta table. A shallow clone copies only the metadata of the source table into the cloned table, while a deep clone copies both the metadata and the data.
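For reference, both operations use the CREATE TABLE ... CLONE syntax in Spark SQL. A minimal sketch with placeholder table names (the actual tables for this recipe are created in the steps below):

// shallow clone: copies only the Delta transaction log (metadata) of the source table
spark.sql("CREATE TABLE IF NOT EXISTS target_db.shallow_copy SHALLOW CLONE source_db.source_table")

// deep clone: copies the transaction log and the data files of the source table (DEEP is the default)
spark.sql("CREATE TABLE IF NOT EXISTS target_db.deep_copy DEEP CLONE source_db.source_table")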


Implementation Info:

  1. Databricks Community Edition
  2. Spark-Scala
  3. Storage - Databricks File System (DBFS)

Step 1: Uploading data to DBFS

Follow the below steps to upload data files from local to DBFS:

  1. Click Create in the Databricks menu.
  2. Click Table in the drop-down menu; it opens a Create New Table UI.
  3. In the UI, specify the folder name in which you want to save your files.
  4. Click Browse to upload files from your local machine.
  5. The uploaded file path will look like /FileStore/tables/your folder name/your file.

Refer to the image below for an example.

bigdata_01.PNG
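After the upload, you can verify that the file landed in DBFS from a notebook cell. A quick check, assuming the file was saved directly under /FileStore/tables/ as sample_emp_data.txt (the path used in the next step):

// list the contents of the upload folder to confirm the file exists
display(dbutils.fs.ls("/FileStore/tables/"))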

Step 2: Creation of Delta Table

Here we create a Delta table, "emp_data", by reading the source file uploaded to DBFS. We use the StructType() function to impose a custom schema on the dataframe. Once the dataframe is created, we write the data into a Delta table as below.

import org.apache.spark.sql.types._

// custom schema for the employee data file
val schema = new StructType()
  .add("Id", IntegerType)
  .add("Name", StringType)
  .add("Department", StringType)
  .add("Salary", DoubleType)
  .add("Doj", TimestampType)
  .add("Date_Updated", DateType)

// read the uploaded CSV file with the custom schema
val df = spark.read.schema(schema).csv("/FileStore/tables/sample_emp_data.txt")
df.show()

// write the dataframe as a managed Delta table
df.write.format("delta").mode("overwrite").saveAsTable("default.emp_data")

bigdata_02.PNG
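To confirm the table was created in Delta format, you can describe it. A quick check using standard Delta Lake SQL; the output includes the table's format, location, and file count:

// verify that default.emp_data is a Delta table and see where its files live
display(spark.sql("DESCRIBE DETAIL default.emp_data"))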

Step 3: Creation of shallow clone table

Here we create a delta_shallow_clone table using the shallow clone operation. A shallow clone copies only the metadata (the Delta transaction log) of the source table to the cloned table, here "delta_shallow_clone"; it keeps referencing the source table's data files instead of copying them. This makes the clone cheap to create, but if the referenced data files are removed from the source, for example by running VACUUM on it, queries against the shallow clone can no longer read that data.

spark.sql("""CREATE TABLE if not exists default.delta_shallow_clone SHALLOW CLONE default.emp_data""") spark.sql("select * from default.delta_shallow_clone").show(truncate = false)

bigdata_03.PNG

By using desc formatted db.tablename, we get the details of the table. Here I used it to find the storage location of the "delta_shallow_clone" table. After that, I used the Databricks file system utility (dbutils.fs.ls) to list the files in that folder. The folder contains only the _delta_log directory, that is, only the metadata of the source table.

//spark.sql("desc formatted default.delta_shallow_clone") display(dbutils.fs.ls("/user/hive/warehouse/delta_shallow_clone"))

bigdata_04.PNG
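You can also inspect the clone's own transaction log to see how it was created. A short check against the table created above; DESCRIBE HISTORY records a CLONE operation for the cloned table:

// the history of the shallow clone shows the CLONE operation it was created with
display(spark.sql("DESCRIBE HISTORY default.delta_shallow_clone"))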

Step 4: Creation of deep clone table

Here we create a delta_deep_clone table using the deep clone operation. A deep clone copies both the metadata and the data of the source table to the cloned table, here "delta_deep_clone." Because the data files are physically copied, the deep-cloned table keeps its data even if the source table is truncated.

spark.sql("""CREATE TABLE if not exists default.delta_deep_clone CLONE default.emp_data""") spark.sql("select * from default.delta_deep_clone").show(truncate = false)

bigdata_05.PNG

By using desc formatted db.tablename, we get the details of the table. Here I used it to find the storage location of the "delta_deep_clone" table. After that, I used the Databricks file system utility (dbutils.fs.ls) to list the files in that folder. This folder contains both the _delta_log directory and copies of the source table's data files.

//spark.sql("desc formatted default.delta_deep_clone") display(dbutils.fs.ls("/user/hive/warehouse/delta_deep_clone"))

bigdata_06.PNG
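To see the contrast with the shallow clone, you can compare this listing with the source table's folder. A quick comparison, assuming default.emp_data was created as a managed table under the default Hive warehouse path (adjust the path if your workspace stores it elsewhere):

// unlike the shallow clone folder from Step 3, the deep clone folder holds its own data files
display(dbutils.fs.ls("/user/hive/warehouse/emp_data"))
display(dbutils.fs.ls("/user/hive/warehouse/delta_deep_clone"))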

Conclusion

In this recipe, we learned about the two clone mechanisms, shallow clone and deep clone, and the difference between them: a shallow clone copies only the metadata of the source table, whereas a deep clone copies both the metadata and the data.

