How to use lit() and typedLit() functions to add constant columns in a dataframe in Databricks

This recipe helps you use lit() and typedLit() functions to add constant columns in a dataframe in Databricks

Recipe Objective - How to use lit() and typedLit() functions to add constant columns in a dataframe in Databricks?

A Delta Lake table, also referred to as a Delta table, serves as both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all work out of the box. Delta Lake lets you specify and enforce a schema, which helps ensure that data types are correct and required columns are present, preventing bad data from causing corruption in both the data lake and the Delta table. Delta can write batch and streaming data into the same table, allowing a simpler architecture and a quicker path from data ingestion to query results. Delta can also infer the schema of incoming data, which further reduces the effort required to manage schema changes.

The Spark SQL functions lit() and typedLit() add a new constant column to a DataFrame by assigning it a literal (constant) value. Both functions are available by importing the "org.apache.spark.sql.functions" package, and both return a Column.
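To see the difference at a glance: lit() wraps a single scalar value as a Column, while typedLit() also handles Scala collection types such as Seq, Map, and tuples. The following is only a minimal sketch, separate from the recipe below; the DataFrame and column names are illustrative assumptions.

// Minimal sketch (illustrative names): lit() for scalar constants, typedLit() for collection literals
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, typedLit}

val spark = SparkSession.builder().appName("lit-vs-typedLit").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("x", "y").toDF("id")
df.withColumn("flag", lit(1)) // constant integer column
  .withColumn("tags", typedLit(Seq("a", "b"))) // constant array<string> column
  .withColumn("lookup", typedLit(Map("k" -> 1))) // constant map<string,int> column
  .show(false)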

System Requirements

  • Scala (2.12 version)
  • Apache Spark (3.1.1 version)

This recipe explains Delta Lake and demonstrates how to use the lit() and typedLit() functions to add a constant column to a dataframe in Spark.

Implementing lit() and typedLit() functions in Spark

// Importing packages
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.functions._

The SparkSession, IntegerType, and Spark SQL functions are imported into the environment to perform the lit() and typedLit() operations in Databricks.

// Implementing lit() and typedLit() functions
object litTypeLit extends App {
  val spark = SparkSession.builder()
    .appName("Spark SQL lit() and typedLit()")
    .master("local")
    .getOrCreate()
  import spark.sqlContext.implicits._

  val sample_data = Seq(("222", 60000), ("333", 70000), ("444", 50000))
  val dataframe = sample_data.toDF("EmpId", "Salary")

  // lit() adds a constant column with the literal value "2"
  val dataframe2 = dataframe.select(col("EmpId"), col("Salary"), lit("2").as("lit_funcvalue1"))

  // lit() combined with when()/otherwise() derives a new column based on a condition
  val dataframe3 = dataframe2.withColumn("lit_funcvalue2",
    when(col("Salary") >= 50000 && col("Salary") <= 60000, lit("200").cast(IntegerType))
      .otherwise(lit("300").cast(IntegerType)))

  // typedLit() adds columns with collection literals: a Seq, a Map and a struct (tuple)
  val dataframe4 = dataframe3.withColumn("typedLit_seq", typedLit(Seq(1, 2, 3)))
    .withColumn("typedLit_map", typedLit(Map("a" -> 2, "b" -> 1)))
    .withColumn("typedLit_struct", typedLit(("a", 1, 2.0)))

  dataframe4.printSchema()
  dataframe4.show()
}

The litTypeLit object is defined to perform the functions. The "sample_data" value holds the sample data defined by the user, and the "dataframe" value converts that data into a DataFrame. The "dataframe2" value creates a new column with a constant value using the lit() function. The "dataframe3" value uses the Spark SQL lit() function inside withColumn to derive a new column based on a condition. The "dataframe4" value creates new columns with collections using the Spark SQL typedLit() function, adding the collection literal Seq(1, 2, 3), the map Map("a" -> 2, "b" -> 1), and the struct ("a", 1, 2.0) to the Spark DataFrame.
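If the code runs as expected, the printed schema should look roughly like the following (nullability flags omitted; exact output varies slightly by Spark version):

root
 |-- EmpId: string
 |-- Salary: integer
 |-- lit_funcvalue1: string
 |-- lit_funcvalue2: integer
 |-- typedLit_seq: array (element: integer)
 |-- typedLit_map: map (key: string, value: integer)
 |-- typedLit_struct: struct (_1: string, _2: integer, _3: double)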

Relevant Projects

EMR Serverless Example to Build a Search Engine for COVID19
In this AWS Project, create a search engine using the BM25 TF-IDF Algorithm that uses EMR Serverless for ad-hoc processing of a large amount of unstructured textual data.

Learn How to Implement SCD in Talend to Capture Data Changes
In this Talend Project, you will build an ETL pipeline in Talend to capture data changes using SCD techniques.

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project - Get a handle on using Python with Spark through this hands-on Spark and Python data processing tutorial.

AWS Project for Batch Processing with PySpark on AWS EMR
In this AWS Project, you will learn how to perform batch processing on Wikipedia data with PySpark on AWS EMR.

Streaming Data Pipeline using Spark, HBase and Phoenix
Build a Real-Time Streaming Data Pipeline for an application that monitors oil wells using Apache Spark, HBase and Apache Phoenix.

Migration of MySQL Databases to Cloud AWS using AWS DMS
IoT-based Data Migration Project using AWS DMS and Aurora Postgres aims to migrate real-time IoT-based data from a MySQL database to the AWS cloud.

Build a Data Pipeline with Azure Synapse and Spark Pool
In this Azure Project, you will learn to build a Data Pipeline in Azure using Azure Synapse Analytics, Azure Storage, Azure Synapse Spark Pool to perform data transformations on an Airline dataset and visualize the results in Power BI.

Build an ETL Pipeline on EMR using AWS CDK and Power BI
In this ETL Project, you will learn to build an ETL Pipeline on Amazon EMR with AWS CDK and Apache Hive. You'll deploy the pipeline using S3, Cloud9, and EMR, and then use Power BI to create dynamic visualizations of your transformed data.

Python and MongoDB Project for Beginners with Source Code-Part 2
In this Python and MongoDB Project for Beginners, you will learn how to use Apache Sedona and perform advanced analysis on the Transportation dataset.

Big Data Project for Solving Small File Problem in Hadoop Spark
This big data project focuses on solving the small file problem to optimize data processing efficiency by leveraging Apache Hadoop and Spark within AWS EMR, implementing and demonstrating effective techniques for handling large numbers of small files.