How to use lit() and typedLit() functions to add constant columns in a dataframe in Databricks

This recipe helps you use lit() and typedLit() functions to add constant columns in a dataframe in Databricks

Recipe Objective - How to use lit() and typedLit() functions to add constant columns in a dataframe in Databricks?

A Delta Lake table, also referred to as a Delta table, serves as both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all work out of the box. Delta Lake lets you specify and enforce a schema, which helps ensure that data types are correct and required columns are present, preventing bad data from causing corruption in both the data lake and the Delta table. Delta can write batch and streaming data into the same table, allowing a simpler architecture and a quicker path from data ingestion to query results. Delta can also infer the schema of incoming data, which further reduces the effort required to manage schema changes.

The Spark SQL functions lit() and typedLit() add a new constant column to a DataFrame by assigning it a literal (constant) value. Both functions are available by importing the "org.apache.spark.sql.functions" package, and both return a Column.
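To see the difference at a glance: lit() wraps a single scalar value as a Column, while typedLit() also handles Scala collection types such as Seq, Map, and tuples. The following is only a minimal sketch, separate from the recipe below; the DataFrame and column names are illustrative assumptions.

// Minimal sketch (illustrative names): lit() for scalar constants, typedLit() for collection literals
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, typedLit}

val spark = SparkSession.builder().appName("lit-vs-typedLit").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("x", "y").toDF("id")
df.withColumn("flag", lit(1)) // constant integer column
  .withColumn("tags", typedLit(Seq("a", "b"))) // constant array<string> column
  .withColumn("lookup", typedLit(Map("k" -> 1))) // constant map<string,int> column
  .show(false)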

System Requirements

  • Scala (2.12 version)
  • Apache Spark (3.1.1 version)

This recipe explains Delta Lake and demonstrates how to use the lit() and typedLit() functions to add a constant column to a dataframe in Spark.

Implementing lit() and typedLit() functions in Spark

// Importing packages
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.functions._

The SparkSession, IntegerType, and Spark SQL functions are imported into the environment to perform the lit() and typedLit() operations in Databricks.

// Implementing lit() and typedLit() functions
object litTypeLit extends App {
  val spark = SparkSession.builder()
    .appName("Spark SQL lit() and typedLit()")
    .master("local")
    .getOrCreate()
  import spark.sqlContext.implicits._

  val sample_data = Seq(("222", 60000), ("333", 70000), ("444", 50000))
  val dataframe = sample_data.toDF("EmpId", "Salary")

  // lit() adds a constant column with the literal value "2"
  val dataframe2 = dataframe.select(col("EmpId"), col("Salary"), lit("2").as("lit_funcvalue1"))

  // lit() combined with when()/otherwise() derives a new column based on a condition
  val dataframe3 = dataframe2.withColumn("lit_funcvalue2",
    when(col("Salary") >= 50000 && col("Salary") <= 60000, lit("200").cast(IntegerType))
      .otherwise(lit("300").cast(IntegerType)))

  // typedLit() adds columns with collection literals: a Seq, a Map and a struct (tuple)
  val dataframe4 = dataframe3.withColumn("typedLit_seq", typedLit(Seq(1, 2, 3)))
    .withColumn("typedLit_map", typedLit(Map("a" -> 2, "b" -> 1)))
    .withColumn("typedLit_struct", typedLit(("a", 1, 2.0)))

  dataframe4.printSchema()
  dataframe4.show()
}

The litTypeLit object is defined to perform the functions. The "sample_data" value holds the sample data defined by the user, and the "dataframe" value converts that data into a DataFrame. The "dataframe2" value creates a new column with a constant value using the lit() function. The "dataframe3" value uses the Spark SQL lit() function inside withColumn to derive a new column based on a condition. The "dataframe4" value creates new columns with collections using the Spark SQL typedLit() function, adding the collection literal Seq(1, 2, 3), the map Map("a" -> 2, "b" -> 1), and the struct ("a", 1, 2.0) to the Spark DataFrame.
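If the code runs as expected, the printed schema should look roughly like the following (nullability flags omitted; exact output varies slightly by Spark version):

root
 |-- EmpId: string
 |-- Salary: integer
 |-- lit_funcvalue1: string
 |-- lit_funcvalue2: integer
 |-- typedLit_seq: array (element: integer)
 |-- typedLit_map: map (key: string, value: integer)
 |-- typedLit_struct: struct (_1: string, _2: integer, _3: double)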

Relevant Projects

EMR Serverless Example to Build a Search Engine for COVID19
In this AWS Project, create a search engine using the BM25 TF-IDF Algorithm that uses EMR Serverless for ad-hoc processing of a large amount of unstructured textual data.

Learn How to Implement SCD in Talend to Capture Data Changes
In this Talend Project, you will build an ETL pipeline in Talend to capture data changes using SCD techniques.

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project - Get a handle on using Python with Spark through this hands-on Spark and Python data processing tutorial.

AWS Project for Batch Processing with PySpark on AWS EMR
In this AWS Project, you will learn how to perform batch processing on Wikipedia data with PySpark on AWS EMR.

Streaming Data Pipeline using Spark, HBase and Phoenix
Build a Real-Time Streaming Data Pipeline for an application that monitors oil wells using Apache Spark, HBase and Apache Phoenix.

Migration of MySQL Databases to Cloud AWS using AWS DMS
IoT-based Data Migration Project using AWS DMS and Aurora Postgres aims to migrate real-time IoT-based data from a MySQL database to the AWS cloud.

Build a Data Pipeline with Azure Synapse and Spark Pool
In this Azure Project, you will learn to build a Data Pipeline in Azure using Azure Synapse Analytics, Azure Storage, Azure Synapse Spark Pool to perform data transformations on an Airline dataset and visualize the results in Power BI.

Build an ETL Pipeline on EMR using AWS CDK and Power BI
In this ETL Project, you will learn to build an ETL Pipeline on Amazon EMR with AWS CDK and Apache Hive. You'll deploy the pipeline using S3, Cloud9, and EMR, and then use Power BI to create dynamic visualizations of your transformed data.

Python and MongoDB Project for Beginners with Source Code-Part 2
In this Python and MongoDB Project for Beginners, you will learn how to use Apache Sedona and perform advanced analysis on the Transportation dataset.

Big Data Project for Solving Small File Problem in Hadoop Spark
This big data project focuses on solving the small file problem to optimize data processing efficiency by leveraging Apache Hadoop and Spark within AWS EMR, implementing and demonstrating effective techniques for handling large numbers of small files.