What is PySpark Fillna and Fill Function in Databricks?

This recipe covers the PySpark Fillna and Fill function in Databricks - Learn how to handle missing data efficiently!

Recipe Objective - What is PySpark Fillna and Fill Function in Databricks? 

Apache Spark is a powerful open-source data processing framework that has gained immense popularity in the world of big data and analytics. PySpark, the Python API for Apache Spark, makes it easier for data scientists and engineers to work with large datasets and perform various data manipulation tasks. In Databricks, a cloud-based platform for big data analytics and machine learning, PySpark is widely used for data processing and analysis. One essential aspect of PySpark data manipulation is handling missing data, and this is where the fillna and fill functions come into play. 

Explore PySpark Machine Learning Tutorial to take your PySpark skills to the next level!

‘fillna’ Function in PySpark

The fillna function in PySpark is a versatile tool for dealing with missing values in a DataFrame. It allows you to replace or fill in null values with specified values. Here's a breakdown of how to use the fillna function in Databricks: 

Basic Syntax

The basic syntax for using fillna is as follows:

DataFrame.fillna(value, subset=None)

  • value: The value to replace nulls with. This can be a scalar value (e.g., a string, integer, or float) or a dictionary specifying replacement values for specific columns.

  • subset (optional): A list of column names to restrict the replacement to. If not specified, the operation is applied to all columns in the DataFrame.

Examples 

  1. Replace Null with Empty String

df.fillna('')

This will replace all null values in the DataFrame df with empty strings.

  1. Replace Null with 0 in a Specific Column

df.fillna({'column_name': 0})

This will replace null values in the specified column with the value 0.

  1. Replace Null with Different Values in Multiple Columns

df.fillna({'col1': 'A', 'col2': 999})

This will replace null values in 'col1' with 'A' and in 'col2' with 999.

‘fill’ Function in PySpark

The fill function is another method in PySpark for filling missing or null values in a DataFrame. It's similar to fillna, but there are some differences to note.

Basic Syntax

The basic syntax for using fill is as follows:

DataFrame.fill(value, *cols)

  • value: The value to replace nulls with, which should be a scalar value.

  • *cols: A variable-length argument list of column names to restrict the replacement to.

Examples

  1. Replace Null with 0 in a Specific Column

df.fill(0, 'column_name')

This will replace null values in the specified column with the value 0.

  1. Replace Null with 0 in Multiple Columns

df.fill(0, 'col1', 'col2')

This will replace null values in 'col1' and 'col2' with 0. 

End-to-End Big Data Project to Learn PySpark SQL Functions

Key Differences Between ‘Fill’ and ‘Fillna’ Function in PySpark 

The key differences between the fillna and fill functions in PySpark lie in their flexibility and the way they handle missing values. fillna offers more versatility by allowing you to specify replacement values as a dictionary, making it suitable for handling different values in multiple columns. It also provides the option to use an optional subset parameter to target specific columns for replacement. On the other hand, the fill function is simpler and more concise, expecting a single scalar value and accepting a variable-length argument list of columns for replacement. While fill is suitable for straightforward and uniform replacements, fillna excels in more complex scenarios, offering greater customization and control over how null values are handled in a DataFrame. 

How to Implement PySpark Fillna() and Fill() Function in Databricks? 

System Requirements

  • Python (3.0 version)

  • Apache Spark (3.1.1 version)

Explore SQL Database Projects to Add them to Your Data Engineer Resume.

Implementing the fillna() function and fill() function in Databricks in PySpark 

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import MapType, StringType
from pyspark.sql.functions import col
from pyspark.sql.types import StructType,StructField, StringType


Databricks-1

The Sparksession, Row, MapType, StringType, col, explode, StructType, StructField, StringType are imported in the environment so as to use fillna() function and fill() function in PySpark .

Learn to Transform your data pipeline with Azure Data Factory!

# Implementing the fillna() function and fill() function in Databricks in PySpark
spark = SparkSession.builder \
.master("local[1]") \
.appName("fillna() and fill() PySpark") \
.getOrCreate()
file_Path = "/FileStore/tables/smallzipcode.csv"
dataframe = spark.read.options(header='true', inferSchema='true') \
.csv(file_Path)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using fillna() and fill() functions
dataframe.fillna(value=0).show()
dataframe.fillna(value=0,subset=["population"]).show()
dataframe.na.fill(value=0).show()
dataframe.na.fill(value=0,subset=["population"]).show()
# Replace Null/None value with empty string
dataframe.fillna(value="").show()
dataframe.na.fill(value="").show()
# Replacing column type with the empty string
# And column city with value “unknown”.
dataframe.fillna("unknown",["city"]) \
.fillna("",["type"]).show()


Databricks-2Databricks-3

PySpark ETL Project-Build a Data Pipeline using S3 and MySQLDatabricks-4
Databricks-5
Databricks-6
Databricks-7

The Spark Session is defined. The "data frame" is defined using the smallzipcode.csv file. Further, the fill(value: Long) signatures are used to replace the NULL/None values with numeric values either zero(0) or any constant value for all the integer and long datatype columns of PySpark DataFrame or the Dataset so, it yields the output as it has just an integer column population with null values. While replacing Null/None values with the Empty String replaces all String type columns with empty/blank string for all the NULL values. Also, while replacing the column type with an empty string and column city with the value “unknown” yields the replacement of null values with an empty string for the type column and replaces it with a constant value “unknown” for the city column.

Master PySpark Functions with ProjectPro! 

Learning how to use and implement PySpark functions like Fillna and Fill is essential for anyone looking to excel in the world of big data analytics. Practical experience gained through real-world projects is the key to becoming proficient in PySpark and other data science tools. ProjectPro offers a valuable opportunity to enhance your skills with its extensive repository of over 270+ projects, making it the ideal platform for those seeking to harness the power of PySpark and other data technologies in their professional journey. So, start your learning journey with ProjectPro and unlock the full potential of PySpark and its functions.

What Users are saying..

profile image

Jingwei Li

Graduate Research assistance at Stony Brook University
linkedin profile url

ProjectPro is an awesome platform that helps me learn much hands-on industrial experience with a step-by-step walkthrough of projects. There are two primary paths to learn: Data Science and Big Data.... Read More

Relevant Projects

Graph Database Modelling using AWS Neptune and Gremlin
In this data analytics project, you will use AWS Neptune graph database and Gremlin query language to analyse various performance metrics of flights.

Build an ETL Pipeline with DBT, Snowflake and Airflow
Data Engineering Project to Build an ETL pipeline using technologies like dbt, Snowflake, and Airflow, ensuring seamless data extraction, transformation, and loading, with efficient monitoring through Slack and email notifications via SNS

GCP Data Ingestion with SQL using Google Cloud Dataflow
In this GCP Project, you will learn to build a data processing pipeline With Apache Beam, Dataflow & BigQuery on GCP using Yelp Dataset.

PySpark Project-Build a Data Pipeline using Hive and Cassandra
In this PySpark ETL Project, you will learn to build a data pipeline and perform ETL operations by integrating PySpark with Hive and Cassandra

Real-time Auto Tracking with Spark-Redis
Spark Project - Discuss real-time monitoring of taxis in a city. The real-time data streaming will be simulated using Flume. The ingestion will be done using Spark Streaming.

Snowflake Real Time Data Warehouse Project for Beginners-1
In this Snowflake Data Warehousing Project, you will learn to implement the Snowflake architecture and build a data warehouse in the cloud to deliver business value.

Databricks Data Lineage and Replication Management
Databricks Project on data lineage and replication management to help you optimize your data management practices | ProjectPro

SQL Project for Data Analysis using Oracle Database-Part 3
In this SQL Project for Data Analysis, you will learn to efficiently write sub-queries and analyse data using various SQL functions and operators.

Python and MongoDB Project for Beginners with Source Code-Part 2
In this Python and MongoDB Project for Beginners, you will learn how to use Apache Sedona and perform advanced analysis on the Transportation dataset.

SQL Project for Data Analysis using Oracle Database-Part 5
In this SQL Project for Data Analysis, you will learn to analyse data using various SQL functions like ROW_NUMBER, RANK, DENSE_RANK, SUBSTR, INSTR, COALESCE and NVL.