Explain the partitionBy() function in PySpark in Databricks

This recipe explains what the partitionBy() function in PySpark in Databricks is and how to use it.

Recipe Objective - Explain the partitionBy() function in PySpark in Databricks

In PySpark, partitionBy() is a method of the "pyspark.sql.DataFrameWriter" class that partitions a large dataset (DataFrame) into smaller files based on one or more columns while writing to disk. Partitioning the data on the file system is a way to improve query performance when dealing with large datasets in a data lake; a data lake is a centralized repository of structured, semi-structured, unstructured, and binary data that allows storing large amounts of data in its original raw format. A PySpark partition is a way to split a large dataset into smaller datasets based on one or more partition keys. When a DataFrame is created from a file or table, PySpark creates it with a certain number of partitions in memory, which is one of the main advantages of the PySpark DataFrame over the Pandas DataFrame. Further, transformations on partitioned data run faster because they are executed in parallel for each partition.
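
For illustration, the sketch below (using a hypothetical "sales" DataFrame and the made-up output path "/tmp/sales_by_country", not this recipe's dataset) shows how partitionBy() lays out one sub-directory per distinct value of the partition column:

# A minimal sketch of partitionBy(); the "sales" data and output path are
# hypothetical examples, not part of this recipe's zipcodes dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitionBy sketch").getOrCreate()

sales = spark.createDataFrame(
    [("US", "2024-01-01", 100.0), ("US", "2024-01-02", 80.0), ("DE", "2024-01-01", 55.0)],
    ["country", "order_date", "amount"])

# Writing with partitionBy("country") creates one sub-directory per distinct value:
#   /tmp/sales_by_country/country=US/part-....csv
#   /tmp/sales_by_country/country=DE/part-....csv
sales.write.option("header", True) \
    .partitionBy("country") \
    .mode("overwrite") \
    .csv("/tmp/sales_by_country")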

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the partitionBy() function is and demonstrates its usage in PySpark.


Implementing the partitionBy() function in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession

The SparkSession class is imported into the environment to build the Spark session that reads the data and exposes the partitionBy() function of the DataFrameWriter in PySpark.

# Implementing the partitionBy() function in Databricks in PySpark
spark = SparkSession.builder.appName('partitionBy() PySpark').getOrCreate()
dataframe = spark.read.option("header", True) \
    .csv("/FileStore/tables/zipcodes.csv")
dataframe.printSchema()
# Using the partitionBy() function on a single column
dataframe.write.option("header", True) \
    .partitionBy("state") \
    .mode("overwrite") \
    .csv("/tmp/zipcodesState")
# Using the partitionBy() function on multiple columns
dataframe.write.option("header", True) \
    .partitionBy("state", "city") \
    .mode("overwrite") \
    .csv("/tmp/zipcodesState")
# Using the partitionBy() function with maxRecordsPerFile to control the number of records per file
dataframe.write.option("header", True) \
    .option("maxRecordsPerFile", 2) \
    .partitionBy("state") \
    .mode("overwrite") \
    .csv("/tmp/zipcodesState")

The Spark session is created, and the "dataframe" is defined by reading the zipcodes.csv file. When the PySpark DataFrame is written to disk by calling partitionBy(), PySpark splits the records based on the partition column and stores the data of each partition in its own sub-directory, which creates 6 directories in this example. Each sub-directory is named after the partition column and its value (partition column=value). When partitionBy() is called with multiple columns, it creates a folder hierarchy for each partition: the first partition column is state, followed by city, so a city folder is created inside each state folder (one folder for each city in that state). The "maxRecordsPerFile" option controls the number of records per file, so multiple part files are created for each state and each part file contains just 2 records.
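
As a follow-up, the sketch below reads the partitioned output back, assuming the partitionBy("state") output above is still present at "/tmp/zipcodesState"; the filter value "PR" is a hypothetical state used only for illustration. Because the partition column is encoded in the directory names, Spark recovers it as a regular column on read, and filtering on it lets Spark scan only the matching state=... sub-directories (partition pruning):

# Reading the partitioned output back (a sketch; assumes the partitionBy("state")
# output above exists at /tmp/zipcodesState).
from pyspark.sql.functions import col

partitioned_df = spark.read.option("header", True) \
    .csv("/tmp/zipcodesState")
# The partition column "state" is recovered from the directory names on read.
partitioned_df.printSchema()
# Filtering on the partition column prunes the scan to the matching
# state=... sub-directories; "PR" is a hypothetical example value.
partitioned_df.where(col("state") == "PR").show(5)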
