Explain the partitionBy() function in PySpark in Databricks

This recipe explains the partitionBy() function in PySpark in Databricks.

Recipe Objective - Explain the partitionBy() function in PySpark in Databricks

In PySpark, partitionBy() is a method of the "pyspark.sql.DataFrameWriter" class that splits a large dataset (DataFrame) into smaller files based on one or more columns while writing it to disk. Partitioning the data on the file system improves query performance when dealing with large datasets in a data lake, a centralized repository of structured, semi-structured, unstructured, and binary data that stores large amounts of data in its original raw format. A PySpark partition is a way to split a large dataset into smaller datasets based on one or more partition keys. When a DataFrame is created from a file or table, PySpark creates it with a certain number of partitions in memory, which is one of the main advantages of the PySpark DataFrame over the pandas DataFrame. Transformations on partitioned data also run faster, because they execute in parallel on each partition.
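
As a quick illustration of in-memory partitioning, the snippet below checks how many partitions PySpark assigns to a DataFrame. This is a minimal sketch: the application name is arbitrary, and the zipcodes.csv path is reused from the recipe's main example.

# Minimal sketch: inspect the in-memory partitions of a DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PartitionCount').getOrCreate()
df = spark.read.option("header", True).csv("/FileStore/tables/zipcodes.csv")
# Each in-memory partition is processed in parallel by a separate task
print(df.rdd.getNumPartitions())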

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the partitionBy() function is and demonstrates its usage in PySpark.


Implementing the partitionBy() function in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col
from pyspark.sql.types import MapType, StringType, StructType, StructField

SparkSession, Row, MapType, StringType, col, StructType, and StructField are imported into the environment to use the partitionBy() function in PySpark.

# Implementing the partitionBy() function in Databricks in PySpark
spark = SparkSession.builder.appName('partitionBy() PySpark').getOrCreate()
dataframe = spark.read.option("header", True) \
    .csv("/FileStore/tables/zipcodes.csv")
dataframe.printSchema()
# Using the partitionBy() function on a single column
dataframe.write.option("header", True) \
    .partitionBy("state") \
    .mode("overwrite") \
    .csv("/tmp/zipcodesState")
# Using the partitionBy() function on multiple columns
dataframe.write.option("header", True) \
    .partitionBy("state", "city") \
    .mode("overwrite") \
    .csv("/tmp/zipcodesState")
# Using partitionBy() with maxRecordsPerFile to limit the records per file
dataframe.write.option("header", True) \
    .option("maxRecordsPerFile", 2) \
    .partitionBy("state") \
    .mode("overwrite") \
    .csv("/tmp/zipcodesState")

The Spark Session is defined, and the DataFrame is created from the zipcodes.csv file. When the PySpark DataFrame is written to disk with partitionBy(), PySpark splits the records on the partition column and stores each partition's data in its own sub-directory, creating six directories in this case. Each sub-directory is named after the partition column and its value (partition column=value). When partitionBy() is given multiple columns, it creates a folder hierarchy for each partition: since state is listed first, followed by city, a city folder is created inside each state folder (one folder for each city in that state). Finally, the "maxRecordsPerFile" option controls the number of records per file, so multiple part files are created for each state, and each part file contains just two records.
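
To verify the written output, the partitioned directory can be read back and filtered on the partition column. The sketch below continues from the SparkSession defined above and assumes the /tmp/zipcodesState output; the "AL" state value is an assumption used for illustration.

# Sketch: read the partitioned output back; Spark discovers the
# state=... sub-directories and exposes "state" as a column again
df_partitioned = spark.read.option("header", True).csv("/tmp/zipcodesState")
# Filtering on the partition column reads only the matching sub-directories
df_partitioned.filter(df_partitioned.state == "AL").show()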
