Explain the partitionBy() function in PySpark in Databricks

This recipe explains the partitionBy() function in PySpark in Databricks.

Recipe Objective - Explain the partitionBy() function in PySpark in Databricks

In PySpark, partitionBy() is a method of the "pyspark.sql.DataFrameWriter" class that splits a large dataset (DataFrame) into smaller files based on one or more columns while writing it to disk. Partitioning the data on the file system improves query performance when dealing with large datasets in a data lake, a centralized repository of structured, semi-structured, unstructured, and binary data that stores large amounts of data in its original raw format. A PySpark partition is a way to split a large dataset into smaller datasets based on one or more partition keys. When a DataFrame is created from a file or table, PySpark creates it with a certain number of partitions in memory, which is one of the main advantages of the PySpark DataFrame over the pandas DataFrame. Transformations on partitioned data also run faster, because they execute in parallel on each partition.
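
As a quick illustration of in-memory partitioning, the snippet below checks how many partitions PySpark assigns to a DataFrame. This is a minimal sketch: the application name is arbitrary, and the zipcodes.csv path is reused from the recipe's main example.

# Minimal sketch: inspect the in-memory partitions of a DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PartitionCount').getOrCreate()
df = spark.read.option("header", True).csv("/FileStore/tables/zipcodes.csv")
# Each in-memory partition is processed in parallel by a separate task
print(df.rdd.getNumPartitions())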

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the partitionBy() function is and demonstrates its usage in PySpark.


Implementing the partitionBy() function in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col
from pyspark.sql.types import MapType, StringType, StructType, StructField

SparkSession, Row, MapType, StringType, col, StructType, and StructField are imported into the environment to use the partitionBy() function in PySpark.

# Implementing the partitionBy() function in Databricks in PySpark
spark = SparkSession.builder.appName('partitionBy() PySpark').getOrCreate()
dataframe = spark.read.option("header", True) \
    .csv("/FileStore/tables/zipcodes.csv")
dataframe.printSchema()
# Using the partitionBy() function on a single column
dataframe.write.option("header", True) \
    .partitionBy("state") \
    .mode("overwrite") \
    .csv("/tmp/zipcodesState")
# Using the partitionBy() function on multiple columns
dataframe.write.option("header", True) \
    .partitionBy("state", "city") \
    .mode("overwrite") \
    .csv("/tmp/zipcodesState")
# Using partitionBy() with maxRecordsPerFile to limit the records per file
dataframe.write.option("header", True) \
    .option("maxRecordsPerFile", 2) \
    .partitionBy("state") \
    .mode("overwrite") \
    .csv("/tmp/zipcodesState")

The Spark Session is defined, and the DataFrame is created from the zipcodes.csv file. When the PySpark DataFrame is written to disk with partitionBy(), PySpark splits the records on the partition column and stores each partition's data in its own sub-directory, creating six directories in this case. Each sub-directory is named after the partition column and its value (partition column=value). When partitionBy() is given multiple columns, it creates a folder hierarchy for each partition: since state is listed first, followed by city, a city folder is created inside each state folder (one folder for each city in that state). Finally, the "maxRecordsPerFile" option controls the number of records per file, so multiple part files are created for each state, and each part file contains just two records.
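
To verify the written output, the partitioned directory can be read back and filtered on the partition column. The sketch below continues from the SparkSession defined above and assumes the /tmp/zipcodesState output; the "AL" state value is an assumption used for illustration.

# Sketch: read the partitioned output back; Spark discovers the
# state=... sub-directories and exposes "state" as a column again
df_partitioned = spark.read.option("header", True).csv("/tmp/zipcodesState")
# Filtering on the partition column reads only the matching sub-directories
df_partitioned.filter(df_partitioned.state == "AL").show()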
