Explain the partitionBy() function in PySpark in Databricks

This recipe explains what the partitionBy() function in PySpark in Databricks is and how to use it.

Recipe Objective - Explain the partitionBy() function in PySpark in Databricks

In PySpark, partitionBy() is a method of the "pyspark.sql.DataFrameWriter" class that partitions a large dataset (DataFrame) into smaller files based on one or more columns while writing to disk. Partitioning the data on the file system is a way to improve query performance when dealing with large datasets in a data lake; a data lake is a centralized repository of structured, semi-structured, unstructured, and binary data that allows storing large amounts of data in its original raw format. A PySpark partition is a way to split a large dataset into smaller datasets based on one or more partition keys. When a DataFrame is created from a file or table, PySpark creates it with a certain number of partitions in memory, which is one of the main advantages of the PySpark DataFrame over the Pandas DataFrame. Further, transformations on partitioned data run faster because they are executed in parallel for each partition.
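
For illustration, the sketch below (using a hypothetical "sales" DataFrame and the made-up output path "/tmp/sales_by_country", not this recipe's dataset) shows how partitionBy() lays out one sub-directory per distinct value of the partition column:

# A minimal sketch of partitionBy(); the "sales" data and output path are
# hypothetical examples, not part of this recipe's zipcodes dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitionBy sketch").getOrCreate()

sales = spark.createDataFrame(
    [("US", "2024-01-01", 100.0), ("US", "2024-01-02", 80.0), ("DE", "2024-01-01", 55.0)],
    ["country", "order_date", "amount"])

# Writing with partitionBy("country") creates one sub-directory per distinct value:
#   /tmp/sales_by_country/country=US/part-....csv
#   /tmp/sales_by_country/country=DE/part-....csv
sales.write.option("header", True) \
    .partitionBy("country") \
    .mode("overwrite") \
    .csv("/tmp/sales_by_country")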

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the partitionBy() function is and demonstrates its usage in PySpark.


Implementing the partitionBy() function in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession

The SparkSession class is imported into the environment to build the Spark session that reads the data and exposes the partitionBy() function of the DataFrameWriter in PySpark.

# Implementing the partitionBy() function in Databricks in PySpark
spark = SparkSession.builder.appName('partitionBy() PySpark').getOrCreate()
dataframe = spark.read.option("header", True) \
    .csv("/FileStore/tables/zipcodes.csv")
dataframe.printSchema()
# Using the partitionBy() function on a single column
dataframe.write.option("header", True) \
    .partitionBy("state") \
    .mode("overwrite") \
    .csv("/tmp/zipcodesState")
# Using the partitionBy() function on multiple columns
dataframe.write.option("header", True) \
    .partitionBy("state", "city") \
    .mode("overwrite") \
    .csv("/tmp/zipcodesState")
# Using the partitionBy() function with maxRecordsPerFile to control the number of records per file
dataframe.write.option("header", True) \
    .option("maxRecordsPerFile", 2) \
    .partitionBy("state") \
    .mode("overwrite") \
    .csv("/tmp/zipcodesState")

The Spark session is created, and the "dataframe" is defined by reading the zipcodes.csv file. When the PySpark DataFrame is written to disk by calling partitionBy(), PySpark splits the records based on the partition column and stores the data of each partition in its own sub-directory, which creates 6 directories in this example. Each sub-directory is named after the partition column and its value (partition column=value). When partitionBy() is called with multiple columns, it creates a folder hierarchy for each partition: the first partition column is state, followed by city, so a city folder is created inside each state folder (one folder for each city in that state). The "maxRecordsPerFile" option controls the number of records per file, so multiple part files are created for each state and each part file contains just 2 records.
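
As a follow-up, the sketch below reads the partitioned output back, assuming the partitionBy("state") output above is still present at "/tmp/zipcodesState"; the filter value "PR" is a hypothetical state used only for illustration. Because the partition column is encoded in the directory names, Spark recovers it as a regular column on read, and filtering on it lets Spark scan only the matching state=... sub-directories (partition pruning):

# Reading the partitioned output back (a sketch; assumes the partitionBy("state")
# output above exists at /tmp/zipcodesState).
from pyspark.sql.functions import col

partitioned_df = spark.read.option("header", True) \
    .csv("/tmp/zipcodesState")
# The partition column "state" is recovered from the directory names on read.
partitioned_df.printSchema()
# Filtering on the partition column prunes the scan to the matching
# state=... sub-directories; "PR" is a hypothetical example value.
partitioned_df.where(col("state") == "PR").show(5)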
