Explain Kurtosis, Min, Max, And Mean Aggregate Functions In PySpark

In this recipe, you will learn what the kurtosis, min, max, and mean aggregate functions are in PySpark on Databricks and how to implement them in PySpark.

 

Objective For ‘Kurtosis(), min(), max(), And mean() Aggregate Functions in PySpark in Databricks’

This step-by-step recipe will explain the kurtosis(), min(), max(), and mean() aggregate functions in PySpark and their implementation using Databricks.


PySpark Kurtosis, Min, Max, and Mean Aggregate Functions

Aggregate functions in Apache PySpark accept input as a Column type or a column name given as a string, take additional arguments depending on the function, and return a Column type. Aggregate functions operate on a group of rows and calculate a single return value for every group. The PySpark SQL aggregate functions are grouped under "agg_funcs" in PySpark.

The PySpark kurtosis() function calculates the kurtosis of a column in a PySpark DataFrame, which measures how heavy the tails of the distribution are, i.e., how prone the data is to outliers or extreme values. A higher kurtosis value indicates heavier tails with more outliers, while a lower value indicates a flatter distribution with lighter tails.

The PySpark min and max functions find the minimum and maximum values of a given dataset, respectively. You can easily find the PySpark min and max of a column, or of multiple columns, of a PySpark DataFrame or RDD (Resilient Distributed Dataset). Note that importing min and max from pyspark.sql.functions shadows Python's built-in min() and max() in the current namespace.

You can find the PySpark min of a column as follows-

from pyspark.sql.functions import min

min_value = dataframe_name.select(min("column_name")).collect()[0][0]

You can find the PySpark max value of a column as follows-

from pyspark.sql.functions import max

max_value = dataframe_name.select(max("column_name")).collect()[0][0]

The PySpark mean function calculates the average value of a given dataset. It is implemented using the mean() function in PySpark, which takes a column as input and returns the mean value. The mean is the sum of all values in the dataset divided by the number of values.

You can find the PySpark mean of a column as follows-

from pyspark.sql.functions import mean

df.select(mean("column_name")).show()

Apart from min and max, PySpark provides two other useful functions, count() and groupBy(), for aggregating and summarizing data in a DataFrame. The count() aggregate function determines the number of rows in a DataFrame or the number of non-null values in a specific column. The groupBy() function groups the rows of a DataFrame based on one or more columns so that aggregate functions can be applied to the resulting groups.


System Requirements For PySpark Kurtosis and Aggregate Functions

  • Python (3.0 version)

  • Apache Spark (3.1.1 version)

Implementing The Kurtosis(), Min(), Max(), And Mean() Functions in PySpark

Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import kurtosis, min, max, mean


The SparkSession class and the kurtosis, min, max, and mean functions are imported into the environment to perform the kurtosis(), min(), max(), and mean() aggregate functions in PySpark.

Implementing the kurtosis(), min(), max() and mean() functions in Databricks in PySpark
spark = SparkSession.builder.appName('PySpark kurtosis(), min(), max() and mean()').getOrCreate()
Sample_Data = [("Rahul", "Technology", 8000),
("Prateek", "Finance", 7600),
("Ram", "Sales", 5100),
("Reetu", "Marketing", 4000),
("Himesh", "Sales", 2000),
("Shyam", "Finance", 3500),
("Harsh", "Finance", 4900),
("Ramesh", "Marketing", 4000),
("Raina", "Marketing", 3000),
("Ankit", "Sales", 5100)
]
Sample_schema = ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data = Sample_Data, schema = Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using kurtosis function
dataframe.select(kurtosis("salary")).show(truncate=False)
# Using max() function
dataframe.select(max("salary")).show(truncate=False)
# Using min() function
dataframe.select(min("salary")).show(truncate=False)
# Using mean() function
dataframe.select(mean("salary")).show(truncate=False)



The "dataframe" value is created with the Sample_Data and Sample_schema defined above. The kurtosis() function returns the kurtosis of the values present in the salary column. The min() function returns the minimum value present in the salary column, while the max() function returns the maximum value. The mean() function returns the average value present in the salary column.


FAQs

What is an aggregate function in PySpark?

An aggregate function in PySpark is a function that groups data from multiple rows into a single value. It performs operations like sum, count, average, maximum, and minimum on the data.

What do the first() and last() aggregate functions do in PySpark?

The first() and last() aggregate functions in PySpark retrieve the first and last values from a group of values, respectively. These functions operate on columns or expressions and are often used with grouping functions like groupBy() to retrieve each group's first and last values in a DataFrame.

 


