Explain Kurtosis, Min, Max, And Mean Aggregate Functions In PySpark

In this recipe, you will learn what the kurtosis, min, max, and mean aggregate functions are in PySpark on Databricks and how to implement them in PySpark.

 

Objective For ‘Kurtosis(), min(), max(), And mean() Aggregate Functions in PySpark in Databricks’

This step-by-step recipe will explain the kurtosis(), min(), max(), and mean() aggregate functions in PySpark and their implementation using Databricks.


PySpark Kurtosis, Min, Max, and Mean Aggregate Functions

Aggregate functions in Apache PySpark accept input as a Column type or a column name given as a string, take additional arguments depending on the function, and return a Column type. Aggregate functions operate on a group of rows and calculate a single return value for every group. The PySpark SQL aggregate functions are grouped under "agg_funcs" in PySpark.

The PySpark kurtosis() function calculates the kurtosis of a column in a PySpark DataFrame, which measures how heavy the tails of the distribution are, i.e., how prone the data is to outliers or extreme values. A higher kurtosis value indicates heavier tails with more outliers, while a lower value indicates a flatter distribution with lighter tails.

The PySpark min and max functions find the minimum and maximum values of a given dataset, respectively. You can easily find the PySpark min and max of a column, or of multiple columns, of a PySpark DataFrame or RDD (Resilient Distributed Dataset). Note that importing min and max from pyspark.sql.functions shadows Python's built-in min() and max() in the current namespace.

You can find the PySpark min of a column as follows-

from pyspark.sql.functions import min

min_value = dataframe_name.select(min("column_name")).collect()[0][0]

You can find the PySpark max value of a column as follows-

from pyspark.sql.functions import max

max_value = dataframe_name.select(max("column_name")).collect()[0][0]

The PySpark mean function calculates the average value of a given dataset. It is implemented using the mean() function in PySpark, which takes a column as input and returns the mean value. The mean is the sum of all values in the dataset divided by the number of values.

You can find the PySpark mean of a column as follows-

from pyspark.sql.functions import mean

df.select(mean("column_name")).show()

Apart from min and max, PySpark provides two other useful functions, count() and groupBy(), for aggregating and summarizing data in a DataFrame. The count() aggregate function determines the number of rows in a DataFrame or the number of non-null values in a specific column. The groupBy() function groups the rows of a DataFrame based on one or more columns so that aggregate functions can be applied to the resulting groups.


System Requirements For PySpark Kurtosis and Aggregate Functions

  • Python (3.0 version)

  • Apache Spark (3.1.1 version)

Implementing The Kurtosis(), Min(), Max(), And Mean() Functions in PySpark

Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import kurtosis, min, max, mean


The SparkSession class and the kurtosis, min, max, and mean functions are imported into the environment to perform the kurtosis(), min(), max(), and mean() aggregate functions in PySpark.

Implementing the kurtosis(), min(), max() and mean() functions in Databricks in PySpark
spark = SparkSession.builder.appName('PySpark kurtosis(), min(), max() and mean()').getOrCreate()
Sample_Data = [("Rahul", "Technology", 8000),
("Prateek", "Finance", 7600),
("Ram", "Sales", 5100),
("Reetu", "Marketing", 4000),
("Himesh", "Sales", 2000),
("Shyam", "Finance", 3500),
("Harsh", "Finance", 4900),
("Ramesh", "Marketing", 4000),
("Raina", "Marketing", 3000),
("Ankit", "Sales", 5100)
]
Sample_schema = ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data = Sample_Data, schema = Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using kurtosis function
dataframe.select(kurtosis("salary")).show(truncate=False)
# Using max() function
dataframe.select(max("salary")).show(truncate=False)
# Using min() function
dataframe.select(min("salary")).show(truncate=False)
# Using mean() function
dataframe.select(mean("salary")).show(truncate=False)



The "dataframe" value is created with the Sample_Data and Sample_schema defined above. The kurtosis() function returns the kurtosis of the values present in the salary column. The min() function returns the minimum value present in the salary column, while the max() function returns the maximum value. The mean() function returns the average value present in the salary column.


FAQs

What is an aggregate function in PySpark?

An aggregate function in PySpark is a function that groups data from multiple rows into a single value. It performs operations like sum, count, average, maximum, and minimum on the data.

What do the first() and last() aggregate functions do in PySpark?

The first() and last() aggregate functions in PySpark retrieve the first and last values from a group of values, respectively. These functions operate on columns or expressions and are often used with grouping functions like groupBy() to retrieve each group's first and last values in a DataFrame.

 


