stddev, stddev_samp, stddev_pop | Aggregate functions | Databricks

This Databricks tutorial explains the skewness, sample standard deviation, and population standard deviation aggregate functions, covering the system requirements and how to implement them in Python.

Recipe Objective - Explain the skewness(), stddev(), stddev_samp() and stddev_pop() aggregate functions in Databricks

Aggregate functions in Apache PySpark accept input as a Column type or a column name as a string, take additional arguments depending on the function, and return a Column type. Aggregate functions operate on a group of rows and calculate a single return value for every group. The PySpark SQL aggregate functions are grouped under "agg_funcs" in PySpark. The skewness() function returns the skewness of the values in the group. The stddev() function is an alias for stddev_samp(). The stddev_samp() function returns the sample standard deviation of the values in the column, while the stddev_pop() function returns the population standard deviation.
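The only difference between the two standard deviations is the divisor: the sample standard deviation divides the sum of squared deviations by n - 1 (Bessel's correction), while the population standard deviation divides by n. As a minimal cross-check (not part of the recipe itself), Python's standard-library statistics module computes the same two quantities on the salary values used later in this recipe:

# Cross-checking stddev_samp() and stddev_pop() with the standard library.
# statistics.stdev() divides by n - 1 (sample standard deviation);
# statistics.pstdev() divides by n (population standard deviation).
import statistics

salaries = [8000, 7600, 5100, 4000, 2000, 3500, 4900, 4000, 3000, 5100]
print(statistics.stdev(salaries))   # should match stddev_samp("salary")
print(statistics.pstdev(salaries))  # should match stddev_pop("salary")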


System Requirements

  • Python 3.0
  • Apache Spark 3.1.1

This recipe explains the skewness(), stddev(), stddev_samp() and stddev_pop() functions and how to use them in PySpark.

Implementing the skewness(), stddev(), stddev_samp() and stddev_pop() functions in Databricks using PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import skewness, stddev, stddev_samp, stddev_pop

The SparkSession class and the skewness, stddev, stddev_samp, and stddev_pop functions are imported into the environment so that the skewness(), stddev(), stddev_samp() and stddev_pop() functions can be used in PySpark.

# Implementing the skewness(), stddev(), stddev_samp() and stddev_pop() functions in Databricks in PySpark
spark = SparkSession.builder.appName('PySpark skewness() stddev() stddev_samp() and stddev_pop()').getOrCreate()
Sample_Data = [("Rahul", "Technology", 8000),
("Prateek", "Finance", 7600),
("Ram", "Sales", 5100),
("Reetu", "Marketing", 4000),
("Himesh", "Sales", 2000),
("Shyam", "Finance", 3500),
("Harsh", "Finance", 4900),
("Ramesh", "Marketing", 4000),
("Raina", "Marketing", 3000),
("Ankit", "Sales", 5100)
]
Sample_schema = ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data = Sample_Data, schema = Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using skewness() function
dataframe.select(skewness("salary")).show(truncate=False)
# Using stddev(), stddev_samp() and stddev_pop() functions
dataframe.select(stddev("salary"), stddev_samp("salary"), \
                 stddev_pop("salary")).show(truncate=False)

The "dataframe" value is created in which the Sample_data and Sample_schema are defined. Using the skewness() function returns the skewness of the values present in the salary group. The stddev() function is the alias for "stddev_samp". The stddev_samp() function returns the sample standard deviation of values present in the salary column. The stddev_pop() function returns the population standard deviation of the values present in the salary column.
