sum, sumDistinct, variance, var_samp, var_pop | Aggregate functions | Databricks

Here is a detailed description of what the sum(), sumDistinct(), variance(), var_samp() and var_pop() aggregate functions do in Databricks.

Recipe Objective - Explain the sum(), sumDistinct(), variance(), var_samp() and var_pop() aggregate functions in Databricks

Aggregate functions in Apache PySpark accept input as a Column type or a column name as a string, take several additional arguments depending on the function, and return a Column type. Aggregate functions operate on a group of rows and calculate a single return value for every group; in PySpark SQL they are grouped under "agg_funcs". The sum() function returns the sum of all values in the column. The sumDistinct() function returns the sum of the distinct values in the column. The variance() function is an alias for var_samp(). The var_samp() function returns the unbiased (sample) variance of the values in the column, while var_pop() returns the population variance.
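Before looking at the Spark code, the difference between sum() and sumDistinct() can be sketched in plain Python, with no Spark needed: sumDistinct() deduplicates the values before summing. The values below are hypothetical, chosen only to show the effect of duplicates.

```python
# Hypothetical salaries containing duplicate values
values = [4000, 4000, 5100, 5100, 3000]

total = sum(values)                # like sum(): every row counts, 21200
distinct_total = sum(set(values))  # like sumDistinct(): each value once, 12100

print(total, distinct_total)
```

Because 4000 and 5100 each appear twice, the distinct sum is smaller by exactly one copy of each duplicated value.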


System Requirements

  • Python (3.6 or higher)
  • Apache Spark (3.1.1 version)

This recipe explains what sum(), sumDistinct(), variance(), var_samp() and var_pop() are and how to use them in PySpark.

Implementing the sum(), sumDistinct(), variance(), var_samp() and var_pop() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
# Note: importing sum this way shadows Python's built-in sum() in this module
from pyspark.sql.functions import sum, sumDistinct, variance, var_samp, var_pop

The SparkSession, sum, sumDistinct, variance, var_samp and var_pop packages are imported into the environment to perform the sum(), sumDistinct(), variance(), var_samp() and var_pop() functions in PySpark.

# Implementing the sum(), sumDistinct(), variance(), var_samp() and var_pop() functions in Databricks in PySpark
spark = SparkSession.builder \
    .appName('PySpark sum() sumDistinct() variance() var_samp() and var_pop()') \
    .getOrCreate()
Sample_Data = [("Rahul", "Technology", 8000),
               ("Prateek", "Finance", 7600),
               ("Ram", "Sales", 5100),
               ("Reetu", "Marketing", 4000),
               ("Himesh", "Sales", 2000),
               ("Shyam", "Finance", 3500),
               ("Harsh", "Finance", 4900),
               ("Ramesh", "Marketing", 4000),
               ("Raina", "Marketing", 3000),
               ("Ankit", "Sales", 5100)]
Sample_schema = ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data=Sample_Data, schema=Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using the sum() function
dataframe.select(sum("salary")).show(truncate=False)
# Using the sumDistinct() function
dataframe.select(sumDistinct("salary")).show(truncate=False)
# Using the variance(), var_samp() and var_pop() functions
dataframe.select(variance("salary"), var_samp("salary"), var_pop("salary")) \
    .show(truncate=False)
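As a sanity check, the same aggregates can be reproduced on the sample salaries with Python's standard statistics module, no Spark required: statistics.variance() corresponds to variance()/var_samp(), and statistics.pvariance() corresponds to var_pop().

```python
import statistics

# The salary column from Sample_Data above
salaries = [8000, 7600, 5100, 4000, 2000, 3500, 4900, 4000, 3000, 5100]

print("sum:", sum(salaries))                                  # matches sum()
print("distinct sum:", sum(set(salaries)))                    # matches sumDistinct()
print("sample variance:", statistics.variance(salaries))      # matches variance()/var_samp()
print("population variance:", statistics.pvariance(salaries)) # matches var_pop()
```

The totals come out to 47200 and 38100 (the duplicated 4000 and 5100 are counted once in the distinct sum), the sample variance is 3584000, and the population variance is 3225600, matching the Spark output.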

The "dataframe" value is created from the defined Sample_Data and Sample_schema. The sum() function returns the sum of all values in the "salary" column. The sumDistinct() function returns the sum of the distinct values in the salary column. The variance() function is an alias for var_samp(). The var_samp() function returns the unbiased (sample) variance of the values in the salary column, while var_pop() returns their population variance.
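The relationship between the two variance flavours can be stated directly: for n values, the unbiased sample variance (var_samp) equals the population variance (var_pop) scaled by n/(n-1), because the former divides the summed squared deviations by n-1 and the latter by n. A quick plain-Python check (the data list here is arbitrary; any sample with at least two values works):

```python
import math
import statistics

data = [2000, 3500, 4000, 4900]  # arbitrary sample values
n = len(data)

sample_var = statistics.variance(data)       # divides squared deviations by n - 1
population_var = statistics.pvariance(data)  # divides squared deviations by n

# var_samp == var_pop * n / (n - 1)
assert math.isclose(sample_var, population_var * n / (n - 1))
```

This is why variance()/var_samp() always reports a slightly larger value than var_pop() on the same column.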

