sum, sumDistinct, variance, var_samp, var_pop | Aggregate functions | Databricks

Here is a detailed description of what the sum(), sumDistinct(), variance(), var_samp() and var_pop() aggregate functions in Databricks do.

Recipe Objective - Explain sum(), sumDistinct(), variance(), var_samp() and var_pop() aggregate functions in Databricks

Aggregate functions in Apache PySpark accept input as a Column or a column name as a string, take several other arguments depending on the function, and return a Column. Aggregate functions operate on a group of rows and calculate a single return value for every group. In PySpark SQL these functions are grouped under "agg_funcs". The sum() function returns the sum of all values in a column. The sumDistinct() function returns the sum of all distinct values in a column. The variance() function is an alias for var_samp(). The var_samp() function returns the unbiased (sample) variance of the values in a column, while var_pop() returns the population variance.
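To make the var_samp()/var_pop() distinction concrete, here is a minimal pure-Python sketch of the two formulas (plain Python rather than PySpark, so the arithmetic is easy to follow; the function names are illustrative, not part of any library):

```python
# Sample (unbiased) vs. population variance, computed by hand.
# var_samp divides the sum of squared deviations by n - 1;
# var_pop divides the same sum by n.

def var_samp_py(values):
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def var_pop_py(values):
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

data = [4, 8, 6, 2]          # mean = 5, squared deviations sum to 20
print(var_samp_py(data))     # 20 / 3 ≈ 6.67
print(var_pop_py(data))      # 20 / 4 = 5.0
```

PySpark's variance()/var_samp() and var_pop() apply these same two formulas to a DataFrame column.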


System Requirements

  • Python (3.x)
  • Apache Spark (3.1.1)

This recipe explains what sum(), sumDistinct(), variance(), var_samp() and var_pop() are and how to use them in PySpark.

Implementing the sum(), sumDistinct(), variance(), var_samp() and var_pop() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
# Note: this import shadows Python's built-in sum() in this module
from pyspark.sql.functions import sum
from pyspark.sql.functions import sumDistinct
from pyspark.sql.functions import variance, var_samp, var_pop

The SparkSession, sum, sumDistinct, variance, var_samp and var_pop packages are imported into the environment to perform the sum(), sumDistinct(), variance(), var_samp() and var_pop() functions in PySpark.

# Implementing the sum(), sumDistinct(), variance(), var_samp() and var_pop() functions in Databricks in PySpark
spark = SparkSession.builder.appName('PySpark sum() sumDistinct() variance() var_samp() and var_pop()').getOrCreate()
Sample_Data = [("Rahul", "Technology", 8000),
("Prateek", "Finance", 7600),
("Ram", "Sales", 5100),
("Reetu", "Marketing", 4000),
("Himesh", "Sales", 2000),
("Shyam", "Finance", 3500),
("Harsh", "Finance", 4900),
("Ramesh", "Marketing", 4000),
("Raina", "Marketing", 3000),
("Ankit", "Sales", 5100)
]
Sample_schema = ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data = Sample_Data, schema = Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using sum() function
dataframe.select(sum("salary")).show(truncate=False)
# Using sumDistinct() function
dataframe.select(sumDistinct("salary")).show(truncate=False)
# Using variance(), var_samp() and var_pop() functions
dataframe.select(variance("salary"),var_samp("salary"),var_pop("salary")) \
.show(truncate=False)

The "dataframe" is created from the defined Sample_Data and Sample_schema. The sum() function returns the sum of all values in the "salary" column. The sumDistinct() function returns the sum of all distinct values in the "salary" column. The variance() function is an alias for var_samp(). The var_samp() function returns the unbiased (sample) variance of the values in the "salary" column, while var_pop() returns their population variance.
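The values Spark prints for this sample data can be cross-checked by hand. The sketch below reproduces each aggregate in plain Python (no Spark session needed); the variable names are illustrative, not part of the recipe:

```python
# Reproducing the expected aggregate values for the sample salaries by hand.
salaries = [8000, 7600, 5100, 4000, 2000, 3500, 4900, 4000, 3000, 5100]

total = sum(salaries)                 # matches sum("salary")
distinct_total = sum(set(salaries))   # matches sumDistinct("salary")

n = len(salaries)
mean = total / n
squared_devs = sum((x - mean) ** 2 for x in salaries)
sample_var = squared_devs / (n - 1)   # matches variance() / var_samp()
population_var = squared_devs / n     # matches var_pop()

print(total)           # 47200
print(distinct_total)  # 38100 (the duplicate 5100 and 4000 count once)
print(sample_var)      # 3584000.0
print(population_var)  # 3225600.0
```

Checking the DataFrame output against a small hand computation like this is a quick way to confirm which variance flavor (sample vs. population) a given function returns.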

