Explain the sample and sampleBy functions in PySpark in Databricks

This recipe explains what the sample() and sampleBy() functions do in PySpark in Databricks.

Recipe Objective - Explain the sample() and sampleBy() functions in PySpark in Databricks

In PySpark, sampling (pyspark.sql.DataFrame.sample()) is the widely used mechanism for drawing random sample records from a dataset. It is most helpful when the dataset is large and the analysis or testing only requires a subset of the data, for example 15% of the original file. The syntax of sample() is "sample(withReplacement=None, fraction=None, seed=None)". The "fraction" parameter is the fraction of rows to generate, in the range [0.0, 1.0]; it does not guarantee that exactly that fraction of records is returned. The "seed" parameter (a random seed by default) is used to reproduce the same random sampling: the same seed returns the same sample. The "withReplacement" parameter controls whether rows are sampled with replacement (default False).

The sampleBy() function, in contrast, is widely used to get stratified sampling in PySpark without replacement. It takes a sampling fraction for each stratum, and if a stratum is not specified in the fractions map, its fraction defaults to zero.


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the sample() and sampleBy() functions are and demonstrates their usage in PySpark.

Implementing the sample() function and sampleBy() function in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession

The SparkSession is imported into the environment so as to use the sample() function and sampleBy() function in PySpark.

# Implementing the sample() function and sampleBy() function in Databricks in PySpark
spark = SparkSession.builder \
.master("local[1]") \
.appName("sample() and sampleBy() PySpark") \
.getOrCreate()
dataframe = spark.range(100)
print(dataframe.sample(0.06).collect())
# Using sample() function
print(dataframe.sample(0.1,123).collect())
print(dataframe.sample(0.1,123).collect())
print(dataframe.sample(0.1,456).collect())
# Using the withReplacement(May contain duplicates)
## With Duplicates
print(dataframe.sample(True,0.3,123).collect())
## Without Duplicates
print(dataframe.sample(0.3,123).collect())
# Using sampleBy() function
dataframe2 = dataframe.select((dataframe.id % 3).alias("key"))
print(dataframe2.sampleBy("key", {0: 0.1, 1: 0.2},0).collect())

The Spark Session is defined. The "dataframe" is created with a range of 100 numbers (0 to 99), and sample(0.06) requests roughly 6% of the records. Every time sample() runs without a seed, it returns a different set of records. The sample() function is then used on the dataframe with "123" and "456" as seeds: with seed "123", repeated calls return the same sample, while seed "456" returns a different but equally reproducible sample. Passing True as the first argument samples with replacement, so the result may contain duplicates. The sampleBy() method returns a stratified sample with the given sampling fraction for each stratum; if a stratum is not specified in the fractions map, its fraction defaults to zero.

