Explain the sample and sampleBy functions in PySpark in Databricks

This recipe explains what the sample() and sampleBy() functions do in PySpark in Databricks.

Recipe Objective - Explain the sample() and sampleBy() functions in PySpark in Databricks

In PySpark, sampling (pyspark.sql.DataFrame.sample()) is the widely used mechanism for drawing random sample records from a dataset. It is most helpful when the dataset is large and the analysis or testing only requires a subset of the data, for example 15% of the original file. The syntax of sample() is "sample(withReplacement=None, fraction=None, seed=None)". The "fraction" parameter is the fraction of rows to generate, in the range [0.0, 1.0]; it does not guarantee that exactly that fraction of records is returned. The "seed" parameter (a random seed by default) is used to reproduce the same random sampling: the same seed returns the same sample. The "withReplacement" parameter controls whether rows are sampled with replacement (default False).

The sampleBy() function, in contrast, is widely used to get stratified sampling in PySpark without replacement. It takes a sampling fraction for each stratum, and if a stratum is not specified in the fractions map, its fraction defaults to zero.


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the sample() and sampleBy() functions are and demonstrates their usage in PySpark.

Implementing the sample() function and sampleBy() function in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession

The SparkSession is imported into the environment so as to use the sample() function and sampleBy() function in PySpark.

# Implementing the sample() function and sampleBy() function in Databricks in PySpark
spark = SparkSession.builder \
.master("local[1]") \
.appName("sample() and sampleBy() PySpark") \
.getOrCreate()
dataframe = spark.range(100)
print(dataframe.sample(0.06).collect())
# Using sample() function
print(dataframe.sample(0.1,123).collect())
print(dataframe.sample(0.1,123).collect())
print(dataframe.sample(0.1,456).collect())
# Using the withReplacement(May contain duplicates)
## With Duplicates
print(dataframe.sample(True,0.3,123).collect())
## Without Duplicates
print(dataframe.sample(0.3,123).collect())
# Using sampleBy() function
dataframe2 = dataframe.select((dataframe.id % 3).alias("key"))
print(dataframe2.sampleBy("key", {0: 0.1, 1: 0.2},0).collect())

The Spark Session is defined. The "dataframe" is created with a range of 100 numbers (0 to 99), and sample(0.06) requests roughly 6% of the records. Every time sample() runs without a seed, it returns a different set of records. The sample() function is then used on the dataframe with "123" and "456" as seeds: with seed "123", repeated calls return the same sample, while seed "456" returns a different but equally reproducible sample. Passing True as the first argument samples with replacement, so the result may contain duplicates. The sampleBy() method returns a stratified sample with the given sampling fraction for each stratum; if a stratum is not specified in the fractions map, its fraction defaults to zero.

