Explain the sample and sampleBy functions in PySpark in Databricks

This recipe explains what the sample and sampleBy functions are in PySpark in Databricks.

Recipe Objective - Explain the sample() and sampleBy() functions in PySpark in Databricks

In PySpark, sampling (pyspark.sql.DataFrame.sample()) is the widely used mechanism for getting random sample records from a dataset. It is most helpful when the dataset is large and only a subset of the data needs to be analysed or tested, for example 15% of the original file. The syntax of sample() is "sample(withReplacement=None, fraction=None, seed=None)", in which "fraction" is the fraction of rows to generate, in the range [0.0, 1.0]; it does not guarantee that exactly that fraction of records is returned. The "seed" is used for sampling (a random seed by default) and can be reused to reproduce the same random sample. The "withReplacement" flag controls whether to sample with replacement (default False). The sampleBy() function is used to get stratified sampling in PySpark without replacement. It takes a sampling fraction for each stratum, and if a stratum is not specified in the fractions map, its fraction defaults to zero.
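As a quick, hedged illustration of those parameters, the following is a minimal sketch; the local SparkSession and the small range DataFrame named df are assumptions for illustration only and are separate from the recipe code further below.

# Minimal sketch of the sample() and sampleBy() parameters (df is illustrative only)
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("sampling-sketch").getOrCreate()
df = spark.range(100)

# fraction only: roughly 10% of the rows, not an exact count
print(df.sample(fraction=0.1).count())

# the same seed reproduces the same random sample
print(df.sample(fraction=0.1, seed=42).collect())

# withReplacement=True: the same row may appear more than once
print(df.sample(withReplacement=True, fraction=0.3, seed=42).collect())

# sampleBy(): stratified sampling with a fraction per stratum; unlisted strata default to 0
strata = df.select((df.id % 2).alias("key"))
print(strata.sampleBy("key", fractions={0: 0.1}, seed=42).collect())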

Explore PySpark Machine Learning Tutorial to take your PySpark skills to the next level!

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the sample() and sampleBy() functions are and demonstrates their usage in PySpark.

Implementing the sample() function and sampleBy() function in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import MapType, StringType, StructType, StructField
from pyspark.sql.functions import col

The SparkSession, Row, MapType, StringType, col, StructType, and StructField are imported into the environment so as to use the sample() and sampleBy() functions in PySpark.

# Implementing the sample() function and sampleBy() function in Databricks in PySpark
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("sample() and sampleBy() PySpark") \
    .getOrCreate()

dataframe = spark.range(100)

# Using sample() function
## Fraction only: roughly 6% of the rows, different records on every run
print(dataframe.sample(0.06).collect())
## Fraction with a seed: the same seed returns the same records
print(dataframe.sample(0.1, 123).collect())
print(dataframe.sample(0.1, 123).collect())
print(dataframe.sample(0.1, 456).collect())

# Using withReplacement (may contain duplicates)
## With duplicates
print(dataframe.sample(True, 0.3, 123).collect())
## Without duplicates
print(dataframe.sample(0.3, 123).collect())

# Using sampleBy() function
dataframe2 = dataframe.select((dataframe.id % 3).alias("key"))
print(dataframe2.sampleBy("key", {0: 0.1, 1: 0.2}, 0).collect())

The SparkSession is defined. The "dataframe" is created from a range of 100 numbers, and a 6% sample of its records is requested with the fraction "0.06". Every time the sample() function is run without a seed, it returns a different set of records. The sample() function is then used on the DataFrame with "123" and "456" as seeds: the two calls with seed "123" return the same records, while the call with seed "456" returns a different sample. The sampleBy() method applies a sampling fraction to each stratum, and if a stratum is not specified in the fractions map, its fraction defaults to zero.
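To make the per-stratum fractions concrete, the following is a small, hedged check (a sketch reusing dataframe2 from the code above): it counts how many sampled rows fall into each key. Key 2 is not listed in the fractions map, so its fraction defaults to zero and no rows with key 2 should appear; the counts for keys 0 and 1 are only approximately 10% and 20% of their rows.

# Approximate check of sampleBy() fractions per stratum (counts vary around the targets)
sampled = dataframe2.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0)
sampled.groupBy("key").count().show()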

