Explain the sample and sampleBy functions in PySpark in Databricks

This recipe explains what the sample and sampleBy functions are in PySpark in Databricks.

Recipe Objective - Explain the sample() and sampleBy() functions in PySpark in Databricks

In PySpark, sampling (pyspark.sql.DataFrame.sample()) is the widely used mechanism for getting random sample records from a dataset. It is most helpful when the dataset is large and only a subset of the data needs to be analysed or tested, for example 15% of the original file. The syntax of sample() is "sample(withReplacement=None, fraction=None, seed=None)", in which "fraction" is the fraction of rows to generate, in the range [0.0, 1.0]; it does not guarantee that exactly that fraction of records is returned. The "seed" is used for sampling (a random seed by default) and can be reused to reproduce the same random sample. The "withReplacement" flag controls whether to sample with replacement (default False). The sampleBy() function is used to get stratified sampling in PySpark without replacement. It takes a sampling fraction for each stratum, and if a stratum is not specified in the fractions map, its fraction defaults to zero.
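As a quick, hedged illustration of those parameters, the following is a minimal sketch; the local SparkSession and the small range DataFrame named df are assumptions for illustration only and are separate from the recipe code further below.

# Minimal sketch of the sample() and sampleBy() parameters (df is illustrative only)
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("sampling-sketch").getOrCreate()
df = spark.range(100)

# fraction only: roughly 10% of the rows, not an exact count
print(df.sample(fraction=0.1).count())

# the same seed reproduces the same random sample
print(df.sample(fraction=0.1, seed=42).collect())

# withReplacement=True: the same row may appear more than once
print(df.sample(withReplacement=True, fraction=0.3, seed=42).collect())

# sampleBy(): stratified sampling with a fraction per stratum; unlisted strata default to 0
strata = df.select((df.id % 2).alias("key"))
print(strata.sampleBy("key", fractions={0: 0.1}, seed=42).collect())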

Explore PySpark Machine Learning Tutorial to take your PySpark skills to the next level!

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the sample() and sampleBy() functions are and demonstrates their usage in PySpark.

Implementing the sample() function and sampleBy() function in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import MapType, StringType, StructType, StructField
from pyspark.sql.functions import col

The SparkSession, Row, MapType, StringType, col, StructType, and StructField are imported into the environment so as to use the sample() and sampleBy() functions in PySpark.

# Implementing the sample() function and sampleBy() function in Databricks in PySpark
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("sample() and sampleBy() PySpark") \
    .getOrCreate()

dataframe = spark.range(100)

# Using sample() function
## Fraction only: roughly 6% of the rows, different records on every run
print(dataframe.sample(0.06).collect())
## Fraction with a seed: the same seed returns the same records
print(dataframe.sample(0.1, 123).collect())
print(dataframe.sample(0.1, 123).collect())
print(dataframe.sample(0.1, 456).collect())

# Using withReplacement (may contain duplicates)
## With duplicates
print(dataframe.sample(True, 0.3, 123).collect())
## Without duplicates
print(dataframe.sample(0.3, 123).collect())

# Using sampleBy() function
dataframe2 = dataframe.select((dataframe.id % 3).alias("key"))
print(dataframe2.sampleBy("key", {0: 0.1, 1: 0.2}, 0).collect())

The SparkSession is defined. The "dataframe" is created from a range of 100 numbers, and a 6% sample of its records is requested with the fraction "0.06". Every time the sample() function is run without a seed, it returns a different set of records. The sample() function is then used on the DataFrame with "123" and "456" as seeds: the two calls with seed "123" return the same records, while the call with seed "456" returns a different sample. The sampleBy() method applies a sampling fraction to each stratum, and if a stratum is not specified in the fractions map, its fraction defaults to zero.
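To make the per-stratum fractions concrete, the following is a small, hedged check (a sketch reusing dataframe2 from the code above): it counts how many sampled rows fall into each key. Key 2 is not listed in the fractions map, so its fraction defaults to zero and no rows with key 2 should appear; the counts for keys 0 and 1 are only approximately 10% and 20% of their rows.

# Approximate check of sampleBy() fractions per stratum (counts vary around the targets)
sampled = dataframe2.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0)
sampled.groupBy("key").count().show()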

