How to find top 10 rows by random sampling using pandas?

This recipe helps you find the top 10 rows by random sampling using pandas.


Recipe Objective

In most big data scenarios, random sampling is a method of selecting items from a dataset in which every item has an equal probability of being picked. It is used to give an unbiased representation of the population. In this recipe, we sample the data based on a random ranking criterion and pick the top 10 rows per group after sampling.
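As a point of reference, pandas also ships a built-in DataFrame.sample method that draws rows with equal probability; here is a minimal sketch on a hypothetical toy dataframe, before we build the ranking-based approach below.

import pandas as pd

# Hypothetical toy dataframe; each of its 100 rows has an equal
# chance of being picked by sample()
toy = pd.DataFrame({'item': range(100)})
print(toy.sample(n=10, random_state=42))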

System requirements:

  • Install the following python modules if they are not already present:
  • pip install pandas
  • pip install numpy
  • The below code can be run in a Jupyter notebook or any python console.

Step 1: Import the modules

In this scenario we are going to use the pandas, numpy, and random libraries. Import them as below:

import pandas as pd
import random
import numpy

Step 2: Prepare the dataset

Here we are using a food-prices dataset stored as a comma-separated values (CSV) file.

Read the csv file from the local machine using the pandas library and store the data as a dataframe, as below:

path = 'C:\\Users\\DELL E7440\\Desktop\\food_prices.csv'
df = pd.read_csv(path)
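If the food_prices.csv file is not at hand, a small hypothetical stand-in dataframe with the columns this recipe relies on (status, year, series_reference) makes the remaining steps runnable; the values below are made up purely for illustration.

# Hypothetical stand-in for food_prices.csv (made-up values)
df = pd.DataFrame({
    'series_reference': ['A', 'A', 'B', 'B'] * 10,
    'status': ['REVISED', 'FINAL'] * 20,
    'year': [2018, 2019, 2020, 2021] * 10,
})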

After loading the csv file into a dataframe, filter it to keep only the records whose status is 'REVISED' and load the result back into the dataframe, as below:

df = df[(df["status"] == 'REVISED')]

Create random-number columns to use as a ranking criterion, as below:

df['random_between'] = df.apply(lambda row: random.randint(1, row.year), axis=1)
df['random'] = df.apply(lambda row: random.uniform(0, 1), axis=1)

A new random number, which is the sum of the above two columns, is generated as a separate column.

df['random_number'] = df['random_between'] + df['random']
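Since numpy is already imported, the same columns can also be generated without the row-wise apply calls; this vectorized sketch is an alternative, not the recipe's own code, and assumes the year column holds integers of at least 1.

import numpy as np

# Vectorized equivalents of the two apply() calls above (assumes year >= 1)
df['random_between'] = np.random.randint(1, df['year'].to_numpy() + 1)
df['random'] = np.random.random(len(df))
df['random_number'] = df['random_between'] + df['random']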

The above random number is used as a ranking criterion for the data, which is grouped by the 'series_reference' column.

df_grouped = df.groupby(['series_reference'])

Rank the rows in each group based on the generated random number:

df['Rank'] = df_grouped['random_number'].rank(method='min')
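To see how method='min' behaves, here is a tiny standalone example: tied values all receive the smallest rank of their tie group (ties are unlikely with random floats, but this shows the convention).

import pandas as pd

s = pd.Series([0.2, 0.5, 0.2, 0.9])
print(s.rank(method='min').tolist())  # [1.0, 3.0, 1.0, 4.0]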

The rows that have a rank < 11, i.e. the top 10 in each group, are captured into the final results dataframe.

df = df[(df.Rank<11)]
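As an aside, newer pandas versions (1.1 and later) can draw 10 random rows per group in a single call, which achieves the same end result as the rank-and-filter approach above; a minimal sketch, assuming every group has at least 10 rows:

# One-call alternative on pandas >= 1.1; raises an error if any
# group has fewer than 10 rows
top10_per_group = df.groupby('series_reference').sample(n=10, random_state=42)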

The completed code, written together in a function to get the random records, is as follows:

import pandas as pd
import random
import numpy

def get_food_data_submission():
    # Path to the input csv file
    path = 'C:\\Users\\DELL E7440\\Desktop\\food_prices.csv'
    try:
        # Read the csv file and convert it to a pandas dataframe
        df = pd.read_csv(path)
        # Filter the dataframe for revised records
        df = df[(df["status"] == 'REVISED')]
        # Create random number columns as a ranking criterion
        df['random_between'] = df.apply(lambda row: random.randint(1, row.year), axis=1)
        df['random'] = df.apply(lambda row: random.uniform(0, 1), axis=1)
        df['random_number'] = df['random_between'] + df['random']
        # Group the data by 'series_reference'
        df_grouped = df.groupby(['series_reference'])
        # Rank the rows in each group based on the generated random numbers
        df['Rank'] = df_grouped['random_number'].rank(method='min')
        # Fetch the top 10 records in each group
        df = df[(df.Rank < 11)]
        return df
    except Exception as exp:
        print("Error occurred, ", exp)

# Calling the function to return the results
get_food_data_submission()
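Because the function relies on the random module, repeated runs return different rows; if reproducible output is wanted, seeding the generator before the call is one option (a small sketch, not part of the original recipe):

import random

random.seed(42)  # fixes random.randint / random.uniform, so reruns match
sampled_df = get_food_data_submission()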

The output of the above code is the sampled dataframe, containing the top 10 randomly ranked rows for each series_reference group.
