How to find frequent items using PySpark


Recipe Objective: How to find frequent items using PySpark?

This recipe teaches us how to find frequent itemsets in a Spark dataframe using PySpark and the FPGrowth estimator. The FP-Growth algorithm finds frequent itemsets without generating candidate sets, which improves performance. It works as follows: first, it compresses the input database into an FP-tree instance that represents the frequent items; it then divides the compressed database into a set of conditional databases, one per frequent item, and mines each of them separately.


Prerequisites:

Before proceeding with the recipe, make sure the following installations are done on your local EC2 instance.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of the EC2 instance. Log in via PuTTY/terminal and check whether PySpark is installed. If it is not installed, please find the links provided above for the installations.
  • Type “<your public IP>:7180” in the web browser and log in to Cloudera Manager, where you can check whether Hadoop, Hive, and Spark are installed.
  • If they are not visible in the Cloudera cluster, you may add them by clicking “Add Services” in the cluster to add the required services to your local instance.

Finding frequent items:

Set up the environment variables for PySpark, Java, Spark, and the Python library, as shown below:

[Screenshot: setting the environment variables]

Please note that these paths may vary in one’s EC2 instance. Provide the full path where these are stored in your instance.
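As a rough illustration, these variables can also be set from Python before importing PySpark. Every path below is an assumption and must be replaced with the actual install locations on your instance:

```python
import os
import sys

# NOTE: all paths below are assumptions -- substitute the real
# install locations from your own EC2 instance.
os.environ["SPARK_HOME"] = "/usr/lib/spark"
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"

# Make Spark's bundled Python packages importable (the py4j version
# varies between Spark releases -- check the lib directory yourself).
pylib = os.path.join(os.environ["SPARK_HOME"], "python", "lib")
sys.path.insert(0, os.path.join(pylib, "py4j-0.10.9-src.zip"))
sys.path.insert(0, os.path.join(pylib, "pyspark.zip"))
```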

Import the Spark session and initialize it. You can name your application and set the master at this step. In this recipe, we set appName to “demo” and the master to “local”.

[Screenshot: initializing the Spark session]

Import the necessary libraries and create a dataframe with two columns, “id” and “items”, containing sample data.

[Screenshot: creating the sample dataframe]

We see from the sample data that some items are ordered repeatedly; let us now find the frequency of each itemset. Before doing this, we fit an FPGrowth model on the dataframe by calling the estimator's fit() method.

In this recipe, we check:

  • the frequency of each set of items in the dataframe, using the model's freqItemsets attribute,

[Screenshot: frequent itemsets output]

  • the association rules mined from the dataframe, which report the antecedent, consequent, and confidence of each rule, using the model's associationRules attribute, and

[Screenshot: association rules output]

  • the result of calling the transform() function on the dataframe, which examines the input items against all the association rules and summarizes the consequents as predictions.

[Screenshot: transform() predictions output]

