How to filter rows from a dataframe using PySpark

This recipe helps you filter rows from a dataframe using PySpark

Recipe Objective

In this recipe, we learn how to filter rows from a dataframe using PySpark. The filter() function returns a new dataset formed by selecting those elements of the source on which the function returns true, so it retains only the elements that satisfy the given condition. Let us learn how this can be achieved. Data filtering is widely used in data cleaning and transformation in large-scale distributed big data environments: narrowing the results to only the required information reduces query latency over the whole dataset.

Prerequisites:

Before proceeding with the recipe, make sure PySpark is installed on your local EC2 instance, along with a working Hadoop, Hive, and Spark setup (this recipe uses a Cloudera cluster).

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of that instance. Log in to the instance via PuTTY or a terminal and check whether PySpark is installed; if not, install it before proceeding.
  • Type "&ltyour public IP&gt:7180" in the web browser and log in to Cloudera Manager, where you can check if Hadoop, Hive, and Spark are installed.
  • If they are not visible in the Cloudera cluster, you may add the required services to your local instance by clicking "Add Services" in the cluster.

Filtering data in a dataframe using PySpark:

Set up the environment variables for PySpark, Java, Spark, and the Python library, as shown below:

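Since the original screenshot is not available, here is a minimal sketch of what this step typically looks like. All paths below are hypothetical placeholders; replace them with the actual locations on your instance.

    import os
    import sys

    # Hypothetical paths; substitute the actual locations on your instance.
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"
    os.environ["SPARK_HOME"] = "/opt/spark"
    os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

    # Make the PySpark and Py4J libraries bundled with Spark importable.
    # The py4j zip file name varies with the Spark release.
    sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.9-src.zip")
    sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")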

Please note that these paths may vary from one EC2 instance to another. Provide the full paths where these are stored on your instance.

Import SparkSession and initialize it. You can name your application and set the master at this step. We set appName to "demo" and the master to "local" in this recipe.

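The screenshot is unavailable; a minimal sketch of this step, assuming a standard PySpark installation, would be:

    from pyspark.sql import SparkSession

    # Create (or reuse) a Spark session named "demo" that runs locally.
    spark = SparkSession.builder \
        .appName("demo") \
        .master("local") \
        .getOrCreate()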

We demonstrate this recipe using a CSV file, "users.csv," stored in HDFS.
The CSV file is first read and loaded to create a dataframe, and this dataframe is then examined to learn its schema (using the printSchema() method) and to check the data present in it (using show()).

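A sketch of this step; the HDFS path below is a hypothetical placeholder, and the header/inferSchema options are assumptions about how the file was loaded.

    # Hypothetical HDFS path; point this at wherever users.csv lives in your cluster.
    df = spark.read.csv("/user/root/users.csv", header=True, inferSchema=True)

    df.printSchema()  # inspect the inferred schema
    df.show()         # preview the data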

The filter() function selects specific data from the dataframe based on a given condition. Here, we select only those rows from the users.csv file where the user's job is "Engineer."

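A sketch of the filter step, assuming users.csv has a column named "job":

    # Keep only the rows whose "job" column equals "Engineer".
    df.filter(df.job == "Engineer").show()

An equivalent form uses a SQL-style string condition: df.filter("job = 'Engineer'").show().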

You may also store the data selected by a condition in a new dataframe, which can later be used for further querying.

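A sketch of this step; the dataframe name engineers_df is a hypothetical choice.

    # Store the filtered result in a new dataframe for later use.
    engineers_df = df.filter(df.job == "Engineer")

    # Optionally register it as a temporary view so it can be queried with Spark SQL.
    engineers_df.createOrReplaceTempView("engineers")
    engineers_df.show()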

This is how data can be filtered from a dataframe using PySpark.

