How to filter rows from a dataframe using PySpark

This recipe helps you filter rows from a dataframe using PySpark

Recipe Objective

In this recipe, we learn how to filter rows from a dataframe using PySpark. The filter() function returns a new dataset formed by selecting those elements of the source on which the function returns true, so it retains only the elements that satisfy the given condition. Let us learn how this can be achieved. Data filtering is widely used in data cleaning and transformation in large-scale distributed big data environments: narrowing the results to only the required information reduces query latency over the whole dataset.

Prerequisites:

Before proceeding with the recipe, make sure PySpark is installed on your local EC2 instance, along with a working Hadoop, Hive, and Spark setup (this recipe uses a Cloudera cluster).

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of that instance. Log in to the instance via PuTTY or a terminal and check whether PySpark is installed; if not, install it before proceeding.
  • Type "&ltyour public IP&gt:7180" in the web browser and log in to Cloudera Manager, where you can check if Hadoop, Hive, and Spark are installed.
  • If they are not visible in the Cloudera cluster, you may add the required services to your local instance by clicking "Add Services" in the cluster.

Filtering data in a dataframe using PySpark:

Set up the environment variables for PySpark, Java, Spark, and the Python library, as shown below:

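Since the original screenshot is not available, here is a minimal sketch of what this step typically looks like. All paths below are hypothetical placeholders; replace them with the actual locations on your instance.

    import os
    import sys

    # Hypothetical paths; substitute the actual locations on your instance.
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"
    os.environ["SPARK_HOME"] = "/opt/spark"
    os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

    # Make the PySpark and Py4J libraries bundled with Spark importable.
    # The py4j zip file name varies with the Spark release.
    sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.9-src.zip")
    sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")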

Please note that these paths may vary from one EC2 instance to another. Provide the full paths where these are stored on your instance.

Import SparkSession and initialize it. You can name your application and set the master at this step. We set appName to "demo" and the master to "local" in this recipe.

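The screenshot is unavailable; a minimal sketch of this step, assuming a standard PySpark installation, would be:

    from pyspark.sql import SparkSession

    # Create (or reuse) a Spark session named "demo" that runs locally.
    spark = SparkSession.builder \
        .appName("demo") \
        .master("local") \
        .getOrCreate()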

We demonstrate this recipe using a CSV file, "users.csv," stored in HDFS.
The CSV file is first read and loaded to create a dataframe, and this dataframe is then examined to learn its schema (using the printSchema() method) and to check the data present in it (using show()).

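A sketch of this step; the HDFS path below is a hypothetical placeholder, and the header/inferSchema options are assumptions about how the file was loaded.

    # Hypothetical HDFS path; point this at wherever users.csv lives in your cluster.
    df = spark.read.csv("/user/root/users.csv", header=True, inferSchema=True)

    df.printSchema()  # inspect the inferred schema
    df.show()         # preview the data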

The filter() function selects specific data from the dataframe based on a given condition. Here, we select only those rows from the users.csv file where the user's job is "Engineer."

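A sketch of the filter step, assuming users.csv has a column named "job":

    # Keep only the rows whose "job" column equals "Engineer".
    df.filter(df.job == "Engineer").show()

An equivalent form uses a SQL-style string condition: df.filter("job = 'Engineer'").show().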

You may also store the data selected by a condition in a new dataframe, which can later be used for further querying.

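A sketch of this step; the dataframe name engineers_df is a hypothetical choice.

    # Store the filtered result in a new dataframe for later use.
    engineers_df = df.filter(df.job == "Engineer")

    # Optionally register it as a temporary view so it can be queried with Spark SQL.
    engineers_df.createOrReplaceTempView("engineers")
    engineers_df.show()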

This is how data can be filtered from a dataframe using PySpark.

