How to drop out highly correlated features in Python?

This recipe provides information on how to drop out highly correlated features in Python.
Last Updated: 05 Apr 2023


Table of Contents
Recipe Objective - How to Drop Highly Correlated Features in Python?
Why Remove Highly Correlated Variables?
Steps to Find Correlation between the Features and Remove the Highly Correlated Features
How to Remove Correlated Features in Python?
FAQs

Recipe Objective - How to Drop Highly Correlated Features in Python?

Do you want to improve the performance of your machine-learning model? One common preprocessing step that can help is dropping highly correlated features. Whether you are a beginner or an experienced data scientist, this recipe walks through techniques for finding and removing them.

Why Remove Highly Correlated Variables?

In data science, it's essential to understand how to deal with highly correlated features. Highly correlated features are variables that have a strong linear relationship with each other. When two or more variables are highly correlated, they carry almost the same information, so including all of them in a model is redundant. This can lead to a problem known as multicollinearity, where it becomes difficult to determine the independent effect of each variable on the target variable. Removing one feature from each highly correlated group reduces this redundancy and can make a model simpler, faster to train, and easier to interpret, which is why it is a common step in data preprocessing.
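
To make the multicollinearity problem concrete, here is a minimal sketch on synthetic data (the variables x1, x2 and the coefficient 3.0 are made up for illustration, and no intercept is fitted because the data are centred around zero). When x2 is nearly a copy of x1, a least-squares fit can estimate the combined effect well but cannot split it reliably between the two features.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly a copy of x1
y = 3.0 * x1 + rng.normal(size=n)          # only x1 truly drives y

# Ordinary least squares on both features at once.
X = np.column_stack([x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

correlation = np.corrcoef(x1, x2)[0, 1]
condition_number = np.linalg.cond(X)

print("corr(x1, x2):", round(correlation, 4))
print("individual coefficients:", coef)      # individually unstable
print("sum of coefficients:", coef.sum())    # close to the true total of 3
print("condition number:", condition_number)
```

The individual coefficients can land far from 3 and 0 and change a lot with a different random seed, while their sum stays close to 3. The large condition number is another symptom of the same redundancy.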

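This data science Python source code does the following:
1. Calculates the correlation between different features.
2. Drops highly correlated features to reduce redundancy and dimensionality.
3. Measures linear (Pearson) correlation, the pandas default; rank-based methods such as Spearman are an option for monotonic non-linear relationships.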

So we must find the correlation between the features and remove those whose absolute correlation coefficient with another feature is greater than a chosen threshold.
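
The difference between linear and non-linear correlation can be seen in a small sketch on synthetic data (the column names and the cubic relationship are assumptions chosen for illustration). The pandas corr() method takes a method argument: "pearson" is the default, and "spearman" compares ranks instead of raw values.

```python
import numpy as np
import pandas as pd

# Synthetic data: y is a monotonic but non-linear function of x.
x = np.linspace(-3, 3, 101)
demo = pd.DataFrame({"x": x, "y": x ** 3})

# Pearson measures linear association only.
pearson = demo.corr(method="pearson").loc["x", "y"]
# Spearman works on ranks, so a strictly increasing relationship scores 1.
spearman = demo.corr(method="spearman").loc["x", "y"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

Here the Pearson value stays below 1 even though y is fully determined by x, so a Pearson-based threshold may miss redundant features whose relationship is curved.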

Steps to Find Correlation between the Features and Remove the Highly Correlated Features

So, here are the essential steps for finding the correlation between the features and removing the highly correlated features:

Step 1 - Import the library
Step 2 - Setup the Data
Step 3 - Creating the Correlation matrix and Selecting the Upper triangular matrix
Step 5 - Dropping the column with a high correlation
Step 6 - Analysing the output


Step 1 - Import the library

import pandas as pd
import numpy as np
from sklearn import datasets

We have imported numpy, pandas and datasets. We will use datasets to get the inbuilt iris dataset.

Step 2 - Setup the Data

We have used datasets to load the inbuilt iris dataset and created objects X and y to store the data and the target value, respectively. With the data in X, we have created a dataframe and printed the first five rows.

iris = datasets.load_iris()
X = iris.data
y = iris.target
df = pd.DataFrame(X)
print(df.head())

Step 3 - Creating the Correlation matrix and Selecting the Upper triangular matrix

So now we are creating a square matrix with dimensions equal to the number of features. Its elements are the absolute correlation values between the features.

cor_matrix = df.corr().abs()
print(cor_matrix)

Note that the correlation matrix is symmetric about the diagonal, and all the diagonal elements are 1. So it does not matter whether we select the upper or lower triangular part of the correlation matrix, but we should not include the diagonal elements. Here we select the upper triangular part. (The mask is cast with the built-in bool, because the np.bool alias was removed from recent NumPy releases.)

upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
print(upper_tri)
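
To see what the mask does, here is a small sketch on a hypothetical 3x3 absolute-correlation matrix (the feature names a, b, c and the values are made up). The k=1 argument of np.triu starts one step above the main diagonal, which is why the diagonal itself is left out.

```python
import numpy as np
import pandas as pd

# A hypothetical 3x3 absolute-correlation matrix for features a, b, c.
corr = pd.DataFrame(
    [[1.0, 0.2, 0.9],
     [0.2, 1.0, 0.4],
     [0.9, 0.4, 1.0]],
    index=list("abc"), columns=list("abc"),
)

# True only for cells strictly above the main diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
# where() keeps values where the mask is True and puts NaN elsewhere.
upper = corr.where(mask)

print(mask)
print(upper)
```

Each pair of features now appears exactly once, so a pair such as (a, c) is not counted twice when looking for values above the threshold.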

Step 5 - Dropping the column with a high correlation

So we are selecting the columns with an absolute correlation greater than 0.95 and making a list of those columns named 'to_drop'.

to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.95)]
print(); print(to_drop)

Now we are dropping the columns in the list 'to_drop' from the dataframe. Since 'to_drop' holds column labels, we pass it directly to drop().

df1 = df.drop(columns=to_drop)
print(); print(df1.head())
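
The steps so far can be wrapped into one reusable helper. This is a minimal sketch, not part of the original recipe: the function name drop_correlated_features, its threshold parameter, and the synthetic columns a, b, c are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_correlated_features(df, threshold=0.95):
    """Return (reduced_df, dropped_columns) using the upper-triangle rule."""
    corr = df.corr().abs()
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped when it correlates above the threshold with any
    # column that appears earlier in the dataframe.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic example: "b" is almost a copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
data = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

reduced, dropped = drop_correlated_features(data, threshold=0.95)
print(dropped)
print(reduced.columns.tolist())
```

Because the rule only looks above the diagonal, the later column of each correlated pair is the one removed, so column order affects which feature survives.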

Step 6 - Analysing the output

The output first shows the first five rows of the dataframe with its 4 columns. Next comes the correlation matrix, where all diagonal elements are 1 and the upper and lower triangles mirror each other. After that come the upper triangular matrix, the list of columns to drop, and finally the dataframe with the highly correlated column removed.

     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2

          0         1         2         3
0  1.000000  0.117570  0.871754  0.817941
1  0.117570  1.000000  0.428440  0.366126
2  0.871754  0.428440  1.000000  0.962865
3  0.817941  0.366126  0.962865  1.000000

    0        1         2         3
0 NaN  0.11757  0.871754  0.817941
1 NaN      NaN  0.428440  0.366126
2 NaN      NaN       NaN  0.962865
3 NaN      NaN       NaN       NaN

[3]

     0    1    2
0  5.1  3.5  1.4
1  4.9  3.0  1.4
2  4.7  3.2  1.3
3  4.6  3.1  1.5
4  5.0  3.6  1.4


How to Remove Correlated Features in Python?

Let us explore how to remove highly correlated variables in Python using various techniques.

Find highly correlated columns in Pandas

The first step is to find the highly correlated columns in your dataset. Pandas provides a handy method called corr() to calculate the correlation between columns in a dataframe. The resulting correlation matrix will have a value of 1 on the diagonal since each variable is perfectly correlated with itself.

import pandas as pd

# load dataset
df = pd.read_csv('dataset.csv')

# calculate correlation matrix
corr_matrix = df.corr().abs()

# print correlation matrix
print(corr_matrix)

Because we take the absolute value, the resulting matrix has values between 0 and 1, where 0 indicates no linear correlation and 1 indicates perfect positive or negative correlation. We can use corr_matrix to identify the highly correlated features.
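
Reading a large matrix by eye is tedious, so it can help to list the highly correlated pairs directly. The sketch below uses a synthetic stand-in for a real dataset; the column names and the 0.95 cut-off are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a real dataset (column names are hypothetical).
rng = np.random.default_rng(42)
height_cm = rng.normal(170, 10, size=300)
df_demo = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,          # exact linear rescaling
    "weight_kg": rng.normal(70, 12, size=300),
})

corr_demo = df_demo.corr().abs()
mask = np.triu(np.ones(corr_demo.shape, dtype=bool), k=1)

# stack() reshapes the matrix to one row per cell; dropna() removes the
# masked cells, leaving one row per unique feature pair.
pairs = corr_demo.where(mask).stack().dropna().sort_values(ascending=False)
high_pairs = pairs[pairs > 0.95]
print(high_pairs)
```

The result is a Series indexed by (feature, feature) pairs, which makes it easy to review each pair before deciding which column to drop.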

Remove collinear features in Python.

Once you've identified the highly correlated features, the next step is to remove them. There are several techniques for this. Let's explore some of them:

Dropping columns using Pandas

The simplest way to remove highly correlated features is to drop one feature from each highly correlated pair. We can do this using the Pandas drop() method.

import numpy as np

# get upper triangle of correlation matrix (diagonal excluded)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# drop highly correlated features
df.drop(to_drop, axis=1, inplace=True)
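
The upper-triangle rule always removes the later column of a pair, regardless of which one is more useful. One possible variant, sketched here under the assumption that a numeric target is available, drops the feature that is less correlated with the target. The function name drop_weaker_of_each_pair and the synthetic columns are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_weaker_of_each_pair(features, target, threshold=0.95):
    """For each highly correlated pair, drop the feature that is less
    correlated (in absolute value) with the target."""
    corr = features.corr().abs()
    relevance = features.corrwith(target).abs()
    cols = list(features.columns)
    dropped = set()
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            # Skip pairs in which one member has already been removed.
            if first in dropped or second in dropped:
                continue
            if corr.loc[first, second] > threshold:
                weaker = first if relevance[first] < relevance[second] else second
                dropped.add(weaker)
    kept = [c for c in cols if c not in dropped]
    return features[kept], sorted(dropped)


# Synthetic data: "noisy_copy" is a degraded version of "signal".
rng = np.random.default_rng(7)
signal = rng.normal(size=500)
features = pd.DataFrame({
    "noisy_copy": signal + rng.normal(scale=0.3, size=500),
    "signal": signal,
    "other": rng.normal(size=500),
})
target = pd.Series(signal + rng.normal(scale=0.1, size=500))

kept_df, dropped_cols = drop_weaker_of_each_pair(features, target, threshold=0.9)
print(dropped_cols)
print(kept_df.columns.tolist())
```

In this example the plain upper-triangle rule would have dropped "signal" because it comes second, whereas this variant keeps it and removes the noisier copy.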

Using the variance inflation factor (VIF)

The variance inflation factor (VIF) is another way to detect multicollinearity between features. The VIF measures how much the variance of an estimated regression coefficient is inflated because that feature can be predicted from the other features in the model. A common rule of thumb treats a VIF above 5 (or, more leniently, above 10) as a sign that the feature is highly collinear with the others.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X is assumed to be a DataFrame holding only the numeric feature columns
X = df.select_dtypes(include="number")

# calculate VIF for each feature
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

# print VIF values
print(vif)

In this example, we're using the variance_inflation_factor() function from the statsmodels package to calculate the VIF value for each feature. We then create a dataframe with the VIF values and feature names and print them.
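
To show what the number means, the VIF can also be computed by hand with NumPy as 1 / (1 - R^2), where R^2 comes from regressing one feature on all the others. The sketch below is an illustration rather than a replacement for statsmodels: the helper name vif_table and the synthetic columns are assumptions, and it always fits an intercept, so its values may differ from statsmodels when the input there has no constant column.

```python
import numpy as np
import pandas as pd


def vif_table(features):
    """Compute VIF_i = 1 / (1 - R_i^2) by regressing each column on the rest."""
    values = features.to_numpy(dtype=float)
    n_rows, n_cols = values.shape
    vifs = []
    for i in range(n_cols):
        target = values[:, i]
        others = np.delete(values, i, axis=1)
        # Add an intercept column so R^2 is measured around the mean.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return pd.Series(vifs, index=features.columns, name="VIF")


# Synthetic data: "b" is close to "a", "c" is unrelated.
rng = np.random.default_rng(3)
a = rng.normal(size=500)
frame = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

vif_values = vif_table(frame)
print(vif_values)
```

Both "a" and "b" receive a large VIF because each can be predicted from the other, while "c" stays close to 1. A typical workflow removes the feature with the highest VIF and recomputes until all values fall under the chosen limit.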

Using scikit-learn's SelectKBest method

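Scikit-learn provides a SelectKBest class that selects the k highest-scoring features according to a scoring function. Note that it scores each feature against the target on its own, so it ranks features by relevance rather than checking them for correlation with each other, and two highly correlated features can both be kept. It is therefore best used alongside a correlation filter rather than as a replacement for one.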

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

# X is assumed to be a DataFrame of features and y a numeric target

# define number of features to keep
k = 10

# perform feature selection (fit once and reuse the fitted selector)
selector = SelectKBest(f_regression, k=k)
X_new = selector.fit_transform(X, y)

# get feature names of selected features
selected_features = X.columns[selector.get_support()]

# print selected features
print(selected_features)

In this example, we use the f_regression scoring function to select the k highest-scoring features. We then use the get_support() method to get a boolean mask of the selected features and use this mask to get the feature names.
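
Because the scoring is done one feature at a time, it is worth checking how SelectKBest behaves when two features are near duplicates. The following sketch uses synthetic data with made-up column names and assumes scikit-learn is installed.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(1)
signal = rng.normal(size=300)
X_demo = pd.DataFrame({
    "signal": signal,
    "signal_copy": signal + rng.normal(scale=0.01, size=300),  # near duplicate
    "noise": rng.normal(size=300),
})
y_demo = 2.0 * signal + rng.normal(scale=0.5, size=300)

# Keep the two features with the highest univariate F-score.
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
chosen = X_demo.columns[selector.get_support()].tolist()
print(chosen)
```

Both "signal" and "signal_copy" are selected, since each scores highly against the target by itself. Applying a correlation filter first, and SelectKBest afterwards, avoids spending the k slots on redundant features.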


FAQs

How do you drop highly correlated features in Python?

You can drop highly collinear features in Python by using the correlation matrix to identify the pairs of features that are highly correlated and then drop one of the two features from each highly correlated pair. You can use the corr() function from the pandas library to calculate the correlation matrix and then the drop() function to remove the features with high correlation from the dataframe.

How do I remove highly correlated features?

To remove highly correlated features, you can use tools like a correlation matrix, a scatter plot matrix, or a heatmap to identify them. Then, you can drop one of the two features from each highly correlated pair using the drop() function in pandas.

How do I remove highly correlated features in pandas?

You can remove features with high correlation in pandas using the corr() function to calculate the correlation matrix, then drop the highly correlated features using the drop() function.

How to drop out highly correlated features in Python?

To drop highly correlated features in Python, you can use the correlation matrix to identify the pairs of features that are highly correlated and then drop one of the two features from each highly correlated pair using the drop() function.

How to find highly correlated variables in Python?

To find highly correlated variables in Python, you can use the correlation matrix to identify the highly correlated pairs.

