How to drop out highly correlated features in Python?

This recipe helps you drop out highly correlated features in Python


Recipe Objective

Many datasets contain features that are highly correlated, meaning they are roughly linearly dependent on other features. Such features contribute very little to predicting the output but increase the computational cost.

This data science Python source code does the following:
1. Calculates the correlation between different features.
2. Drops highly correlated features to avoid the curse of dimensionality.
3. Covers linear and non-linear correlation.

So we have to compute the correlation between the features and remove those whose correlation coefficient exceeds a chosen threshold.

This recipe is a short example of how to find the correlation between features and remove the highly correlated ones.
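The difference between linear and non-linear correlation mentioned in the objective can be illustrated with a small sketch. It uses synthetic data (an assumption for illustration, not the recipe's dataset) and the `method` argument of pandas' `DataFrame.corr`: Pearson measures linear dependence, while Spearman is rank-based and also picks up monotonic non-linear relationships.

```python
import numpy as np
import pandas as pd

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=200)
demo = pd.DataFrame({
    "x": x,
    "linear": 3.0 * x + 2.0,    # exact linear function of x
    "exponential": np.exp(x),   # monotonic but non-linear in x
})

# Pearson captures linear dependence only.
pearson = demo.corr(method="pearson").abs()
# Spearman works on ranks, so any monotonic relationship scores highly.
spearman = demo.corr(method="spearman").abs()

print(pearson.round(3))
print(spearman.round(3))
```

Under these assumptions, "x" and "exponential" score noticeably lower with Pearson than with Spearman, so which method you pass to `corr()` affects which features end up above the threshold.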

Step 1 - Import the library

    import pandas as pd
    import numpy as np
    from sklearn import datasets

We have imported numpy, pandas and datasets. We will use datasets to get the inbuilt iris dataset.

Step 2 - Setup the Data

Here we use datasets to load the inbuilt iris dataset and create objects X and y to store the data and the target values respectively. From the data in X we create a dataframe and print its first five rows.

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    df = pd.DataFrame(X)
    print(df.head())
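The dataframe above labels its columns 0 to 3. As an optional variation (not part of the original recipe), the feature names that scikit-learn ships with the iris dataset can be used as column labels, which makes the correlation matrix easier to read. The name `named_df` is hypothetical.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# iris.feature_names holds the four measurement names bundled with the dataset.
named_df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(named_df.head())
```

The rest of the recipe works the same way with named columns, as long as columns are later dropped by label.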

Step 3 - Creating the Correlation matrix and Selecting the Upper triangular matrix

Now we create a square matrix whose dimensions equal the number of features. Its elements are the absolute values of the correlation between each pair of features.

    cor_matrix = df.corr().abs()
    print(cor_matrix)

Note that the correlation matrix is symmetric about the diagonal and all of its diagonal elements are 1. So it does not matter whether we select the upper triangular or the lower triangular part of the correlation matrix, but we should not include the diagonal elements. Here we select the upper triangular part.

    upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
    print(upper_tri)
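To see what the upper triangular selection does, here is a minimal sketch of the boolean mask on its own, assuming 4 features as in the iris data. `np.triu(..., k=1)` keeps only the entries strictly above the diagonal. The built-in `bool` is used as the dtype because the `np.bool` alias was deprecated in NumPy 1.20 and removed in NumPy 1.24.

```python
import numpy as np

n_features = 4  # assumed: the iris data has four features

# True strictly above the diagonal, False on and below it.
mask = np.triu(np.ones((n_features, n_features), dtype=bool), k=1)
print(mask)

# Each pair of distinct features is covered exactly once.
n_pairs = int(mask.sum())
print(n_pairs)
```

Passing such a mask to `DataFrame.where` keeps the values where the mask is True and replaces everything else with NaN, so each feature pair is examined only once.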

Step 5 - Dropping the columns with high correlation

We select the columns whose absolute correlation with an earlier column is greater than 0.95 and collect them in a list named 'to_drop'.

    to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.95)]
    print(); print(to_drop)

Now we drop the columns listed in 'to_drop' from the dataframe.

    df1 = df.drop(columns=to_drop)
    print(); print(df1.head())
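The steps so far can be wrapped in a small helper. This is a sketch rather than part of the original recipe: the function name `drop_correlated`, its `threshold` parameter and the synthetic demo data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_correlated(frame, threshold=0.95):
    """Return (reduced_frame, dropped_columns) for a numeric DataFrame.

    Hypothetical helper: a column is dropped when its absolute correlation
    with any earlier column exceeds `threshold`.
    """
    corr = frame.corr().abs()
    # Keep only entries strictly above the diagonal.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    # Drop by label, so this works for named columns as well.
    return frame.drop(columns=to_drop), to_drop


# Synthetic demo data: "a_scaled" is an exact linear function of "a".
rng = np.random.default_rng(42)
a = rng.normal(size=100)
b = rng.normal(size=100)
demo = pd.DataFrame({"a": a, "b": b, "a_scaled": 2.0 * a + 1.0})

reduced, dropped = drop_correlated(demo)
print(dropped)
print(reduced.columns.tolist())
```

Because only the upper triangle is inspected, the later column of each highly correlated pair is the one removed; reordering the columns changes which member of the pair survives.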

Step 6 - Analysing the output

The output first shows the dataframe with 4 columns. Next comes the correlation matrix, in which all diagonal elements are 1 and the upper and lower triangles mirror each other. After that come the upper triangular matrix and the final dataframe with the highly correlated column removed.

     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2

          0         1         2         3
0  1.000000  0.117570  0.871754  0.817941
1  0.117570  1.000000  0.428440  0.366126
2  0.871754  0.428440  1.000000  0.962865
3  0.817941  0.366126  0.962865  1.000000

    0        1         2         3
0 NaN  0.11757  0.871754  0.817941
1 NaN      NaN  0.428440  0.366126
2 NaN      NaN       NaN  0.962865
3 NaN      NaN       NaN       NaN

[3]


     0    1    2
0  5.1  3.5  1.4
1  4.9  3.0  1.4
2  4.7  3.2  1.3
3  4.6  3.1  1.5
4  5.0  3.6  1.4
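As a sanity check on this output, the correlation matrix can be recomputed on the reduced dataframe to confirm that no remaining pair of features exceeds the threshold. The sketch below repeats the recipe's steps so that it runs on its own; the names `threshold`, `off_diag` and `max_remaining` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

threshold = 0.95
df = pd.DataFrame(datasets.load_iris().data)

# Same steps as in the recipe.
cor_matrix = df.corr().abs()
upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape, dtype=bool), k=1))
to_drop = [c for c in upper_tri.columns if (upper_tri[c] > threshold).any()]
df1 = df.drop(columns=to_drop)

# Recompute the correlations on the reduced data and ignore the diagonal.
remaining = df1.corr().abs()
off_diag = remaining.where(~np.eye(len(remaining), dtype=bool))
max_remaining = off_diag.max().max()

print(to_drop)
print(max_remaining)
```

The largest remaining off-diagonal value should match the strongest correlation among columns 0, 1 and 2 in the matrix printed above.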
