
# How to drop out highly correlated features in Python?

This recipe shows how to drop highly correlated features in Python.

## Recipe Objective

Many datasets contain features that are highly correlated, meaning they are approximately linearly dependent on other features. Such features contribute very little to predicting the output but increase the computational cost.

This data science Python source code does the following:
1. Calculates the correlation between different features.
2. Drops highly correlated features to mitigate the curse of dimensionality.
3. Linear and non-linear correlation.
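As a rough illustration of the difference between linear and non-linear correlation, the sketch below compares the Pearson coefficient (which pandas uses by default in `corr()`) with the rank-based Spearman coefficient. The toy data and the names `toy`, `x`, and `y` are assumptions made up for this example and are not part of the recipe.

```python
import numpy as np
import pandas as pd

# Toy data (assumed for illustration): y increases monotonically
# but non-linearly with x
x = np.arange(1, 11, dtype=float)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association; Spearman measures monotonic
# association by correlating the ranks of the values
pearson = toy.corr(method="pearson").abs()
spearman = toy.corr(method="spearman").abs()

print("Pearson :", pearson.loc["x", "y"])
print("Spearman:", spearman.loc["x", "y"])
```

On data like this the Spearman value is noticeably higher than the Pearson value, so the choice of `method` can change which features cross a given threshold.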

We therefore need to compute the correlation between the features and remove those whose correlation coefficient exceeds a chosen threshold.

This recipe is a short example of how to find the correlation between features and remove the highly correlated ones.

## Step 1 - Import the library

```
import pandas as pd
import numpy as np
from sklearn import datasets
```

We have imported numpy, pandas, and the datasets module from scikit-learn. We will use datasets to load the built-in iris dataset.

## Step 2 - Setup the Data

Here we use datasets to load the built-in iris dataset and create the objects X and y to store the features and the target values respectively. We then build a dataframe from X and print its first five rows.
```
iris = datasets.load_iris()
X = iris.data
y = iris.target
df = pd.DataFrame(X)
print(df.head())
```

## Step 3 - Creating the Correlation matrix and Selecting the Upper triangular matrix

Now we create a square matrix whose dimensions equal the number of features. Each element is the absolute value of the correlation between a pair of features.
```
cor_matrix = df.corr().abs()
print(cor_matrix)
```
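A small sketch of why the absolute value is useful, using made-up toy data (the names `toy`, `a`, and `b` are assumptions, not part of the recipe): a strong negative correlation signals just as much redundancy as a strong positive one, and `.abs()` lets a single threshold catch both.

```python
import pandas as pd

# Toy data (assumed for illustration): b is a perfect negative
# linear function of a
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [8.0, 6.0, 4.0, 2.0]})

signed = toy.corr()        # Pearson correlation, values in [-1, 1]
magnitude = signed.abs()   # strength of the relationship only

print(signed.loc["a", "b"])     # close to -1
print(magnitude.loc["a", "b"])  # close to 1
```

Without `.abs()`, a check such as `> 0.95` would miss the pair above even though one column fully determines the other.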

Note that the correlation matrix is symmetric about the diagonal and all of its diagonal elements are 1. So it does not matter whether we select the upper or the lower triangular part of the correlation matrix, but we should not include the diagonal elements. Here we select the upper triangular part.
```
# k=1 excludes the main diagonal; the boolean mask keeps only entries above it
upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
print(upper_tri)
```
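To make the masking step easier to follow, here is a minimal sketch on a hypothetical 3x3 matrix. The values and the labels `a`, `b`, and `c` are invented for illustration. It prints the boolean mask produced by `np.triu` and the result of applying it with `where`.

```python
import numpy as np
import pandas as pd

# Hypothetical absolute-correlation matrix, made up for illustration
corr = pd.DataFrame(
    [[1.0, 0.2, 0.9],
     [0.2, 1.0, 0.4],
     [0.9, 0.4, 1.0]],
    columns=list("abc"), index=list("abc"),
)

# k=1 starts one step above the main diagonal, so the diagonal
# and everything below it are False
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
print(mask)

# Entries where the mask is False become NaN
upper = corr.where(mask)
print(upper)
```

Each feature pair now appears exactly once, so a later threshold check does not flag both columns of a correlated pair.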

## Step 4 - Dropping the columns with high correlation

We select the columns that have an absolute correlation greater than 0.95 with an earlier column and collect them in a list named 'to_drop'.
```
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.95)]
print()
print(to_drop)
```
Now we drop the columns listed in 'to_drop' from the dataframe.
```
# Drop by column label, which also works when the columns are not numbered 0..n-1
df1 = df.drop(columns=to_drop)
print()
print(df1.head())
```

## Step 5 - Analysing the output

In the output, we first see the dataframe with 4 columns. Next comes the correlation matrix, in which all diagonal elements are 1 and the upper and lower triangular parts mirror each other. After that come the upper triangular matrix, the list of columns to drop, and the final dataframe with the highly correlated column removed.

```
     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2

          0         1         2         3
0  1.000000  0.117570  0.871754  0.817941
1  0.117570  1.000000  0.428440  0.366126
2  0.871754  0.428440  1.000000  0.962865
3  0.817941  0.366126  0.962865  1.000000

    0        1         2         3
0 NaN  0.11757  0.871754  0.817941
1 NaN      NaN  0.428440  0.366126
2 NaN      NaN       NaN  0.962865
3 NaN      NaN       NaN       NaN

[3]

     0    1    2
0  5.1  3.5  1.4
1  4.9  3.0  1.4
2  4.7  3.2  1.3
3  4.6  3.1  1.5
4  5.0  3.6  1.4
```
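The steps of this recipe could be wrapped in a small reusable helper. The sketch below is one possible way to do that and is not part of the original recipe: the function name `drop_correlated`, its `threshold` parameter, and the toy dataframe are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def drop_correlated(frame, threshold=0.95):
    """Return (reduced frame, list of dropped column labels).

    Hypothetical helper: drops every column whose absolute Pearson
    correlation with an earlier column exceeds `threshold`.
    """
    corr = frame.corr().abs()
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # NaN > threshold evaluates to False, so masked cells are ignored
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


# Toy frame (made up): "b" is an exact multiple of "a", "c" is only
# weakly related to "a"
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

reduced, dropped = drop_correlated(toy)
print(dropped)
print(list(reduced.columns))
```

Because the check runs over the upper triangle, the later column of each correlated pair is the one that gets dropped, so the result depends on column order.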
