How to deal with outliers in Python?

This recipe helps you deal with outliers in Python

Recipe Objective

In many datasets, some feature values are outliers: they are much larger or smaller than the rest of the data. Values that fall outside the valid range of a feature are also treated as outliers. Outliers can hurt a model's performance because they can strongly influence how the model is fit.

This data science Python source code does the following:
1. Imports the pandas and numpy libraries.
2. Creates a DataFrame using pandas.
3. Handles outliers by dropping them.
4. Handles outliers by marking them with a flag column.
5. Handles outliers by rescaling a feature.

So this is the recipe on how we can deal with outliers in Python.

Step 1 - Import the library

import numpy as np
import pandas as pd

We have imported numpy and pandas. These two modules will be required.

Step 2 - Creating DataFrame

We first create an empty DataFrame named farm, then add features and values to it. In the feature Rooms, the value 100 is clearly an outlier; the same row also has unusually large values for Price and Square_Feet.

farm = pd.DataFrame()
farm['Price'] = [632541, 425618, 356471, 7412512]
farm['Rooms'] = [2, 5, 3, 100]
farm['Square_Feet'] = [1600, 2850, 1780, 90000]
print(farm)
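As an alternative sketch, the same frame can be built in a single call by passing a dict of columns to `pd.DataFrame`, and `describe()` gives a quick way to compare each column's maximum with its median. This is a stylistic variation for illustration, not the recipe's original code.

```python
import pandas as pd

# Build the same example frame from a dict of column name -> values
farm = pd.DataFrame({
    'Price': [632541, 425618, 356471, 7412512],
    'Rooms': [2, 5, 3, 100],
    'Square_Feet': [1600, 2850, 1780, 90000],
})

# Summary statistics: a max far above the median hints at an outlier
summary = farm.describe()
print(summary.loc[['50%', 'max']])
```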

Method 1 - Dropping the outliers

There are various ways to deal with outliers. One of them is to drop the outliers by applying a condition to a feature.

h = farm[farm['Rooms'] < 20]
print(h)

Here we have applied a condition to the feature Rooms that selects only the rows whose value is less than 20.
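The threshold of 20 above is hard-coded. One common alternative, sketched here as an assumption rather than as part of the original recipe, is to derive the cut-offs from the data with the 1.5 × IQR rule. The names `lower`, `upper`, and `kept` are illustrative.

```python
import pandas as pd

farm = pd.DataFrame({
    'Price': [632541, 425618, 356471, 7412512],
    'Rooms': [2, 5, 3, 100],
    'Square_Feet': [1600, 2850, 1780, 90000],
})

# Assumption: treat values more than 1.5 * IQR beyond the quartiles as outliers
q1 = farm['Rooms'].quantile(0.25)
q3 = farm['Rooms'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows whose Rooms value lies within the two fences
kept = farm[farm['Rooms'].between(lower, upper)]
print(kept)
```

With only four rows the quartiles are rough estimates, so on a dataset this small the rule is best read as a demonstration of the mechanics.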

Method 2 - Marking the Outliers

We can also mark the outliers and then leave the marked rows out when training the model. Here we add a flag column that is 1 for an outlier and 0 otherwise, based on a condition.

farm['Outlier'] = np.where(farm['Rooms'] < 20, 0, 1)
print(farm)
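A minimal sketch of how the flag might be used afterwards: the full frame stays intact, and the unflagged rows are selected only at training time. The variable name `train` and the choice of feature columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

farm = pd.DataFrame({
    'Price': [632541, 425618, 356471, 7412512],
    'Rooms': [2, 5, 3, 100],
    'Square_Feet': [1600, 2850, 1780, 90000],
})

# Same rule as the recipe: Rooms >= 20 is flagged as an outlier (1), otherwise 0
farm['Outlier'] = np.where(farm['Rooms'] < 20, 0, 1)

# No row is deleted; pick the unflagged rows only when building the training set
train = farm.loc[farm['Outlier'] == 0, ['Price', 'Rooms', 'Square_Feet']]
print(train)
```

Unlike dropping, marking keeps the outlier rows available for later inspection or for models that can use the flag as a feature.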

Method 3 - Rescaling the data

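We cannot use the two methods above when we have few data points, because then we cannot afford to drop the outliers or exclude the marked rows. In that case we can rescale the feature so that the outliers can still be used. Taking the logarithm compresses large values much more than small ones.

farm['Log_Of_Square_Feet'] = [np.log(x) for x in farm['Square_Feet']]
print(farm)

The output of each step, in order (the original DataFrame, then Methods 1, 2 and 3), is: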

     Price  Rooms  Square_Feet
0   632541      2         1600
1   425618      5         2850
2   356471      3         1780
3  7412512    100        90000

    Price  Rooms  Square_Feet
0  632541      2         1600
1  425618      5         2850
2  356471      3         1780

     Price  Rooms  Square_Feet  Outlier
0   632541      2         1600        0
1   425618      5         2850        0
2   356471      3         1780        0
3  7412512    100        90000        1

     Price  Rooms  Square_Feet  Outlier  Log_Of_Square_Feet
0   632541      2         1600        0            7.377759
1   425618      5         2850        0            7.955074
2   356471      3         1780        0            7.484369
3  7412512    100        90000        1           11.407565
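The log rescaling in Method 3 can also be written without a list comprehension, since `np.log` applies element-wise to a whole column. A minimal sketch under the same example data:

```python
import numpy as np
import pandas as pd

farm = pd.DataFrame({
    'Price': [632541, 425618, 356471, 7412512],
    'Rooms': [2, 5, 3, 100],
    'Square_Feet': [1600, 2850, 1780, 90000],
})

# np.log is vectorized: it takes the natural log of every value in the column
farm['Log_Of_Square_Feet'] = np.log(farm['Square_Feet'])
print(farm)
```

The natural log is undefined for zero and negative values, so this only fits strictly positive features; when a feature can contain zeros, `np.log1p` is a commonly used alternative.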
