DATA MUNGING
DATA CLEANING PYTHON
MACHINE LEARNING RECIPES
PANDAS CHEATSHEET
ALL TAGS
# How to deal with outliers in Python?

# How to deal with outliers in Python?

This recipe helps you deal with outliers in Python

In many dataset we find that there are some values in features which are outliers that means they are very large or small as compared to rest of the data. Some values are also out of the range of the feature, so they are also considered as outliers. Outliers effects our model's efficiency because it influences the model very much.

This data science python source code does the following:

1. Imports pandas and numpy libraries.

2. Creates your own dataframe using pandas.

3.Outliers handling by dropping them.

4. Outliers handling using boolean marking.

5. Outliers handling using Rescalinf of features.

So this is the recipe on how we can deal with outliers in Python

```
import numpy as np
import pandas as pd
```

We have imported numpy and pandas. These two modules will be required.

We have first created an empty dataframe named farm then added features and values to it. We can clearly see that in feature Rooms the value 100 is an outlier.
```
farm = pd.DataFrame()
farm['Price'] = [632541, 425618, 356471, 7412512]
farm['Rooms'] = [2, 5, 3, 100]
farm['Square_Feet'] = [1600, 2850, 1780, 90000]
print(farm)
```

There are various ways to deal with outliers and one of them is to droping the outliers by appling some conditions on features.
```
h = farm[farm['Rooms'] < 20]
print(h)
```

Here we have applied the condition on feature room that to select only the values which are less than 20.

We can also mark the outliers and will not use that outliers in training the model. Here we are using bool to mark the outlier based on some condition.
```
farm['Outlier'] = np.where(farm['Rooms'] < 20, 0, 1)
print(farm)
```

We can not use upper two methods when we have less data points in that case we can not afford to drop or mark the outliers. Here we can rescale the data so that the outliers can be used.
```
farm['Log_Of_Square_Feet'] = [np.log(x) for x in farm['Square_Feet']]
print(farm)
```

So the final output of all the methods are

Price Rooms Square_Feet 0 632541 2 1600 1 425618 5 2850 2 356471 3 1780 3 7412512 100 90000 Price Rooms Square_Feet 0 632541 2 1600 1 425618 5 2850 2 356471 3 1780 Price Rooms Square_Feet Outlier 0 632541 2 1600 0 1 425618 5 2850 0 2 356471 3 1780 0 3 7412512 100 90000 1 Price Rooms Square_Feet Outlier Log_Of_Square_Feet 0 632541 2 1600 0 7.377759 1 425618 5 2850 0 7.955074 2 356471 3 1780 0 7.484369 3 7412512 100 90000 1 11.407565

**
Download Materials
**

PySpark Project-Get a handle on using Python with Spark through this hands-on data processing spark python tutorial.

The goal of this data science project is to build a predictive model and find out the sales of each product at a given Big Mart store.

Data Science Project in R-Predict the sales for each department using historical markdown data from the Walmart dataset containing data of 45 Walmart stores.

In this data science project, you will predict borrowers chance of defaulting on credit loans by building a credit score prediction model.

Machine Learning Project in R- Predict the customer churn of telecom sector and find out the key drivers that lead to churn. Learn how the logistic regression model using R can be used to identify the customer churn in telecom dataset.

This project analyzes a dataset containing ecommerce product reviews. The goal is to use machine learning models to perform sentiment analysis on product reviews and rank them based on relevance. Reviews play a key role in product recommendation systems.

In this R data science project, we will explore wine dataset to assess red wine quality. The objective of this data science project is to explore which chemical properties will influence the quality of red wines.

In this machine learning project, you will uncover the predictive value in an uncertain world by using various artificial intelligence, machine learning, advanced regression and feature transformation techniques.

Data Science Project in Python- Given his or her job role, predict employee access needs using amazon employee database.

In this data science project in R, we are going to talk about subjective segmentation which is a clustering technique to find out product bundles in sales data.