How to find outliers in Python?

This recipe helps you find outliers in Python

Recipe Objective

In most datasets a few values are outliers: data points that fall well outside the range of the rest of the data, being either unusually small or unusually large. They can badly distort a model, so we usually need to detect them and then remove or otherwise handle them.

This recipe shows how to find outliers in Python.

Step 1 - Import the library

from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

We have imported EllipticEnvelope and make_blobs, which are both needed below.

Step 2 - Setting up the Data

We create a synthetic dataset with make_blobs and will look for outliers in it.

X, _ = make_blobs(n_samples=100, n_features=20, centers=7, cluster_std=1.1, shuffle=True, random_state=42)
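To see what make_blobs actually returns, a quick check along the following lines may help. It is only a sketch: the names y and cluster_ids are introduced here for illustration (the recipe itself discards the labels with _).

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same call as in the recipe, but keeping the cluster labels in y
# (the recipe discards them with "_").
X, y = make_blobs(n_samples=100, n_features=20, centers=7,
                  cluster_std=1.1, shuffle=True, random_state=42)

# X holds one row per sample and one column per feature.
print(X.shape)  # (100, 20)

# y holds the index of the blob (cluster) each sample was drawn from.
cluster_ids = np.unique(y)
print(cluster_ids)
```

So the data is a 100 x 20 array drawn from 7 Gaussian blobs. It has no deliberately planted outliers; the detector in the next step simply flags the points that fit its estimated distribution least well.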

Step 3 - Removing Outliers

We fit an EllipticEnvelope with the contamination parameter, which sets the proportion of the data expected to be outliers (here 0.1, i.e. 10%). The predict method then labels each sample as 1 (inlier) or -1 (outlier); it does not itself remove anything from X.

outlier_detector = EllipticEnvelope(contamination=.1)
outlier_detector.fit(X)
print(X)
print(outlier_detector.predict(X))

The output is:

[[ 4.93252797  7.68541287 -3.97876821 ...  4.52684633 -3.24863123
   9.41974416]
 [-9.3234536   4.59276437 -4.39779468 ... -7.09597087  8.20227193
   2.26134033]
 [-8.7338198   3.08658417 -3.49905765 ... -6.82385124  8.775862
   1.38825176]
 ...
 [-2.83969517 -6.07980264  6.47763993 ... -9.36607752 -2.57352093
  -9.39410402]
 [-2.1671993  10.63717797  5.58330442 ...  0.50898027 -1.25365592
  -5.02572796]
 [ 7.21074034  9.28156979 -3.54240715 ...  3.89782083 -3.2259812
  11.03335594]]

[ 1 -1  1 -1  1  1 -1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1
  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1
  1  1 -1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1 -1  1  1  1  1  1  1
  1  1  1  1]
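The array of 1 and -1 labels only marks the outliers. To actually drop them, one option is to use the labels as a boolean mask, as in the sketch below. The names labels, inlier_mask, X_clean and X_outliers are illustrative, and random_state=0 is an assumption added here for reproducibility; the recipe above does not set it.

```python
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Same synthetic data as in the recipe.
X, _ = make_blobs(n_samples=100, n_features=20, centers=7,
                  cluster_std=1.1, shuffle=True, random_state=42)

# random_state is an assumption added here so repeated runs agree.
outlier_detector = EllipticEnvelope(contamination=0.1, random_state=0)

# fit_predict fits the detector and returns 1 (inlier) or -1 (outlier)
# for every row of X.
labels = outlier_detector.fit_predict(X)

# Boolean mask: True for rows the detector considers inliers.
inlier_mask = labels == 1

X_clean = X[inlier_mask]      # rows kept
X_outliers = X[~inlier_mask]  # rows flagged as outliers

print(X_clean.shape, X_outliers.shape)
```

With contamination=0.1 roughly 10% of the rows should end up in X_outliers. Whether dropping them is appropriate depends on the data; flagged rows are often worth inspecting before they are discarded.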
