How to Find Outliers in Python? Method and Examples

This tutorial explains how to detect outliers in Python, with step-by-step guidance and clear examples.

In data analysis, outliers are data points that significantly deviate from the rest of the data. These anomalies can distort statistical analyses if not properly handled, leading to misleading interpretations. Therefore, detecting outliers is crucial for ensuring the integrity and accuracy of your analyses. This tutorial will explore various methods to detect outliers in Python and provide illustrative examples.

How to Find Outliers in Python? 

Python offers several approaches to identifying outliers. Depending on the dataset and your requirements, you may use one technique or a combination of them. The most common ones are described below.

Visual inspection involves plotting the data and identifying any points that appear to be outliers. Standard plots include histograms, box plots, and scatter plots. Seaborn and Matplotlib are popular libraries for creating such visualizations.
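As a minimal sketch of visual inspection, the snippet below draws a histogram and a box plot with Matplotlib. The sample data, the injected extreme values, and the output file name are assumptions made for illustration only.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to use an interactive window
import matplotlib.pyplot as plt

# Hypothetical sample: 200 roughly normal values plus three injected extremes
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50, 5, 200), [95.0, 102.0, 4.0]])

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: outliers show up as isolated bars far from the bulk of the data
ax_hist.hist(values, bins=30)
ax_hist.set_title("Histogram")

# Box plot: points drawn beyond the whiskers are candidate outliers
box = ax_box.boxplot(values)
ax_box.set_title("Box plot")

# The plotted "flier" points can also be read back from the box plot artist
fliers = box["fliers"][0].get_ydata()
print(sorted(fliers))

fig.savefig("outlier_plots.png")  # hypothetical file name
```

The box plot uses Matplotlib's default whisker length of 1.5 times the interquartile range, so the points it marks match the IQR rule.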


The Z-score method identifies outliers by calculating how many standard deviations a data point is from the mean. Typically, a threshold of 3 standard deviations is used to identify outliers.
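The Z-score rule can be sketched with NumPy alone. The sample data and the two injected extreme values below are assumptions for illustration; the threshold of 3 comes from the description above.

```python
import numpy as np

# Hypothetical sample: 200 roughly normal values plus two injected extremes
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50, 5, 200), [95.0, 102.0]])

# Z-score: distance from the mean measured in standard deviations
z_scores = (values - values.mean()) / values.std()

# Flag points whose absolute z-score exceeds the threshold of 3
threshold = 3
outlier_mask = np.abs(z_scores) > threshold
outliers = values[outlier_mask]
print(outliers)
```

Note that extreme values inflate the mean and standard deviation themselves, so on small samples this rule can miss outliers that the IQR rule would catch.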


The IQR method defines outliers as data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where Q1 and Q3 are the first and third quartiles, respectively, and IQR is the interquartile range. 
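The IQR rule translates directly into a few lines of NumPy. The ten sample values below are made up for illustration.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value
values = np.array([12, 14, 14, 15, 16, 17, 18, 19, 21, 58], dtype=float)

# First and third quartiles (NumPy's default linear interpolation)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(q1, q3, iqr)    # 14.25 18.75 4.5
print(lower, upper)   # 7.5 25.5
print(outliers)       # [58.]
```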


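Isolation Forest is an unsupervised learning algorithm that isolates outliers by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Because outliers are few and lie far from the rest of the data, they tend to be isolated after fewer splits than ordinary points, so a short average path length across the trees indicates an anomaly.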
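A minimal sketch using scikit-learn's IsolationForest is shown below. The two-dimensional sample data, the three injected extreme points, and the parameter values (contamination=0.05, random_state=0) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 200 points around the origin plus three far-away points
rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(200, 2))
extremes = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X_demo = np.vstack([inliers, extremes])

# contamination is the assumed share of outliers in the data
forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)

# fit_predict returns 1 for inliers and -1 for outliers
labels = forest.fit_predict(X_demo)

# decision_function gives a score per sample; lower means more anomalous
scores = forest.decision_function(X_demo)

print(X_demo[labels == -1])
```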


Example to Check Outliers in Python 

Step 1 - Import the library

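    # EllipticEnvelope flags points that lie far from a robust Gaussian fit
    from sklearn.covariance import EllipticEnvelope
    # make_blobs generates synthetic clustered data
    from sklearn.datasets import make_blobs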

We have imported EllipticEnvelope, which will detect the outliers, and make_blobs, which will generate the sample data.

Step 2 - Setting up the Data

We create a synthetic dataset with make_blobs: 100 samples with 20 features each, grouped around 7 centers. We will look for outliers in this data.

    X, _ = make_blobs(n_samples=100,
                      n_features=20,
                      centers=7,
                      cluster_std=1.1,
                      shuffle=True,
                      random_state=42)

Step 3 - Removing Outliers

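We train EllipticEnvelope with the contamination parameter, which sets the proportion of the data expected to be outliers (here 0.1, i.e. 10%). Calling predict then returns a label for every sample: 1 for an inlier and -1 for an outlier. Note that predict does not return the cleaned data; the labels are what you use to filter the outliers out.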

    # Expect about 10% of the samples to be outliers
    outlier_detector = EllipticEnvelope(contamination=.1)
    outlier_detector.fit(X)
    print(X)
    # 1 marks an inlier, -1 marks an outlier
    print(outlier_detector.predict(X))

The output is the data matrix X, followed by the array of predicted labels:

[[ 4.93252797  7.68541287 -3.97876821 ...  4.52684633 -3.24863123
   9.41974416]
 [-9.3234536   4.59276437 -4.39779468 ... -7.09597087  8.20227193
   2.26134033]
 [-8.7338198   3.08658417 -3.49905765 ... -6.82385124  8.775862
   1.38825176]
 ...
 [-2.83969517 -6.07980264  6.47763993 ... -9.36607752 -2.57352093
  -9.39410402]
 [-2.1671993  10.63717797  5.58330442 ...  0.50898027 -1.25365592
  -5.02572796]
 [ 7.21074034  9.28156979 -3.54240715 ...  3.89782083 -3.2259812
  11.03335594]]

[ 1 -1  1 -1  1  1 -1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1
  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1
  1  1 -1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1 -1  1  1  1  1  1  1
  1  1  1  1]
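The label array only marks the outliers; it does not remove them. One way to drop the flagged rows is a boolean mask, sketched below. The snippet rebuilds the same dataset so that it runs on its own. The names X_clean and X_outliers are hypothetical, and random_state=0 on the detector is an assumption added for reproducibility, so the flagged rows may differ from the run above.

```python
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Same synthetic data as in Step 2
X, _ = make_blobs(n_samples=100,
                  n_features=20,
                  centers=7,
                  cluster_std=1.1,
                  shuffle=True,
                  random_state=42)

# random_state=0 is an assumption, not part of the original example
detector = EllipticEnvelope(contamination=0.1, random_state=0)

# fit_predict fits the model and returns 1 (inlier) or -1 (outlier) per row
labels = detector.fit_predict(X)

# Keep only the rows labelled as inliers
X_clean = X[labels == 1]
X_outliers = X[labels == -1]

print(X_clean.shape, X_outliers.shape)
```

With contamination=0.1, roughly 10% of the 100 rows end up in X_outliers and the rest in X_clean.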

 

Master Python Skills with ProjectPro! 

Mastering Python for data analysis involves understanding its syntax and libraries and gaining practical experience through real-world projects. Identifying outliers is a crucial aspect of data analysis, and Python offers various methods to accomplish this task efficiently. By applying the techniques discussed in this tutorial, you can detect outliers in your datasets and base your decisions on more reliable results. However, theoretical knowledge alone is insufficient; hands-on experience is essential for honing your skills. That's where ProjectPro comes in. With its repository of more than 270 projects in data science and big data, ProjectPro offers an opportunity to apply your Python skills in practical scenarios, further solidifying your understanding and proficiency. So, start your journey to mastering Python with ProjectPro.
