How to Impute Missing Values with Mean in Python?

This recipe helps you impute missing values with mean in Python.
Last Updated: 12 Apr 2023


Recipe Objective - How to Impute Missing Values with Mean in Python?
What is Missing Value Imputation?
Importance of Missing Value Imputation in Python
How to Impute Missing Values in Python?
How to Impute Missing Values in Python for Categorical Variables?
How to Impute Missing Values with Mean in Python - A Step-by-Step Guide
Alternative Methods for Imputing Missing Values in Python
Advance Your Data Science Career by Working on Industry-Grade Projects
FAQs

Recipe Objective - How to Impute Missing Values with Mean in Python?

Datasets often contain missing values in one or more features, which can hurt a model's performance. Deleting the rows with null values is not always feasible, because it also discards the valid data those rows hold in other features. Therefore, we often need to impute missing values with a suitable technique to maintain the integrity of the dataset. One standard method is to impute missing values with the mean in Python. However, before we dive into this technique, let us first gain a deeper understanding of missing value imputation.

What is Missing Value Imputation?

Missing value imputation is the process of replacing missing data with estimates. The estimates can be derived from the remaining data, external sources, or a combination of both. Mean imputation is a simple method in which missing values are replaced with the mean of the available data. It works best when values are missing completely at random and the feature's distribution is roughly symmetric. Otherwise the mean can be a poor stand-in, and filling many gaps with a single value shrinks the feature's variance.


Importance of Missing Value Imputation in Python

Missing value imputation is an essential step in data preprocessing because it helps ensure the accuracy and reliability of analytical results. Python provides several methods and libraries for missing value imputation, including mean, median, mode, and multiple imputation.

The importance of imputing missing values in Python can be summarized as follows:

Increased Accuracy: Missing data can lead to biased and inaccurate analytical results. By imputing missing values, we can reduce the bias in our data and obtain more accurate and reliable results.
Increased Data Availability: Sometimes, we need more data to analyze certain variables or samples. Imputing missing values can increase the amount of available data, which can help us to make more informed decisions.
Better Model Performance: Many machine learning algorithms require complete data to function correctly. By imputing missing values, we can improve the performance of these models and obtain more accurate predictions.
Simplified Data Analysis: Missing data can complicate data analysis by introducing additional complexities such as data manipulation and missing data handling. Imputing missing values can simplify the data analysis process and make it easier to draw meaningful insights from the data.

How to Impute Missing Values in Python?

Imputing missing values is an essential step in data preprocessing as it can affect the accuracy of our analysis and models. Here's a step-by-step guide on how to impute null values in Python:

Import Necessary Libraries: You'll need to import Python libraries like Pandas and NumPy to work with data frames and arrays.
Load Your Dataset: Load your dataset into a Pandas data frame.
Identify Missing Values: Use the Pandas isnull() function to identify missing values in the data frame. This will return a Boolean data frame indicating the location of missing values.
Determine the Imputation Strategy: Various strategies exist to impute missing values, such as mean, median, mode, or machine learning algorithms. Select the strategy that suits your data set and the purpose of analysis.
Impute Missing Values: Once you've identified the missing values and the imputation strategy, use the fillna() function in Pandas to fill in the missing values.
Check for Missing Values: After imputing, check again if any missing values are still present in the data frame.
Save the Dataset: Once all missing values have been imputed, save the data frame into a new file or overwrite the original file.
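
The steps above can be sketched end to end as follows. This is a minimal illustration, not a definitive workflow: the column names (age, income) and the in-memory CSV are hypothetical stand-ins for a real file, and mean imputation is simply the strategy chosen here.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real file on disk.
raw_csv = io.StringIO(
    "age,income\n"
    "25,50000\n"
    ",62000\n"
    "31,\n"
    "40,58000\n"
)

# Load the dataset; empty fields are parsed as NaN.
df = pd.read_csv(raw_csv)

# Identify missing values: number of NaN entries per column.
missing_before = df.isnull().sum()

# Impute each numeric column with its own mean.
df_imputed = df.fillna(df.mean(numeric_only=True))

# Check that no missing values remain.
missing_after = df_imputed.isnull().sum()

# Save the result (to an in-memory buffer here; pass a file path in practice).
out = io.StringIO()
df_imputed.to_csv(out, index=False)

print(missing_before)
print(missing_after)
```

Writing to a new DataFrame (df_imputed) instead of overwriting df keeps the original data available for comparing strategies later.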

How to Impute Missing Values in Python for Categorical Variables?

There are several ways to impute NaN values in Python for categorical variables:

Mode Imputation: You can replace the missing values in a categorical variable with that variable's mode (i.e., the most frequently occurring value). You can use the mode() function from the pandas library to do this.
Missing Category Imputation: You can create a new category representing the missing values in a categorical variable. You can use the fillna() function from the pandas library to do this.
Predictive Imputation: You can use machine learning algorithms to predict the missing values in a categorical variable based on the values of other variables in the dataset. This method is more complex but often yields better results than simple imputation methods. To do this, you can use libraries such as scikit-learn or XGBoost.
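
The first two options can be sketched with pandas alone. The column name (color) and the "Missing" label below are hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
df = pd.DataFrame({"color": ["red", "blue", np.nan, "red", np.nan, "green"]})

# Mode imputation: mode() returns a Series (ties are possible), so take the first entry.
most_common = df["color"].mode()[0]
df["color_mode"] = df["color"].fillna(most_common)

# Missing-category imputation: treat "Missing" as a category of its own.
df["color_flagged"] = df["color"].fillna("Missing")

print(df)
```

Mode imputation keeps the set of categories unchanged, while the missing-category approach preserves the information that a value was absent, which a downstream model may be able to use.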

How to Impute Missing Values with Mean in Python - A Step-by-Step Guide

Imputing missing values with the mean is a common technique used in data preprocessing, and it involves replacing the missing values with the mean value of the corresponding feature/column. This technique is useful when the number of missing values is small, and imputing the mean value will not significantly alter the distribution of the feature.

Here is a step-by-step guide on how we can impute missing values with the mean in Python:

Step 1 - Import the library

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

We have imported pandas, numpy, and SimpleImputer from sklearn.impute. The older Imputer class from sklearn.preprocessing was removed in scikit-learn 0.22, and SimpleImputer is its replacement.

Step 2 - Setting up the Data

We first create an empty DataFrame and then add columns C0 and C1 with their values. Three elements in column C1 are NaN.

df = pd.DataFrame()
df['C0'] = [0.2601, 0.2358, 0.1429, 0.1259, 0.7526, 0.7341, 0.4546, 0.1426, 0.1490, 0.2500]
df['C1'] = [0.7154, np.nan, 0.2615, 0.5846, np.nan, 0.8308, 0.4962, np.nan, 0.5340, 0.6731]
print(df)

Step 3 - Using SimpleImputer to fill the NaN values with the Mean

Column C1 has a few NaN values, so we fill them with the mean of the remaining values in that column. For this we use the SimpleImputer class. Let us first look at its parameters.

missing_values : The placeholder that marks a missing value. For pandas and NumPy data this is np.nan, which is also the default.
strategy : The imputation strategy to follow. It can be mean, median, most_frequent, or constant. By default it is mean.
fill_value : By default it is None. It is used only when strategy is constant, in which case it is the value that replaces every NaN.

SimpleImputer always imputes column by column, so it has no axis parameter (the older Imputer accepted axis=0 for columns and axis=1 for rows).

We create a SimpleImputer object with the desired parameters, fit it on our DataFrame, replace the NaN values with the column mean, and store the result in imputed_df. Note that transform returns a NumPy array rather than a DataFrame. Finally, we print the result.

miss_mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
miss_mean_imputer = miss_mean_imputer.fit(df)
imputed_df = miss_mean_imputer.transform(df)
print(imputed_df)

The output is given below: the original DataFrame first, then the imputed array, in which every NaN has been replaced by the mean of its column.

       C0      C1
0  0.2601  0.7154
1  0.2358     NaN
2  0.1429  0.2615
3  0.1259  0.5846
4  0.7526     NaN
5  0.7341  0.8308
6  0.4546  0.4962
7  0.1426     NaN
8  0.1490  0.5340
9  0.2500  0.6731

[[0.2601     0.7154    ]
 [0.2358     0.58508571]
 [0.1429     0.2615    ]
 [0.1259     0.5846    ]
 [0.7526     0.58508571]
 [0.7341     0.8308    ]
 [0.4546     0.4962    ]
 [0.1426     0.58508571]
 [0.149      0.534     ]
 [0.25       0.6731    ]]
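
Because the imputed result above is a plain NumPy array, the column names and index are lost. One way to keep them, sketched here under the assumption of a recent scikit-learn in which SimpleImputer lives in sklearn.impute, is to wrap the array back into a DataFrame. The variable name imputed_frame is an illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same data as in the recipe above.
df = pd.DataFrame({
    "C0": [0.2601, 0.2358, 0.1429, 0.1259, 0.7526,
           0.7341, 0.4546, 0.1426, 0.1490, 0.2500],
    "C1": [0.7154, np.nan, 0.2615, 0.5846, np.nan,
           0.8308, 0.4962, np.nan, 0.5340, 0.6731],
})

# fit_transform learns the column means and fills the NaN entries in one call.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed_array = imputer.fit_transform(df)

# Rebuild a DataFrame so the column labels and row index survive.
imputed_frame = pd.DataFrame(imputed_array, columns=df.columns, index=df.index)

print(imputed_frame)
```

The fitted imputer stores the learned means in its statistics_ attribute, so the same values can later be applied to new data with transform.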

Let us further understand with an example of how to impute missing values with the mean in Python using the Pandas library:

import pandas as pd
import numpy as np

# create a sample dataframe with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [6, np.nan, 8, 9, 10],
                   'C': [11, 12, 13, np.nan, 15]})
print(df)

# output
#      A     B     C
# 0  1.0   6.0  11.0
# 1  2.0   NaN  12.0
# 2  NaN   8.0  13.0
# 3  4.0   9.0   NaN
# 4  5.0  10.0  15.0

# impute missing values with the mean of each column
df.fillna(df.mean(), inplace=True)
print(df)

# output
#      A      B      C
# 0  1.0   6.00  11.00
# 1  2.0   8.25  12.00
# 2  3.0   8.00  13.00
# 3  4.0   9.00  12.75
# 4  5.0  10.00  15.00


Alternative Methods for Imputing Missing Values in Python

In addition to mean imputation, there are other methods for imputing missing values in Python, such as median imputation and mode imputation.

Impute Missing Values with Median in Python

Median imputation replaces missing values with the median value of the non-missing values in the same column. This method is less sensitive to outliers than mean imputation because a few extreme values have little effect on the median. Median imputation is useful when the dataset has skewed distributions or extreme values that could distort the mean.
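
A small sketch makes the difference concrete. The salary values below are hypothetical and include one deliberate outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical salary column with one extreme value and one missing value.
salary = pd.Series([30.0, 32.0, 35.0, 31.0, 500.0, np.nan])

mean_value = salary.mean()      # pulled upward by the 500.0 outlier
median_value = salary.median()  # stays close to the bulk of the data

mean_filled = salary.fillna(mean_value)
median_filled = salary.fillna(median_value)

print(mean_value, median_value)
```

Here the mean is 125.6 while the median is 32.0, so the median-filled value looks like a typical observation and the mean-filled value does not.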

Impute Missing Values with Mode in Python

Mode imputation replaces missing values with the mode (most frequently occurring value) of the non-missing values in the same column. This method is helpful for categorical variables, where the mode represents the most common category. Mode imputation is generally unsuitable for continuous variables, where values rarely repeat and the mode is therefore not a meaningful summary.

When choosing between these methods, it is essential to consider the variable type and the data distribution. Mean imputation is appropriate for variables with a normal distribution and no extreme values, while median imputation suits variables with skewed or extreme values. Mode imputation is appropriate for categorical variables with discrete values.
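
As a rough sketch of applying this guidance per column, fillna accepts a dictionary that maps each column to its own fill value. The column names and the assignment of strategies below are hypothetical; in practice the choice should follow from inspecting each distribution.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [170.0, 165.0, np.nan, 180.0],   # roughly symmetric: use the mean
    "income": [30.0, 32.0, np.nan, 500.0],     # skewed by an outlier: use the median
    "city": ["Pune", "Delhi", "Pune", np.nan], # categorical: use the mode
})

# One fill value per column, each computed with the strategy chosen for it.
fill_values = {
    "height": df["height"].mean(),
    "income": df["income"].median(),
    "city": df["city"].mode()[0],
}

df_filled = df.fillna(value=fill_values)
print(df_filled)
```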

Advance Your Data Science Career by Working on Industry-Grade Projects

Practical experience is essential to excel in the field of data science and land your dream job. One of the best ways to gain this experience is by working on industry-grade projects. These projects allow you to apply your knowledge to real-world problems and develop valuable skills that are highly sought after by employers.

If you're looking for a source of industry-grade projects to work on, look no further than ProjectPro. They offer a vast range of solved, end-to-end projects based on data science and big data, covering topics such as machine learning, data analysis, and data visualization.

By working on these projects, you'll gain practical experience and develop the skills and knowledge necessary to succeed in a data science career. So why wait? Explore the ProjectPro repository today and take your data science career to the next level!

FAQs

1. What is missing value imputation by mean?

Missing value imputation by mean is a technique used to fill in missing data in a dataset by replacing them with the mean value of the non-missing data in the same column.

2. How do you perform mean imputation in Python?

To perform mean imputation in Python, you can use the fillna() method of a Pandas DataFrame and pass the mean value of the column as an argument.

3. How do you impute missing values in a DataFrame in Python?

You can impute missing values in a Python DataFrame using various techniques such as mean, median, mode, or using predictive models. You can use the fillna() method of a Pandas DataFrame to perform imputation.

4. How do you replace missing values with constants in Python?

To replace missing values with constants in Python, you can use the fillna() method of a Pandas DataFrame and pass the constant value as an argument.
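
A minimal sketch of constant replacement, using hypothetical column names and fill values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "units": [3.0, np.nan, 5.0],
    "notes": ["ok", np.nan, "late"],
})

# A dictionary gives each column its own constant; a single scalar
# such as df.fillna(0) would apply the same value everywhere.
filled = df.fillna({"units": 0.0, "notes": "unknown"})
print(filled)
```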

5. How do you replace NaN values with the mean in Python?

To replace NaN values with the mean in Python, you can use the fillna() method of a Pandas DataFrame and pass the mean value of the column as an argument.
