How to Impute Missing Values with Mean in Python?

This recipe helps you impute missing values with mean in Python.

Recipe Objective - How to Impute Missing Values with Mean in Python? 

Datasets often contain missing values in one or more features, which can hurt a model's performance. Deleting every row that has a null value is not always feasible, because it also discards the valid data those rows hold in other features. Therefore, we often need to impute missing values with a suitable technique to maintain the integrity of the dataset. One standard method is to impute missing values with the mean in Python. However, before we dive into this technique, let us first gain a deeper understanding of missing value imputation.

What is Missing Value Imputation?

Missing value imputation is the process of replacing missing data with estimates. The estimates can be derived from the remaining data, external sources, or a combination of both. Mean imputation is a simple method of imputation where missing values are replaced with the mean of the available data. It is most defensible when the values are missing at random and the feature is roughly symmetric (for example, close to normally distributed), because the mean is sensitive to skew and outliers.

Importance of Missing Value Imputation in Python 

Missing value imputation is an essential step in data preprocessing because it helps keep analytical results accurate and reliable. Python provides several methods and libraries for missing value imputation, including mean, median, mode, and multiple imputation.

The importance of imputing missing values in Python can be summarized as follows:

  • Increased Accuracy: Missing data can lead to biased and inaccurate analytical results. By imputing missing values, we can reduce the bias in our data and obtain more accurate and reliable results.

  • Increased Data Availability: Dropping incomplete rows can leave too little data to analyze certain variables or samples. Imputing missing values keeps those rows available, which can help us to make more informed decisions.

  • Better Model Performance: Many machine learning algorithms require complete data to function correctly. By imputing missing values, we can improve the performance of these models and obtain more accurate predictions.

  • Simplified Data Analysis: Missing data can complicate data analysis by introducing additional complexities such as data manipulation and missing data handling. Imputing missing values can simplify the data analysis process and make it easier to draw meaningful insights from the data.

How to Impute Missing Values in Python? 

Imputing missing values is an essential step in data preprocessing as it can affect the accuracy of our analysis and models. Here's a step-by-step guide on how to impute null values in Python:

  1. Import Necessary Libraries: You'll need to import Python libraries like Pandas and NumPy to work with data frames and arrays.

  2. Load Your Dataset: Load your dataset into a Pandas data frame.

  3. Identify Missing Values: Use the Pandas isnull() function to identify missing values in the data frame. This will return a Boolean data frame indicating the location of missing values.

  4. Determine the Imputation Strategy: Various strategies exist to impute missing values, such as mean, median, mode, or machine learning algorithms. Select the strategy that suits your data set and the purpose of analysis.

  5. Impute Missing Values: Once you've identified the missing values and the imputation strategy, use the fillna() function in Pandas to fill in the missing values.

  6. Check for Missing Values: After imputing, check again if any missing values are still present in the data frame.

  7. Save the Dataset: Once all missing values have been imputed, save the data frame into a new file or overwrite the original file.
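The seven steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the column names `age` and `income` and their values are made up, the data frame is built in memory instead of being loaded from a file, and the result is written to an in-memory buffer rather than to disk.

```python
import io

import numpy as np
import pandas as pd

# Steps 1-2: build a small data frame (a stand-in for loading a real file)
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
})

# Step 3: count the missing values in each column
missing_before = df.isnull().sum()
print(missing_before)

# Steps 4-5: choose mean imputation and fill each column with its own mean
df_imputed = df.fillna(df.mean())

# Step 6: confirm that no missing values remain
missing_after = int(df_imputed.isnull().sum().sum())
print(missing_after)

# Step 7: save the result as CSV (an in-memory buffer here; pass a file path instead)
buffer = io.StringIO()
df_imputed.to_csv(buffer, index=False)
```

Calling `fillna()` without `inplace=True` returns a new data frame, so the original `df` stays available for comparison.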

How to Impute Missing Values in Python for Categorical Variables? 

There are several ways to impute NaN values in Python for categorical variables:

  • Mode Imputation: You can replace the missing values in a categorical variable with that variable's mode (i.e., the most frequently occurring value). You can use the mode() function from the pandas library to do this. 

  • Missing Category Imputation: You can create a new category representing the missing values in a categorical variable. You can use the fillna() function from the pandas library to do this. 

  • Predictive Imputation: You can use machine learning algorithms to predict the missing values in a categorical variable based on the values of other variables in the dataset. This method is more complex but often yields better results than simple imputation methods. To do this, you can use libraries such as scikit-learn or XGBoost. 
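The first two options above might look like this in pandas. The `color` series is a made-up example, and the label `"Missing"` is an arbitrary choice for the new category; predictive imputation is left out because it depends on the model and features at hand.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["red", "blue", np.nan, "red", np.nan], name="color")

# Mode imputation: mode() returns a Series (ties are possible), so take the first entry
most_common = colors.mode()[0]
mode_filled = colors.fillna(most_common)

# Missing-category imputation: give the missing entries their own label
category_filled = colors.fillna("Missing")

print(mode_filled.tolist())
print(category_filled.tolist())
```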

How to Impute Missing Values with Mean in Python - A Step-by-Step Guide 

Imputing missing values with the mean is a common technique used in data preprocessing, and it involves replacing the missing values with the mean value of the corresponding feature/column. This technique is useful when the number of missing values is small, and imputing the mean value will not significantly alter the distribution of the feature.
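A small sketch, using made-up numbers, of how mean imputation affects a feature's distribution: the mean is unchanged, but the spread shrinks because every imputed entry sits exactly at the mean. The more values are missing, the stronger this effect becomes.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, np.nan, 6.0, 8.0, np.nan])

# Replace the two missing entries with the mean of the four observed ones
filled = values.fillna(values.mean())

# The mean is preserved, since every imputed entry equals the observed mean
print(values.mean(), filled.mean())

# The standard deviation shrinks, since the imputed entries add no spread
print(values.std(), filled.std())
```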

Here is a step-by-step guide on how we can impute missing values with the mean in Python: 

Step 1 - Import the library

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

We have imported pandas, numpy and SimpleImputer from sklearn.impute. (The older Imputer class in sklearn.preprocessing was removed in scikit-learn 0.22, so SimpleImputer is used here.)

Step 2 - Setting up the Data

We first create an empty DataFrame and then add columns C0 and C1 with their values. Three elements of column C1 are NaN.

df = pd.DataFrame()
df['C0'] = [0.2601, 0.2358, 0.1429, 0.1259, 0.7526,
            0.7341, 0.4546, 0.1426, 0.1490, 0.2500]
df['C1'] = [0.7154, np.nan, 0.2615, 0.5846, np.nan,
            0.8308, 0.4962, np.nan, 0.5340, 0.6731]
print(df)

Step 3 - Using SimpleImputer to fill the NaN values with the Mean

Column C1 has a few NaN values, so we fill them with the mean of the remaining values of the column. For this we use SimpleImputer, so let us first look at its parameters.

  • missing_values : The placeholder that marks a missing entry. For pandas and NumPy data this is np.nan, which is also the default.
  • strategy : The imputation strategy to follow: mean, median, most_frequent or constant. By default it is mean.
  • fill_value : By default it is None. It is used only when the strategy is set to constant, in which case it is the value written into every NaN position.
  • axis : SimpleImputer has no axis parameter (the removed Imputer class had one); it always imputes column by column.

We create a SimpleImputer object with the desired parameters, fit it on our DataFrame, transform the NaN values to the column mean, and store the result in imputed_df. Then we print the result.

miss_mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
miss_mean_imputer = miss_mean_imputer.fit(df)
imputed_df = miss_mean_imputer.transform(df)
print(imputed_df)

The output is given below: first the original DataFrame, then the imputed NumPy array, in which every NaN value in C1 has been replaced by the mean of the column.

       C0      C1
0  0.2601  0.7154
1  0.2358     NaN
2  0.1429  0.2615
3  0.1259  0.5846
4  0.7526     NaN
5  0.7341  0.8308
6  0.4546  0.4962
7  0.1426     NaN
8  0.1490  0.5340
9  0.2500  0.6731

[[0.2601     0.7154    ]
 [0.2358     0.58508571]
 [0.1429     0.2615    ]
 [0.1259     0.5846    ]
 [0.7526     0.58508571]
 [0.7341     0.8308    ]
 [0.4546     0.4962    ]
 [0.1426     0.58508571]
 [0.149      0.534     ]
 [0.25       0.6731    ]]
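As the printed result above shows, the transform step returns a plain NumPy array, so the column names are lost. A minimal sketch of wrapping the result back into a DataFrame follows. It uses `SimpleImputer` from `sklearn.impute`, the class available in current scikit-learn releases, and only the first three rows of the data above to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "C0": [0.2601, 0.2358, 0.1429],
    "C1": [0.7154, np.nan, 0.2615],
})

# Fit on the data and replace each NaN with its column mean in one call
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed_array = imputer.fit_transform(df)

# Rebuild a DataFrame so the column names and index are kept
imputed_df = pd.DataFrame(imputed_array, columns=df.columns, index=df.index)
print(imputed_df)
```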

Let us further understand with an example of how to impute missing values with the mean in Python using the Pandas library:

import pandas as pd
import numpy as np

# create a sample dataframe with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [6, np.nan, 8, 9, 10],
                   'C': [11, 12, 13, np.nan, 15]})
print(df)

# output
#      A     B     C
# 0  1.0   6.0  11.0
# 1  2.0   NaN  12.0
# 2  NaN   8.0  13.0
# 3  4.0   9.0   NaN
# 4  5.0  10.0  15.0

# impute missing values with the mean of each column
df.fillna(df.mean(), inplace=True)
print(df)

# output
#      A      B      C
# 0  1.0   6.00  11.00
# 1  2.0   8.25  12.00
# 2  3.0   8.00  13.00
# 3  4.0   9.00  12.75
# 4  5.0  10.00  15.00

Alternative Methods for Imputing Missing Values in Python 

In addition to mean imputation, there are other methods for imputing missing values in Python, such as median imputation and mode imputation.

  • Impute Missing Values with Median in Python 

Median imputation replaces missing values with the median value of the non-missing values in the same column. This method is less sensitive to outliers than mean imputation because extreme values have little effect on the median. Median imputation is useful when the dataset has skewed distributions or extreme values that could distort the mean.

  • Impute Missing Values with Mode in Python

Mode imputation replaces missing values with the mode (most frequently occurring value) of the non-missing values in the same column. This method is helpful for categorical variables, where the mode represents the most common category. Mode imputation is usually inappropriate for continuous variables, because their values rarely repeat and the mode is therefore not a meaningful summary.

When choosing between these methods, it is essential to consider the variable type and the data distribution. Mean imputation is appropriate for variables with a normal distribution and no extreme values, while median imputation suits variables with skewed or extreme values. Mode imputation is appropriate for categorical variables with discrete values.
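The difference between these choices can be seen on a small example. The `salary` and `city` columns below are made up for illustration, with one deliberately extreme salary value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary": [30.0, 32.0, 35.0, np.nan, 500.0],  # 500 is an extreme value
    "city": ["Pune", "Delhi", "Pune", np.nan, "Pune"],
})

mean_fill = df["salary"].mean()      # pulled upward by the extreme value
median_fill = df["salary"].median()  # stays close to the typical values

df["salary_mean"] = df["salary"].fillna(mean_fill)
df["salary_median"] = df["salary"].fillna(median_fill)

# Mode imputation for the categorical column
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```

Here the mean-based fill is far above every typical salary, while the median-based fill falls between the typical values, which is why the median is the safer choice for skewed columns.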

FAQs 

Missing value imputation by mean is a technique used to fill in missing data in a dataset by replacing them with the mean value of the non-missing data in the same column.

To perform mean imputation in Python, you can use the fillna() method of a Pandas DataFrame and pass the mean value of the column as an argument.

You can impute missing values in a Python DataFrame using various techniques such as mean, median, mode, or using predictive models. You can use the fillna() method of a Pandas DataFrame to perform imputation.

To replace missing values with constants in Python, you can use the fillna() method of a Pandas DataFrame and pass the constant value as an argument.

To replace NaN values with mean in Python, you can use the fillna() method of a Pandas DataFrame and pass the mean value of the column as an argument.
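As a brief illustration of the answers above, the sketch below fills one column with its mean and another with a constant. The column names `score` and `visits` and the constant `0` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score": [10.0, np.nan, 30.0],
    "visits": [1.0, 2.0, np.nan],
})

# Mean imputation for a single column
df["score"] = df["score"].fillna(df["score"].mean())

# Constant imputation for another column
df["visits"] = df["visits"].fillna(0)

print(df)
```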

 
