How to impute missing values with means in Python?

How to impute missing values with means in Python?

How to impute missing values with means in Python?

This recipe helps you impute missing values with means in Python


Recipe Objective

Some times we find few missing values in various features in a dataset. Our model can not work efficiently on nun values and in few cases removing the rows having null values can not be considered as an option because it leads to loss of data of other features.

So this is the recipe on How we can impute missing values with means in Python

Step 1 - Import the library

import pandas as pd import numpy as np from sklearn.preprocessing import Imputer

We have imported pandas, numpy and Imputer from sklearn.preprocessing.

Step 2 - Setting up the Data

We have created a empty DataFrame first then made columns C0 and C1 with the values. Clearly we can see that in column C1 three elements are nun. df = pd.DataFrame() df['C0'] = [0.2601,0.2358,0.1429,0.1259,0.7526, 0.7341,0.4546,0.1426,0.1490,0.2500] df['C1'] = [0.7154,np.nan,0.2615,0.5846,np.nan, 0.8308,0.4962,np.nan,0.5340,0.6731] print(df)

Step 3 - Using Imputer to fill the nun values with the Mean

We know that we have few nun values in column C1 so we have to fill it with the mean of remaining values of the column. So for this we will be using Imputer function, so let us first look into the parameters.

  • missing_values : In this we have to place the missing values and in pandas it is 'NaN'.
  • strategy : In this we have to pass the strategy that we need to follow to impute in missing value it can be mean, median, most_frequent or constant. By default it is mean.
  • fill_value : By default it is set as none. It is used when the strategy is set to constant then we have to pass the value that we want to fill as a constant in all the nun places.
  • axis : In this we have to pass 0 for columns and 1 for rows.
So we have created an object and called Imputer with the desired parameters. Then we have fit our dataframe and transformed its nun values with the mean and stored it in imputed_df. Then we have printed the final dataframe. miss_mean_imputer = Imputer(missing_values='NaN', strategy='mean', axis=0) miss_mean_imputer = imputed_df = miss_mean_imputer.transform(df.values) print(imputed_df) Output as a dataset is given below, we can see that all the nun values have been filled by the mean of the columns.

       C0      C1
0  0.2601  0.7154
1  0.2358     NaN
2  0.1429  0.2615
3  0.1259  0.5846
4  0.7526     NaN
5  0.7341  0.8308
6  0.4546  0.4962
7  0.1426     NaN
8  0.1490  0.5340
9  0.2500  0.6731

[[0.2601     0.7154    ]
 [0.2358     0.58508571]
 [0.1429     0.2615    ]
 [0.1259     0.5846    ]
 [0.7526     0.58508571]
 [0.7341     0.8308    ]
 [0.4546     0.4962    ]
 [0.1426     0.58508571]
 [0.149      0.534     ]
 [0.25       0.6731    ]]

Relevant Projects

Resume parsing with Machine learning - NLP with Python OCR and Spacy
In this machine learning resume parser example we use the popular Spacy NLP python library for OCR and text classification.

Predict Employee Computer Access Needs in Python
Data Science Project in Python- Given his or her job role, predict employee access needs using amazon employee database.

Choosing the right Time Series Forecasting Methods
There are different time series forecasting methods to forecast stock price, demand etc. In this machine learning project, you will learn to determine which forecasting method to be used when and how to apply with time series forecasting example.

Data Science Project on Wine Quality Prediction in R
In this R data science project, we will explore wine dataset to assess red wine quality. The objective of this data science project is to explore which chemical properties will influence the quality of red wines.

Walmart Sales Forecasting Data Science Project
Data Science Project in R-Predict the sales for each department using historical markdown data from the Walmart dataset containing data of 45 Walmart stores.

German Credit Dataset Analysis to Classify Loan Applications
In this data science project, you will work with German credit dataset using classification techniques like Decision Tree, Neural Networks etc to classify loan applications using R.

Ensemble Machine Learning Project - All State Insurance Claims Severity Prediction
In this ensemble machine learning project, we will predict what kind of claims an insurance company will get. This is implemented in python using ensemble machine learning algorithms.

Music Recommendation System Project using Python and R
Machine Learning Project - Work with KKBOX's Music Recommendation System dataset to build the best music recommendation engine.

Predict Churn for a Telecom company using Logistic Regression
Machine Learning Project in R- Predict the customer churn of telecom sector and find out the key drivers that lead to churn. Learn how the logistic regression model using R can be used to identify the customer churn in telecom dataset.

Predict Macro Economic Trends using Kaggle Financial Dataset
In this machine learning project, you will uncover the predictive value in an uncertain world by using various artificial intelligence, machine learning, advanced regression and feature transformation techniques.