How to impute missing values with means in Python?

This recipe helps you impute missing values with means in Python

Recipe Objective

Sometimes a dataset has a few missing values in various features. Many models cannot work with null (NaN) values, and in some cases removing the rows that contain them is not an option, because it also discards the valid data those rows hold for the other features.

So this is the recipe on how we can impute missing values with means in Python.

Step 1 - Import the library

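import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer

We have imported pandas, numpy and Imputer from sklearn.preprocessing. Note that Imputer was removed in scikit-learn 0.22, so this import only works on older releases; newer releases provide sklearn.impute.SimpleImputer instead.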

Step 2 - Setting up the Data

We first create an empty DataFrame and then add the columns C0 and C1 with their values. Three of the elements in column C1 are NaN.

df = pd.DataFrame()
df['C0'] = [0.2601, 0.2358, 0.1429, 0.1259, 0.7526,
            0.7341, 0.4546, 0.1426, 0.1490, 0.2500]
df['C1'] = [0.7154, np.nan, 0.2615, 0.5846, np.nan,
            0.8308, 0.4962, np.nan, 0.5340, 0.6731]
print(df)
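Rather than reading the missing entries off the printout, they can also be counted. The following is a minimal sketch using pandas' isna(); the DataFrame is rebuilt so the snippet runs on its own, and the name missing_per_column is only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the recipe's DataFrame so this snippet is self-contained.
df = pd.DataFrame({
    'C0': [0.2601, 0.2358, 0.1429, 0.1259, 0.7526,
           0.7341, 0.4546, 0.1426, 0.1490, 0.2500],
    'C1': [0.7154, np.nan, 0.2615, 0.5846, np.nan,
           0.8308, 0.4962, np.nan, 0.5340, 0.6731],
})

# isna() marks each missing cell as True; summing counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Here C0 should report no missing values and C1 should report three.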

Step 3 - Using Imputer to fill the NaN values with the Mean

We know that column C1 has a few NaN values, so we fill them with the mean of the remaining values in that column. For this we use the Imputer class, so let us first look at its parameters.

  • missing_values : The placeholder that marks a missing value. For Imputer this is the string 'NaN' (the default), which matches the np.nan entries in a pandas DataFrame.
  • strategy : The strategy used to impute the missing values: mean, median or most_frequent. The default is mean. The constant strategy is only available in the newer sklearn.impute.SimpleImputer.
  • fill_value : Used only when the strategy is constant (SimpleImputer only); it is the value written into every missing position. The default is None.
  • axis : Pass 0 to impute along columns and 1 to impute along rows. This parameter exists only in Imputer; SimpleImputer always imputes column-wise.
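To see how the strategy and fill_value parameters change the result, here is a small sketch. It assumes a recent scikit-learn and therefore uses sklearn.impute.SimpleImputer rather than Imputer; the array X is made up purely for illustration and is not part of the recipe's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data: one missing value in each column.
X = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [9.0, 6.0],
              [np.nan, 8.0]])

# Replace each NaN with the mean of the observed values in its column.
mean_filled = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(X)

# Replace each NaN with the column median instead.
median_filled = SimpleImputer(strategy="median").fit_transform(X)

# Replace each NaN with a fixed constant supplied through fill_value.
constant_filled = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

print(mean_filled)
print(median_filled)
print(constant_filled)
```

In the first column the observed values are 1, 2 and 9, so the mean strategy fills in 4 while the median strategy fills in 2. This shows why the choice of strategy matters when a column is skewed.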
So we create an Imputer object with the desired parameters, fit it on our DataFrame, and then transform the DataFrame's values so that each NaN is replaced by the mean of its column. The result is stored in imputed_df and printed.

miss_mean_imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
miss_mean_imputer = miss_mean_imputer.fit(df)
imputed_df = miss_mean_imputer.transform(df.values)
print(imputed_df)

The output is given below: first the original DataFrame, then the imputed result, which is a NumPy array rather than a DataFrame. We can see that every NaN in C1 has been replaced by the column mean, 0.58508571.

       C0      C1
0  0.2601  0.7154
1  0.2358     NaN
2  0.1429  0.2615
3  0.1259  0.5846
4  0.7526     NaN
5  0.7341  0.8308
6  0.4546  0.4962
7  0.1426     NaN
8  0.1490  0.5340
9  0.2500  0.6731

[[0.2601     0.7154    ]
 [0.2358     0.58508571]
 [0.1429     0.2615    ]
 [0.1259     0.5846    ]
 [0.7526     0.58508571]
 [0.7341     0.8308    ]
 [0.4546     0.4962    ]
 [0.1426     0.58508571]
 [0.149      0.534     ]
 [0.25       0.6731    ]]
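On scikit-learn 0.22 and later, where Imputer is no longer available, the same result can be sketched with sklearn.impute.SimpleImputer. This is one possible equivalent, not the recipe's original code: the wrapping of the array back into a DataFrame and the cross-check against pandas' fillna are additions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Rebuild the recipe's DataFrame so this snippet is self-contained.
df = pd.DataFrame({
    'C0': [0.2601, 0.2358, 0.1429, 0.1259, 0.7526,
           0.7341, 0.4546, 0.1426, 0.1490, 0.2500],
    'C1': [0.7154, np.nan, 0.2615, 0.5846, np.nan,
           0.8308, 0.4962, np.nan, 0.5340, 0.6731],
})

# SimpleImputer has no axis argument; it always imputes column-wise.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_array = mean_imputer.fit_transform(df)

# fit_transform returns a NumPy array by default, so restore the column names.
imputed_df = pd.DataFrame(imputed_array, columns=df.columns, index=df.index)
print(imputed_df)

# Cross-check with plain pandas: fill each NaN with its column's mean.
pandas_filled = df.fillna(df.mean())
print(np.allclose(imputed_df.values, pandas_filled.values))
```

Both approaches should fill the three missing C1 entries with the mean of the seven observed values, about 0.58508571, matching the array printed above.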
