How to impute missing values in a dataframe?

This recipe helps you impute missing values in a dataframe


Recipe Objective

Missing values are one of the most common problems in any raw dataset. To build an accurate and unbiased machine learning model, we need to identify these missing values and then deal with them. This typically involves two steps:

  1. Identify the number of missing values in each column.
  2. Based on that number, decide whether to drop the column or replace its missing values with the column's mean, median, or another computed value.

In this recipe, we will demonstrate how to impute missing values (NA) in a dataframe.

STEP 1: Creating a DataFrame

Creating a STUDENT dataframe with student_id, Name and Marks as columns:

STUDENT = data.frame(student_id = c(1,2,3,4,5), Name = c("Ram","Shyam", "Jessica", "Nisarg", "Daniel"), Marks = c(55, 60, NA, 70, NA))
student_id	Name	Marks
1		Ram	55
2		Shyam	60
3		Jessica	NA
4		Nisarg	70
5		Daniel	NA
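
Before imputing, it can help to count the missing values in each column, which is the first step listed in the objective. The following is a minimal sketch using base R only; the variable names na_counts and na_share are illustrative and not part of the original recipe.

```r
# Recreate the STUDENT data frame from Step 1 so the snippet runs on its own
STUDENT <- data.frame(
  student_id = c(1, 2, 3, 4, 5),
  Name = c("Ram", "Shyam", "Jessica", "Nisarg", "Daniel"),
  Marks = c(55, 60, NA, 70, NA)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(STUDENT))
print(na_counts)

# Share of missing values per column (illustrative helper for the
# drop-or-impute decision; any cutoff you apply is your own choice)
na_share <- colMeans(is.na(STUDENT))
print(na_share)
```

Here only the Marks column has missing values (2 of 5 rows), so imputing it is a reasonable alternative to dropping the column.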

STEP 2: Imputing missing values with mean of the respective column

First, we use the is.na() function to check whether each cell contains a missing value. Then we compute the column mean with the mean() function, passing na.rm=TRUE so that missing values are ignored, and assign that mean wherever is.na() returns TRUE.

STUDENT$Marks[is.na(STUDENT$Marks)] <- mean(STUDENT$Marks, na.rm=TRUE)
STUDENT
student_id	Name	Marks
1		Ram	55.00000
2		Shyam	60.00000
3		Jessica	61.66667
4		Nisarg	70.00000
5		Daniel	61.66667
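
The objective also mentions the median as a possible replacement value. As a hedged variation on the step above (not part of the original recipe), the same indexing pattern works with median(); the names STUDENT_median and marks_median are illustrative.

```r
# Recreate the original STUDENT data frame (with NAs) so the snippet runs on its own
STUDENT <- data.frame(
  student_id = c(1, 2, 3, 4, 5),
  Name = c("Ram", "Shyam", "Jessica", "Nisarg", "Daniel"),
  Marks = c(55, 60, NA, 70, NA)
)

# Work on a copy so the original data frame stays unchanged
STUDENT_median <- STUDENT

# Median of the observed marks; na.rm = TRUE ignores the missing values
marks_median <- median(STUDENT_median$Marks, na.rm = TRUE)

# Replace only the cells where is.na() is TRUE
STUDENT_median$Marks[is.na(STUDENT_median$Marks)] <- marks_median
print(STUDENT_median)
```

With the observed marks 55, 60 and 70, the median is 60, so both missing cells are filled with 60. The median is less sensitive to outliers than the mean, which can matter when a column contains extreme values.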

