Missing values are one of the most common problems in any raw dataset. To build an accurate and unbiased machine learning model, we need to identify these missing values and then deal with them. There are several ways to do so.
In this recipe, we will demonstrate how to impute missing values (NA) in a dataframe by replacing them with the column mean.
Creating a STUDENT dataframe with student_id, Name and Marks as columns
STUDENT <- data.frame(student_id = c(1, 2, 3, 4, 5), Name = c("Ram", "Shyam", "Jessica", "Nisarg", "Daniel"), Marks = c(55, 60, NA, 70, NA))
student_id  Name     Marks
1           Ram      55
2           Shyam    60
3           Jessica  NA
4           Nisarg   70
5           Daniel   NA
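Before imputing, it can help to see where the missing values are and how many there are. The following is a minimal sketch using base R only; the variable names missing_mask and na_per_column are illustrative and not part of the original recipe.

```r
# Recreate the example data frame so this snippet runs on its own
STUDENT <- data.frame(
  student_id = c(1, 2, 3, 4, 5),
  Name = c("Ram", "Shyam", "Jessica", "Nisarg", "Daniel"),
  Marks = c(55, 60, NA, 70, NA)
)

# Logical vector: TRUE where Marks is missing (illustrative name)
missing_mask <- is.na(STUDENT$Marks)

# Number of missing values in each column (illustrative name)
na_per_column <- colSums(is.na(STUDENT))

print(missing_mask)
print(na_per_column)
```

Here only the Marks column contains missing values, in rows 3 and 5, so that is the only column that needs imputing.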
First, we will use the is.na() function to check whether each cell contains a missing value. Then we will use the mean() function, with na.rm=TRUE so that the NAs are ignored, to compute the mean of the observed marks, and assign that value wherever is.na() returns TRUE.
STUDENT$Marks[is.na(STUDENT$Marks)] <- mean(STUDENT$Marks, na.rm = TRUE)
STUDENT
student_id  Name     Marks
1           Ram      55.00000
2           Shyam    60.00000
3           Jessica  61.66667
4           Nisarg   70.00000
5           Daniel   61.66667
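The imputed value is the mean of the three observed marks, (55 + 60 + 70) / 3, which is about 61.67. If a dataframe has several numeric columns with missing values, the same step could be wrapped in a small helper. The sketch below is one possible way to do that; impute_mean is a hypothetical function name, not something defined in this recipe or in base R, and it assumes that mean imputation is appropriate for every numeric column.

```r
# Hypothetical helper: replace NAs in every numeric column with that column's mean
impute_mean <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      # Mean of the non-missing values in this column
      col_mean <- mean(df[[col]], na.rm = TRUE)
      # Overwrite only the missing entries
      df[[col]][is.na(df[[col]])] <- col_mean
    }
  }
  df
}

# Recreate the example data frame so this snippet runs on its own
STUDENT <- data.frame(
  student_id = c(1, 2, 3, 4, 5),
  Name = c("Ram", "Shyam", "Jessica", "Nisarg", "Daniel"),
  Marks = c(55, 60, NA, 70, NA)
)

STUDENT_imputed <- impute_mean(STUDENT)
print(STUDENT_imputed)
```

The helper returns a modified copy, so the original STUDENT dataframe keeps its NA values and can still be compared against the imputed version.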