How to deal with missing values in a Pandas DataFrame?

This recipe helps you deal with missing values in a Pandas DataFrame

Recipe Objective

It is very common for a dataset to contain missing values, and we cannot use those missing values in models. So how do we deal with them?

This recipe shows how we can deal with missing values in a Pandas DataFrame.

Step 1 - Import the library

import pandas as pd
import numpy as np

We have imported numpy and pandas which will be needed for the dataset.

Step 2 - Setting up the Data

We have created a dataframe with different features like "first_name", "last_name", "age", "Comedy_Score" and "Rating_Score".

raw_data = {"first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
            "last_name": ["Copper", "Koothrappali", "Hofstadter", "Wolowitz", "Fowler"],
            "age": [42, 38, np.nan, 41, 35],
            "Comedy_Score": [9, 7, np.nan, 8, 5],
            "Rating_Score": [25, 25, 49, np.nan, 70]}
df = pd.DataFrame(raw_data, columns = ["first_name", "last_name", "age", "Comedy_Score", "Rating_Score"])
print(df)
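Before choosing a strategy, it can help to see where the gaps are. The following is a minimal sketch that rebuilds the same data and counts the missing values per column; the names missing_per_column and rows_with_missing are only illustrative and are not part of the recipe.

```python
import numpy as np
import pandas as pd

raw_data = {
    "first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
    "last_name": ["Copper", "Koothrappali", "Hofstadter", "Wolowitz", "Fowler"],
    "age": [42, 38, np.nan, 41, 35],
    "Comedy_Score": [9, 7, np.nan, 8, 5],
    "Rating_Score": [25, 25, 49, np.nan, 70],
}
df = pd.DataFrame(raw_data)

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

Here age, Comedy_Score and Rating_Score each have one missing value, and the affected rows are those of Leonard and Howard.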

Step 3 - Dealing with missing values

Here we will be using different methods to deal with missing values.

    • Dropping observations (rows) that contain any missing value

df_no_missing = df.dropna()
print(df_no_missing)

    • Dropping rows where all cells in that row are NA

df_cleaned = df.dropna(how="all")
print(df_cleaned)

    • Filling each missing value backward with the next valid value in the same column

df3 = df.bfill()
print(df3)
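To make the direction of the fill concrete, here is a small sketch on two of the recipe's numeric columns, comparing bfill with its counterpart ffill. The frame name scores is an assumption made for this illustration; it is separate from the recipe's df.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "age": [42, 38, np.nan, 41, 35],
    "Rating_Score": [25, 25, 49, np.nan, 70],
})

# bfill: each gap takes the next valid value below it in the column
backward = scores.bfill()

# ffill: each gap takes the last valid value above it in the column
forward = scores.ffill()

print(backward)
print(forward)
```

With bfill the missing age becomes 41.0 and the missing Rating_Score becomes 70.0; with ffill they become 38.0 and 49.0. A gap in the last row would stay NaN under bfill, because there is no later value to copy from.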

    • Creating a new column full of missing values

df["location"] = np.nan
print(df)

    • Dropping columns if they contain only missing values

print(df.dropna(axis=1, how="all"))

    • Dropping rows that contain fewer than five non-missing values

print(df.dropna(thresh=5))
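The thresh argument counts non-missing values per row, which is easy to misread. The sketch below rebuilds the data including the empty location column and shows the per-row count next to the result; df_demo and non_missing are illustrative names, not part of the recipe.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
    "last_name": ["Copper", "Koothrappali", "Hofstadter", "Wolowitz", "Fowler"],
    "age": [42, 38, np.nan, 41, 35],
    "Comedy_Score": [9, 7, np.nan, 8, 5],
    "Rating_Score": [25, 25, 49, np.nan, 70],
})
df_demo["location"] = np.nan  # a column with no values at all

# How many non-missing values each row has
non_missing = df_demo.notna().sum(axis=1)
print(non_missing)

# Keep only rows with at least five non-missing values
kept = df_demo.dropna(thresh=5)
print(kept)
```

Rows 0, 1 and 4 each have five non-missing values and are kept; rows 2 and 3 have three and four, so they are dropped.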

    • Filling in missing data with zeros

print(df.fillna(0))

    • Filling in missing values in Comedy_Score with the mean value of Comedy_Score

df["Comedy_Score"] = df["Comedy_Score"].fillna(df["Comedy_Score"].mean())
print(df)
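The same idea can be applied to several columns at once. As a sketch, assuming we want every numeric column filled with its own mean, fillna can take a Series of per-column values; scores and column_means are illustrative names and the frame is separate from the recipe's df.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "age": [42, 38, np.nan, 41, 35],
    "Comedy_Score": [9, 7, np.nan, 8, 5],
    "Rating_Score": [25, 25, 49, np.nan, 70],
})

# Mean of each numeric column, computed while skipping NaN
column_means = scores.mean(numeric_only=True)

# fillna matches the Series index against the column names
filled = scores.fillna(column_means)
print(filled)
```

Each gap receives the mean of its own column: 39.0 for age, 7.25 for Comedy_Score and 42.25 for Rating_Score.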

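    • Filling in missing values in Comedy_Score with each age’s mean value of Comedy_Score

df["Comedy_Score"] = df["Comedy_Score"].fillna(df.groupby("age")["Comedy_Score"].transform("mean"))
print(df)

Because the previous step has already filled the only missing Comedy_Score, this step leaves the data unchanged here.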
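Group-wise filling is easier to see when several rows share a group. The sketch below uses a hypothetical team column, which does not exist in the recipe's data, to show the pattern on a separate frame named demo.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    # "team" is a hypothetical grouping column used only for this illustration
    "team": ["physics", "physics", "physics", "engineering", "engineering"],
    "Comedy_Score": [9.0, 7.0, np.nan, 4.0, np.nan],
})

# Mean Comedy_Score of each team, aligned row by row with the original frame
group_mean = demo.groupby("team")["Comedy_Score"].transform("mean")

# Fill only the missing entries with the mean of their own group
demo["Comedy_Score"] = demo["Comedy_Score"].fillna(group_mean)
print(demo)
```

The missing physics score becomes 8.0 (the mean of 9 and 7) and the missing engineering score becomes 4.0. Rows whose grouping key is itself missing get no group mean, so their gaps would remain NaN.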

    • Selecting the rows of df where age is not NaN and Rating_Score is not NaN

print(df[df["age"].notnull() & df["Rating_Score"].notnull()])
print(df[df["age"].notnull() & df["Rating_Score"].notnull()].fillna(0))
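The same selection can also be written with dropna and its subset argument. This is a sketch of that alternative on a rebuilt copy of the data, with mask_result and subset_result as illustrative names; it is not the recipe's own method.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
    "age": [42, 38, np.nan, 41, 35],
    "Comedy_Score": [9, 7, np.nan, 8, 5],
    "Rating_Score": [25, 25, 49, np.nan, 70],
})

# Boolean mask: keep rows where both columns are present
mask_result = df[df["age"].notna() & df["Rating_Score"].notna()]

# dropna restricted to the same two columns gives the same rows
subset_result = df.dropna(subset=["age", "Rating_Score"])

print(subset_result)
print(mask_result.equals(subset_result))
```

The mask form is more flexible when the condition mixes missing-value checks with other comparisons, while subset is shorter when the only goal is to require certain columns to be present.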

The output is shown below, in the order of the steps above (the result of the backward fill, df3, is not included):

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
2    Leonard    Hofstadter   NaN           NaN          49.0
3     Howard      Wolowitz  41.0           8.0           NaN
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
2    Leonard    Hofstadter   NaN           NaN          49.0
3     Howard      Wolowitz  41.0           8.0           NaN
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       NaN
1        Raj  Koothrappali  38.0           7.0          25.0       NaN
2    Leonard    Hofstadter   NaN           NaN          49.0       NaN
3     Howard      Wolowitz  41.0           8.0           NaN       NaN
4        Amy        Fowler  35.0           5.0          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
2    Leonard    Hofstadter   NaN           NaN          49.0
3     Howard      Wolowitz  41.0           8.0           NaN
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       NaN
1        Raj  Koothrappali  38.0           7.0          25.0       NaN
4        Amy        Fowler  35.0           5.0          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       0.0
1        Raj  Koothrappali  38.0           7.0          25.0       0.0
2    Leonard    Hofstadter   0.0           0.0          49.0       0.0
3     Howard      Wolowitz  41.0           8.0           0.0       0.0
4        Amy        Fowler  35.0           5.0          70.0       0.0

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0          9.00          25.0       NaN
1        Raj  Koothrappali  38.0          7.00          25.0       NaN
2    Leonard    Hofstadter   NaN          7.25          49.0       NaN
3     Howard      Wolowitz  41.0          8.00           NaN       NaN
4        Amy        Fowler  35.0          5.00          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0          9.00          25.0       NaN
1        Raj  Koothrappali  38.0          7.00          25.0       NaN
2    Leonard    Hofstadter   NaN          7.25          49.0       NaN
3     Howard      Wolowitz  41.0          8.00           NaN       NaN
4        Amy        Fowler  35.0          5.00          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       NaN
1        Raj  Koothrappali  38.0           7.0          25.0       NaN
4        Amy        Fowler  35.0           5.0          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       0.0
1        Raj  Koothrappali  38.0           7.0          25.0       0.0
4        Amy        Fowler  35.0           5.0          70.0       0.0
