How to deal with missing values in a Pandas DataFrame?

This recipe helps you deal with missing values in a Pandas DataFrame.

Recipe Objective

It is very common for a dataset to contain missing values, and we cannot pass those missing values directly to models. So how do we deal with missing values?

So this is the recipe on how we can deal with missing values in a Pandas DataFrame.

Step 1 - Import the library

import pandas as pd
import numpy as np

We have imported numpy and pandas which will be needed for the dataset.

Step 2 - Setting up the Data

We have created a dataframe with different features like "first_name", "last_name", "age", "Comedy_Score" and "Rating_Score".

raw_data = {"first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
            "last_name": ["Copper", "Koothrappali", "Hofstadter", "Wolowitz", "Fowler"],
            "age": [42, 38, np.nan, 41, 35],
            "Comedy_Score": [9, 7, np.nan, 8, 5],
            "Rating_Score": [25, 25, 49, np.nan, 70]}
df = pd.DataFrame(raw_data, columns=["first_name", "last_name", "age", "Comedy_Score", "Rating_Score"])
print(df)
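Before choosing a method, it can help to see where the gaps are. The following is a minimal sketch, assuming the same data as above; the variable names missing_per_column and rows_with_missing are illustrative and not part of the recipe.

```python
import numpy as np
import pandas as pd

# Same data as the recipe, rebuilt here so the sketch runs on its own.
df = pd.DataFrame({
    "first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
    "last_name": ["Copper", "Koothrappali", "Hofstadter", "Wolowitz", "Fowler"],
    "age": [42, 38, np.nan, 41, 35],
    "Comedy_Score": [9, 7, np.nan, 8, 5],
    "Rating_Score": [25, 25, 49, np.nan, 70],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Select the rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

Here each of "age", "Comedy_Score" and "Rating_Score" has one missing value, and the affected rows are those with index 2 and 3.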

Step 3 - Dealing with missing values

Here we will be using different methods to deal with missing values.

    • Dropping rows that contain any missing value

df_no_missing = df.dropna()
print(df_no_missing)

    • Dropping rows where all cells in that row are NA

df_cleaned = df.dropna(how="all")
print(df_cleaned)
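The recipe's data has no row that is entirely empty, so how="all" leaves it unchanged. To make the difference from the default visible, here is a small sketch on a hypothetical frame (the frame and its values are assumptions made up for illustration) in which one row is partly missing and one row is fully missing.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is partly missing, row 2 is entirely missing.
frame = pd.DataFrame({
    "age": [42, np.nan, np.nan],
    "Rating_Score": [25, 49, np.nan],
})

# Default how="any": a row is dropped if it has at least one missing value.
dropped_any = frame.dropna()

# how="all": a row is dropped only if every value in it is missing.
dropped_all = frame.dropna(how="all")

print(dropped_any)
print(dropped_all)
```

With the default, only row 0 survives; with how="all", rows 0 and 1 survive because row 1 still holds one value.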

    • Filling each missing value with the next valid value in its column (backward fill)

df3 = df.bfill()
print(df3)
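Backward fill has a forward counterpart, ffill. The sketch below compares the two on the numeric columns of the recipe's data; the frame name scores and the two-row Series at the end are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Two numeric columns from the recipe's data, each with one gap.
scores = pd.DataFrame({
    "age": [42, 38, np.nan, 41, 35],
    "Rating_Score": [25, 25, 49, np.nan, 70],
})

# bfill: each gap takes the next valid value below it in the same column.
backfilled = scores.bfill()

# ffill: each gap takes the last valid value above it in the same column.
forwardfilled = scores.ffill()

# A gap in the last position has nothing below it, so bfill leaves it missing.
tail_gap = pd.Series([1.0, np.nan]).bfill()

print(backfilled)
print(forwardfilled)
print(tail_gap)
```

Which direction is appropriate depends on the data. Both assume that neighbouring rows are meaningful substitutes, which is usually only true for ordered data such as time series.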

    • Creating a new column full of missing values

df["location"] = np.nan
print(df)

    • Dropping columns if they contain only missing values

print(df.dropna(axis=1, how="all"))

    • Dropping rows that contain fewer than five non-missing values

print(df.dropna(thresh=5))
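The thresh argument is the minimum number of non-missing values a row needs in order to be kept. The sketch below, which assumes the recipe's data with the all-NaN "location" column already added, counts the non-missing values per row and checks that an explicit filter on that count gives the same result. The names non_missing_counts, kept and same are illustrative.

```python
import numpy as np
import pandas as pd

# The recipe's data, including the all-NaN "location" column.
df = pd.DataFrame({
    "first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
    "last_name": ["Copper", "Koothrappali", "Hofstadter", "Wolowitz", "Fowler"],
    "age": [42, 38, np.nan, 41, 35],
    "Comedy_Score": [9, 7, np.nan, 8, 5],
    "Rating_Score": [25, 25, 49, np.nan, 70],
})
df["location"] = np.nan

# Number of non-missing values in each row.
non_missing_counts = df.notna().sum(axis=1)
print(non_missing_counts)

# Keep rows with at least five non-missing values.
kept = df.dropna(thresh=5)

# The same selection written as an explicit boolean filter.
same = df[non_missing_counts >= 5]
print(kept.equals(same))
```

Rows 2 and 3 have three and four non-missing values respectively, so they are the ones removed.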

    • Filling in missing data with zeros

print(df.fillna(0))

    • Filling in missing values in Comedy_Score with the mean value of Comedy_Score

df["Comedy_Score"] = df["Comedy_Score"].fillna(df["Comedy_Score"].mean())
print(df)
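When several columns need different fill values, fillna also accepts a dictionary that maps column names to values. The following is a minimal sketch; filling "age" with its median is an assumption chosen for illustration and is not a step in the recipe.

```python
import numpy as np
import pandas as pd

# Numeric columns from the recipe's data.
df = pd.DataFrame({
    "age": [42, 38, np.nan, 41, 35],
    "Comedy_Score": [9, 7, np.nan, 8, 5],
    "Rating_Score": [25, 25, 49, np.nan, 70],
})

# One fill value per column; columns not listed are left untouched.
fill_values = {
    "age": df["age"].median(),                    # illustrative choice
    "Comedy_Score": df["Comedy_Score"].mean(),
}

# fillna returns a new DataFrame; the original df is not modified.
filled = df.fillna(fill_values)
print(filled)
```

Here the missing age becomes 39.5 and the missing Comedy_Score becomes 7.25, while the gap in "Rating_Score" stays because that column is not in the dictionary.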

    • Filling in missing values in Comedy_Score with each age’s mean value of Comedy_Score

df["Comedy_Score"] = df["Comedy_Score"].fillna(df.groupby("age")["Comedy_Score"].transform("mean"))
print(df)
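In the recipe's data every known age is unique, and the one row that lacks a Comedy_Score also lacks an age, so there is no group mean to fill from. A group-wise fill is easier to see when several rows share a key. The sketch below uses a hypothetical "team" column and made-up scores purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "team" is an illustrative grouping column, not part of the recipe.
scores = pd.DataFrame({
    "team": ["a", "a", "a", "b", "b"],
    "Comedy_Score": [9.0, 7.0, np.nan, 4.0, np.nan],
})

# Mean of Comedy_Score within each team, aligned to the original rows.
group_means = scores.groupby("team")["Comedy_Score"].transform("mean")

# Fill each gap with the mean of its own team.
scores["Comedy_Score"] = scores["Comedy_Score"].fillna(group_means)
print(scores)
```

The gap in team "a" is filled with 8.0 (the mean of 9 and 7) and the gap in team "b" with 4.0.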

    • Selecting the rows of df where age is not NaN and Rating_Score is not NaN

print(df[df["age"].notnull() & df["Rating_Score"].notnull()])
print(df[df["age"].notnull() & df["Rating_Score"].notnull()].fillna(0))
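The same selection can be written with the subset argument of dropna, which restricts the missing-value check to the listed columns. This is a small sketch on the recipe's data; the names by_mask and by_subset are illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the recipe, rebuilt here so the sketch runs on its own.
df = pd.DataFrame({
    "first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
    "last_name": ["Copper", "Koothrappali", "Hofstadter", "Wolowitz", "Fowler"],
    "age": [42, 38, np.nan, 41, 35],
    "Comedy_Score": [9, 7, np.nan, 8, 5],
    "Rating_Score": [25, 25, 49, np.nan, 70],
})

# Boolean mask: keep rows where both columns are present.
by_mask = df[df["age"].notnull() & df["Rating_Score"].notnull()]

# dropna with subset: drop rows missing a value in either listed column.
by_subset = df.dropna(subset=["age", "Rating_Score"])

print(by_mask.equals(by_subset))
```

Both forms keep the rows with index 0, 1 and 4; the subset form tends to be shorter when many columns are involved.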

The print statements above produce the following output, in order (the output of print(df3) is not included):

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
2    Leonard    Hofstadter   NaN           NaN          49.0
3     Howard      Wolowitz  41.0           8.0           NaN
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
2    Leonard    Hofstadter   NaN           NaN          49.0
3     Howard      Wolowitz  41.0           8.0           NaN
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       NaN
1        Raj  Koothrappali  38.0           7.0          25.0       NaN
2    Leonard    Hofstadter   NaN           NaN          49.0       NaN
3     Howard      Wolowitz  41.0           8.0           NaN       NaN
4        Amy        Fowler  35.0           5.0          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
2    Leonard    Hofstadter   NaN           NaN          49.0
3     Howard      Wolowitz  41.0           8.0           NaN
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       NaN
1        Raj  Koothrappali  38.0           7.0          25.0       NaN
4        Amy        Fowler  35.0           5.0          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       0.0
1        Raj  Koothrappali  38.0           7.0          25.0       0.0
2    Leonard    Hofstadter   0.0           0.0          49.0       0.0
3     Howard      Wolowitz  41.0           8.0           0.0       0.0
4        Amy        Fowler  35.0           5.0          70.0       0.0

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0          9.00          25.0       NaN
1        Raj  Koothrappali  38.0          7.00          25.0       NaN
2    Leonard    Hofstadter   NaN          7.25          49.0       NaN
3     Howard      Wolowitz  41.0          8.00           NaN       NaN
4        Amy        Fowler  35.0          5.00          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0          9.00          25.0       NaN
1        Raj  Koothrappali  38.0          7.00          25.0       NaN
2    Leonard    Hofstadter   NaN          7.25          49.0       NaN
3     Howard      Wolowitz  41.0          8.00           NaN       NaN
4        Amy        Fowler  35.0          5.00          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       NaN
1        Raj  Koothrappali  38.0           7.0          25.0       NaN
4        Amy        Fowler  35.0           5.0          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       0.0
1        Raj  Koothrappali  38.0           7.0          25.0       0.0
4        Amy        Fowler  35.0           5.0          70.0       0.0
