Recipe: How to deal with missing values in a Pandas DataFrame?
DATA MUNGING PYTHON PANDAS DATAFRAME PANDAS CHEATSHEET PANDAS DATAFRAME TUTORIAL

How to deal with missing values in a Pandas DataFrame?

This recipe helps you deal with missing values in a Pandas DataFrame
In [1]:
## How to deal with missing values in a Pandas DataFrame
def Kickstarter_Example_94():
    print()
    print(format('How to deal with missing values in a Pandas DataFrame','*^82'))

    import warnings
    warnings.filterwarnings("ignore")

    # load libraries
    import pandas as pd
    import numpy as np

    # Create dataframe with missing values
    raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
                'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
                'age': [42, np.nan, 36, 24, 73],
                'sex': ['m', np.nan, 'f', 'm', 'f'],
                'preTestScore': [4, np.nan, np.nan, 2, 3],
                'postTestScore': [25, np.nan, np.nan, 62, 70]}
    df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex',
                                           'preTestScore', 'postTestScore'])
    print(); print(df)

    # Drop missing observations
    df_no_missing = df.dropna()
    print(); print(df_no_missing)

    # Drop rows where all cells in that row is NA
    df_cleaned = df.dropna(how='all')
    print(); print(df_cleaned)

    # Create a new column full of missing values
    df['location'] = np.nan
    print(); print(df)

    # Drop column if they only contain missing values
    print(); print(df.dropna(axis=1, how='all'))

    # Drop rows that contain less than five observations
    # This is really mostly useful for time series
    print(); print(df.dropna(thresh=5))

    # Fill in missing data with zeros
    print(); print(df.fillna(0))

    # Fill in missing in preTestScore with the mean value of preTestScore
    # inplace=True means that the changes are saved to the df right away
    df["preTestScore"].fillna(df["preTestScore"].mean(), inplace=True)
    print(); print(df)

    # Fill in missing in postTestScore with each sex’s mean value of postTestScore
    df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"), inplace=True)
    print(); print(df)

    # Select the rows of df where age is not NaN and sex is not NaN
    print(); print(df[df['age'].notnull() & df['sex'].notnull()])
    print(); print(df[df['age'].notnull() & df['sex'].notnull()].fillna(0))

Kickstarter_Example_94()
**************How to deal with missing values in a Pandas DataFrame***************

  first_name last_name   age  sex  preTestScore  postTestScore
0      Jason    Miller  42.0    m           4.0           25.0
1        NaN       NaN   NaN  NaN           NaN            NaN
2       Tina       Ali  36.0    f           NaN            NaN
3       Jake    Milner  24.0    m           2.0           62.0
4        Amy     Cooze  73.0    f           3.0           70.0

  first_name last_name   age sex  preTestScore  postTestScore
0      Jason    Miller  42.0   m           4.0           25.0
3       Jake    Milner  24.0   m           2.0           62.0
4        Amy     Cooze  73.0   f           3.0           70.0

  first_name last_name   age sex  preTestScore  postTestScore
0      Jason    Miller  42.0   m           4.0           25.0
2       Tina       Ali  36.0   f           NaN            NaN
3       Jake    Milner  24.0   m           2.0           62.0
4        Amy     Cooze  73.0   f           3.0           70.0

  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       NaN
1        NaN       NaN   NaN  NaN           NaN            NaN       NaN
2       Tina       Ali  36.0    f           NaN            NaN       NaN
3       Jake    Milner  24.0    m           2.0           62.0       NaN
4        Amy     Cooze  73.0    f           3.0           70.0       NaN

  first_name last_name   age  sex  preTestScore  postTestScore
0      Jason    Miller  42.0    m           4.0           25.0
1        NaN       NaN   NaN  NaN           NaN            NaN
2       Tina       Ali  36.0    f           NaN            NaN
3       Jake    Milner  24.0    m           2.0           62.0
4        Amy     Cooze  73.0    f           3.0           70.0

  first_name last_name   age sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0   m           4.0           25.0       NaN
3       Jake    Milner  24.0   m           2.0           62.0       NaN
4        Amy     Cooze  73.0   f           3.0           70.0       NaN

  first_name last_name   age sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0   m           4.0           25.0       0.0
1          0         0   0.0   0           0.0            0.0       0.0
2       Tina       Ali  36.0   f           0.0            0.0       0.0
3       Jake    Milner  24.0   m           2.0           62.0       0.0
4        Amy     Cooze  73.0   f           3.0           70.0       0.0

  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       NaN
1        NaN       NaN   NaN  NaN           3.0            NaN       NaN
2       Tina       Ali  36.0    f           3.0            NaN       NaN
3       Jake    Milner  24.0    m           2.0           62.0       NaN
4        Amy     Cooze  73.0    f           3.0           70.0       NaN

  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       NaN
1        NaN       NaN   NaN  NaN           3.0            NaN       NaN
2       Tina       Ali  36.0    f           3.0           70.0       NaN
3       Jake    Milner  24.0    m           2.0           62.0       NaN
4        Amy     Cooze  73.0    f           3.0           70.0       NaN

  first_name last_name   age sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0   m           4.0           25.0       NaN
2       Tina       Ali  36.0   f           3.0           70.0       NaN
3       Jake    Milner  24.0   m           2.0           62.0       NaN
4        Amy     Cooze  73.0   f           3.0           70.0       NaN

  first_name last_name   age sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0   m           4.0           25.0       0.0
2       Tina       Ali  36.0   f           3.0           70.0       0.0
3       Jake    Milner  24.0   m           2.0           62.0       0.0
4        Amy     Cooze  73.0   f           3.0           70.0       0.0


Stuck at work?
Can't find the recipe you are looking for. Let us know and we will find an expert to create the recipe for you. Click here
Companies using this Recipe
1 developer from Capgemini
1 developer from HexaWare
1 developer from Infosys
1 developer from LTI
1 developer from Renovite Technologies
1 developer from Sysco
1 developer from Altimetrik
1 developer from Coursera
1 developer from HvH
1 developer from ISO Associates