DATA MUNGING PYTHON PANDAS DATAFRAME PANDAS CHEATSHEET PANDAS DATAFRAME TUTORIAL

How to deal with missing values in a Pandas DataFrame?

This recipe helps you deal with missing values in a Pandas DataFrame
In [1]:
## How to deal with missing values in a Pandas DataFrame
def Kickstarter_Example_94():
    print()
    print(format('How to deal with missing values in a Pandas DataFrame','*^82'))

    import warnings
    warnings.filterwarnings("ignore")

    # load libraries
    import pandas as pd
    import numpy as np

    # Create dataframe with missing values
    raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
                'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
                'age': [42, np.nan, 36, 24, 73],
                'sex': ['m', np.nan, 'f', 'm', 'f'],
                'preTestScore': [4, np.nan, np.nan, 2, 3],
                'postTestScore': [25, np.nan, np.nan, 62, 70]}
    df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex',
                                           'preTestScore', 'postTestScore'])
    print(); print(df)

    # Drop missing observations
    df_no_missing = df.dropna()
    print(); print(df_no_missing)

    # Drop rows where all cells in that row is NA
    df_cleaned = df.dropna(how='all')
    print(); print(df_cleaned)

    # Create a new column full of missing values
    df['location'] = np.nan
    print(); print(df)

    # Drop column if they only contain missing values
    print(); print(df.dropna(axis=1, how='all'))

    # Drop rows that contain less than five observations
    # This is really mostly useful for time series
    print(); print(df.dropna(thresh=5))

    # Fill in missing data with zeros
    print(); print(df.fillna(0))

    # Fill in missing in preTestScore with the mean value of preTestScore
    # inplace=True means that the changes are saved to the df right away
    df["preTestScore"].fillna(df["preTestScore"].mean(), inplace=True)
    print(); print(df)

    # Fill in missing in postTestScore with each sex’s mean value of postTestScore
    df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"), inplace=True)
    print(); print(df)

    # Select the rows of df where age is not NaN and sex is not NaN
    print(); print(df[df['age'].notnull() & df['sex'].notnull()])
    print(); print(df[df['age'].notnull() & df['sex'].notnull()].fillna(0))

Kickstarter_Example_94()
**************How to deal with missing values in a Pandas DataFrame***************

  first_name last_name   age  sex  preTestScore  postTestScore
0      Jason    Miller  42.0    m           4.0           25.0
1        NaN       NaN   NaN  NaN           NaN            NaN
2       Tina       Ali  36.0    f           NaN            NaN
3       Jake    Milner  24.0    m           2.0           62.0
4        Amy     Cooze  73.0    f           3.0           70.0

  first_name last_name   age sex  preTestScore  postTestScore
0      Jason    Miller  42.0   m           4.0           25.0
3       Jake    Milner  24.0   m           2.0           62.0
4        Amy     Cooze  73.0   f           3.0           70.0

  first_name last_name   age sex  preTestScore  postTestScore
0      Jason    Miller  42.0   m           4.0           25.0
2       Tina       Ali  36.0   f           NaN            NaN
3       Jake    Milner  24.0   m           2.0           62.0
4        Amy     Cooze  73.0   f           3.0           70.0

  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       NaN
1        NaN       NaN   NaN  NaN           NaN            NaN       NaN
2       Tina       Ali  36.0    f           NaN            NaN       NaN
3       Jake    Milner  24.0    m           2.0           62.0       NaN
4        Amy     Cooze  73.0    f           3.0           70.0       NaN

  first_name last_name   age  sex  preTestScore  postTestScore
0      Jason    Miller  42.0    m           4.0           25.0
1        NaN       NaN   NaN  NaN           NaN            NaN
2       Tina       Ali  36.0    f           NaN            NaN
3       Jake    Milner  24.0    m           2.0           62.0
4        Amy     Cooze  73.0    f           3.0           70.0

  first_name last_name   age sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0   m           4.0           25.0       NaN
3       Jake    Milner  24.0   m           2.0           62.0       NaN
4        Amy     Cooze  73.0   f           3.0           70.0       NaN

  first_name last_name   age sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0   m           4.0           25.0       0.0
1          0         0   0.0   0           0.0            0.0       0.0
2       Tina       Ali  36.0   f           0.0            0.0       0.0
3       Jake    Milner  24.0   m           2.0           62.0       0.0
4        Amy     Cooze  73.0   f           3.0           70.0       0.0

  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       NaN
1        NaN       NaN   NaN  NaN           3.0            NaN       NaN
2       Tina       Ali  36.0    f           3.0            NaN       NaN
3       Jake    Milner  24.0    m           2.0           62.0       NaN
4        Amy     Cooze  73.0    f           3.0           70.0       NaN

  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       NaN
1        NaN       NaN   NaN  NaN           3.0            NaN       NaN
2       Tina       Ali  36.0    f           3.0           70.0       NaN
3       Jake    Milner  24.0    m           2.0           62.0       NaN
4        Amy     Cooze  73.0    f           3.0           70.0       NaN

  first_name last_name   age sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0   m           4.0           25.0       NaN
2       Tina       Ali  36.0   f           3.0           70.0       NaN
3       Jake    Milner  24.0   m           2.0           62.0       NaN
4        Amy     Cooze  73.0   f           3.0           70.0       NaN

  first_name last_name   age sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0   m           4.0           25.0       0.0
2       Tina       Ali  36.0   f           3.0           70.0       0.0
3       Jake    Milner  24.0   m           2.0           62.0       0.0
4        Amy     Cooze  73.0   f           3.0           70.0       0.0


Companies using this Recipe
1 employee of S&P Global
1 employee of ICU Medical
1 employee of Scotiabank
1 employee of Altimetrik
1 employee of KPMG
1 employee of YASH Technologies
1 employee of ANAC
1 employee of LTI
1 employee of Ericsson
1 employee of Renovite Technologies