Data Cleaning in Python

Over 70% of the work you will do as a Data Scientist on any Data Science or Statistics project is cleaning your data and manipulating it to make it ready for modelling and analysis. 


What is Data Cleaning ?
At the start of a data science project, you will inherit multiple data-sets from different teams. You will then be asked to solve for a specific business problem. Your solution may not need all the data you got - you might have to remove columns, modify columns, remove duplicate values, deal with missing values, deal with outlier data etc. Sometimes you will also need to normalize or scale data to make the data fit within a range. 

This critical time consuming step is data cleaning or data cleansing. As part of the data cleansing process you will also perform EDA (Exploratory Data Analysis) - here you will visualise the data using graphs and statistical functions to understand the underlying data - mean, median, range, distribution etc. 

This Wikipedia explanation gives a good overview of Data Cleaning from a more generic perspective. 



Why is Data Cleaning Important ?
Though most courses, blogs, online materials focus on the modelling aspect of data science, it is the data cleaning step that determines how easy your modelling is going to be. The better your data is, the less complex your learning algorithms need to be. Better structured data that provides the right input values will also determine the accuracy of your predictions. 

Data cleaning impacts efficiency of rest of your data modeling and decision making process. Hence professional data scientists treat this step as critical as the algorithm building step. Efficiency gains of data cleaning results from:

1) Lesser processing time 

2) More accurate predictions

3) Simpler algorithms needed

4) Better learning ability of the model

This Harvard Business Review article conveys the criticality of data cleaning. It basically says that “If Your Data Is Bad, Your Machine Learning Tools Are Useless”. 



Data Cleaning Tutorial Steps
The following data cleaning tutorial will walk you through the steps in data cleaning with detailed examples and reusable code snippets.


1. Randomly sample a pandas dataframe
The first step in data cleaning is to quickly get an idea of what is inside your dataset. Randomly picking a few rows to view will help you achieve that. 

print(df.take(np.random.permutation(len(df))[:2]))

this command uses 3 functions df.take (), np.random.permutation() and len() to print 2 randomly selected rows from the dataframe df(). 

len() just measures the length of the dataframe which serves as an input to np.random.permutation(). The value 2 indicates that any 2 rows can be picked up at random by the function. 

df.take() selects rows from the dataframe. If you did not want to randomly pick rows, then you can just pass the number of rows as a parameter to df.take().  

Example tutorial:

Check out this data cleaning tutorial on how to randomly sample a pandas dataframe



2. Drop row and columns in a pandas dataframe
Deleting rows and columns that don’t fit your analysis is a critical part of data modelling. The df.drop()functions is used to drop rows or columns in a pandas dataframe.  

df.drop('Salary', axis=1)
will drop a column named “salary”. Axis=1 indicates that we are referring to a column and not a row. 

You can also drop columns based on conditions

df.drop[df.name != 'Fia']
will drop a row where the value of ‘name’ is not ‘Fia’

Example Tutorial:
Check out data cleaning tutorial to see an example of how to drop row and columns in a pandas datafame



3. How to Join and Merge Pandas dataframe
In pretty much all your data science projects, your data will be present across multiple files. Sometimes you will have columns from different files to aggregate.

Python provides a very powerful library pandas that has functions to join, concatenate and merge multiple files. The following are possible with a pandas dataframe - join dataframes along rows, join dataframes along columns, merge dataframes along a column, merge with outer join, merge with inner join, merge with right join, merge with left join, merge while adding a suffx to duplicate column names, merge based on indexes etc. 

df_new = pd.concat([df_a, df_b])
merges 2 dataframes df_a and df_b along the rows

df = pd.concat([df_a, df_b], axis=1)
merges 2 dataframes df_a and df_b along the columns

The pd.concat() basically just appends one dataframe to the other either along the rows or the columns.   

The difference between a merge and a join is that a merge simply combines dataframes and includes all the rows while a join combines dataframes based on a matching key value - for example you might want to combine dataframe rows based on a employee id match. By defaulty all joins are also inner joins. An outer join will return everything an inner join returns along with all unmatched rows in the dataframe. 

Example Tutorial:
Check out data cleaning tutorial to see an example of how to Join and Merge a pandas datafame



4. How to deal with missing values in data cleaning 
The data you inherit for analysis will come from multiple sources and would have been pulled adhoc. So this data will not be immediately ready for you to run any kind of model on. One of the most common issues you will have to deal with is missing values in the dataset. There are many reasons why values might be missing - intentional, user did not fill up, online forms broken, accidentally deleted, legacy issues etc. 

Either way you will need to fix this problem. There are 3 ways to do this - either you will ignore the missing values, delete the missing value rows or fill the missing values with an approximation.

Its easiest to just drop the missing observations but you need to very careful before you do that, because the absence of a value might actually be conveying some information about the data pattern. If you decide to drop missing values - 

df_no_missing = df.dropna()
will drop any rows with any value missing. Even if some values are available in a row it will still get dropped even if a single value is missing. 

df_cleaned = df.dropna(how='all')
will only drop rows where all cells are NA or missing values. 

To drop columns, you will have to add the ‘axis=1’ parameter to the above functions. 

Example Tutorial:
Check out data cleaning tutorial to see an example of how to deal with missing values in data cleaning.

 

Reference:
The following references were used to compose this tutorial:
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html