Data Cleaning in Python

By many estimates, over 70% of the work you will do as a data scientist on any data science or statistics project is cleaning your data and manipulating it to make it ready for modelling and analysis.

What is Data Cleaning?
At the start of a data science project, you will inherit multiple datasets from different teams and be asked to solve a specific business problem. Your solution may not need all the data you received: you might have to remove columns, modify columns, remove duplicate values, deal with missing values, deal with outlier data etc. Sometimes you will also need to normalize or scale data so that it fits within a range.

This critical, time-consuming step is called data cleaning or data cleansing. As part of the data cleansing process you will also perform EDA (Exploratory Data Analysis), where you visualise the data using graphs and statistical functions to understand the underlying data: mean, median, range, distribution etc.

This Wikipedia explanation gives a good overview of Data Cleaning from a more generic perspective. 

Why is Data Cleaning Important?
Though most courses, blogs and online materials focus on the modelling aspect of data science, it is the data cleaning step that determines how easy your modelling is going to be. The better your data is, the less complex your learning algorithms need to be. Well-structured data that provides the right input values will also improve the accuracy of your predictions.

Data cleaning affects the efficiency of the rest of your data modelling and decision-making process, so professional data scientists treat this step as being as critical as the algorithm-building step. The efficiency gains from data cleaning include:

1) Less processing time

2) More accurate predictions

3) Simpler algorithms

4) Better learning ability of the model

This Harvard Business Review article conveys the criticality of data cleaning; its title says it all: “If Your Data Is Bad, Your Machine Learning Tools Are Useless”.

Data Cleaning Tutorial Steps
The following data cleaning tutorial will walk you through the steps in data cleaning with detailed examples and reusable code snippets.

1. Randomly sample a pandas dataframe
The first step in data cleaning is to quickly get an idea of what is inside your dataset. Randomly picking a few rows to view will help you achieve that. 


df.take(np.random.permutation(len(df))[:2])
This command uses three functions, df.take(), np.random.permutation() and len(), to print 2 randomly selected rows from the dataframe df.

len() just measures the length of the dataframe, which serves as the input to np.random.permutation(). The value 2 indicates that any 2 rows can be picked at random by the function.

df.take() selects rows from the dataframe by position. If you do not want to pick rows randomly, you can instead pass a list of row positions directly to df.take().
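As a minimal runnable sketch of this step (the dataframe contents here are made up for illustration):

```python
import numpy as np
import pandas as pd

# A small illustrative dataframe with hypothetical values
df = pd.DataFrame({'name': ['Ana', 'Ben', 'Cara', 'Dev'],
                   'salary': [50000, 62000, 58000, 71000]})

# Shuffle all row positions, keep the first 2, and take those rows
sample = df.take(np.random.permutation(len(df))[:2])
print(sample)
```

In recent pandas versions, df.sample(n=2) achieves the same result in a single call.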

Example tutorial:

Check out this data cleaning tutorial on how to randomly sample a pandas dataframe

2. Drop rows and columns in a pandas dataframe
Deleting rows and columns that don’t fit your analysis is a critical part of data modelling. The df.drop() function is used to drop rows or columns in a pandas dataframe.

df.drop('Salary', axis=1)
will drop a column named ‘Salary’. The axis=1 parameter indicates that we are referring to a column and not a row.

You can also drop rows based on conditions:

df.drop(df[df['name'] != 'Fia'].index)
will drop every row where the value of ‘name’ is not ‘Fia’.
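A short runnable sketch of both kinds of drop, using a hypothetical employee dataframe; the conditional row drop is expressed through the index of a boolean selection:

```python
import pandas as pd

# Hypothetical employee data
df = pd.DataFrame({'name': ['Fia', 'Gus', 'Hana'],
                   'Salary': [45000, 52000, 60000]})

# Drop the 'Salary' column; axis=1 refers to columns
df_no_salary = df.drop('Salary', axis=1)

# Drop every row where 'name' is not 'Fia'
df_fia = df.drop(df[df['name'] != 'Fia'].index)
```

Note that df.drop() returns a new dataframe and leaves df unchanged unless you pass inplace=True.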

Example Tutorial:
Check out this data cleaning tutorial to see an example of how to drop rows and columns in a pandas dataframe.

3. How to join and merge pandas dataframes
In pretty much all your data science projects, your data will be present across multiple files. Sometimes you will have columns from different files to aggregate.

Python provides a very powerful library, pandas, that has functions to join, concatenate and merge multiple files. The following are all possible with a pandas dataframe: join dataframes along rows, join dataframes along columns, merge dataframes along a column, merge with outer join, merge with inner join, merge with right join, merge with left join, merge while adding a suffix to duplicate column names, merge based on indexes etc.

df_new = pd.concat([df_a, df_b])
concatenates the dataframes df_a and df_b along the rows, stacking one below the other.

df = pd.concat([df_a, df_b], axis=1)
concatenates the dataframes df_a and df_b along the columns, placing them side by side.

pd.concat() basically just appends one dataframe to the other, either along the rows or the columns.

The difference between a concatenation and a merge is that pd.concat() simply combines dataframes and includes all the rows, while pd.merge() combines dataframes based on a matching key value - for example, you might want to combine dataframe rows based on an employee id match. By default, all merges are inner joins. An outer join returns everything an inner join returns, along with all unmatched rows from both dataframes.
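This difference can be sketched with two hypothetical dataframes keyed on an employee id:

```python
import pandas as pd

# Two hypothetical dataframes sharing an 'employee_id' key
df_a = pd.DataFrame({'employee_id': [1, 2, 3], 'name': ['Ana', 'Ben', 'Cara']})
df_b = pd.DataFrame({'employee_id': [2, 3, 4], 'salary': [52000, 60000, 48000]})

# Concatenation just stacks the rows; columns are aligned by name
stacked = pd.concat([df_a, df_b])

# Inner merge: only rows whose employee_id appears in both dataframes
inner = pd.merge(df_a, df_b, on='employee_id', how='inner')

# Outer merge: every row from both dataframes; unmatched cells become NaN
outer = pd.merge(df_a, df_b, on='employee_id', how='outer')
```

Here the inner merge keeps only employee ids 2 and 3, while the outer merge also keeps the unmatched ids 1 and 4.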

Example Tutorial:
Check out this data cleaning tutorial to see an example of how to join and merge pandas dataframes.

4. How to deal with missing values in data cleaning 
The data you inherit for analysis will come from multiple sources and will often have been pulled together ad hoc, so it will not be immediately ready for you to run any kind of model on. One of the most common issues you will have to deal with is missing values in the dataset. There are many reasons why values might be missing: they were left out intentionally, a user did not fill them in, online forms were broken, values were accidentally deleted, legacy issues etc.

Either way, you will need to fix this problem. There are three ways to do this: ignore the missing values, delete the rows that contain them, or fill them in with an approximation.

It's easiest to just drop the missing observations, but you need to be very careful before you do that, because the absence of a value might itself convey information about the data pattern. If you decide to drop missing values:

df_no_missing = df.dropna()
will drop every row that contains at least one missing value, even if the other values in that row are present.

df_cleaned = df.dropna(how='all')
will only drop rows where all cells are NA or missing values. 

To drop columns instead of rows, add the axis=1 parameter to the functions above.
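A compact runnable sketch of these options, on a hypothetical dataframe with missing values; the fillna() call at the end shows simple mean imputation as one common alternative to dropping:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with missing values
df = pd.DataFrame({'age':  [25, np.nan, 31, np.nan],
                   'city': ['Oslo', np.nan, np.nan, np.nan]})

# Drop every row that has at least one missing value
df_no_missing = df.dropna()

# Drop only rows where ALL values are missing
df_cleaned = df.dropna(how='all')

# Alternative: fill missing ages with the column mean (simple imputation)
df_filled = df.fillna({'age': df['age'].mean()})
```

Only one row survives dropna() here, while dropna(how='all') keeps the two rows that have at least one value.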

Example Tutorial:
Check out this data cleaning tutorial to see an example of how to deal with missing values in data cleaning.

