Data Cleaning in Python

This blog talks about the various ways of cleaning up datasets in python for your data science or machine learning project

Get access to all Data Analytics Projects View all Data Analytics Projects

Last Updated: 11 Apr 2024 | BY ProjectPro

Over 70% of the work you will do as a Data Scientist on any Data Science or Statistics project is cleaning your data and manipulating it to make it ready for modelling and analysis.

Build Time Series Models for Gaussian Processes in Python

Downloadable solution code | Explanatory videos | Tech Support

Start Project

Data Cleaning Tutorial Steps

What is Data Cleaning ?

At the start of a data science project, you will inherit multiple data-sets from different teams. You will then be asked to solve for a specific business problem. Your solution may not need all the data you got - you might have to remove columns, modify columns, remove duplicate values, deal with missing values, deal with outlier data etc. Sometimes you will also need to normalize or scale data to make the data fit within a range.

This critical time consuming step is data cleaning or data cleansing. As part of the data cleansing process you will also perform EDA (Exploratory Data Analysis) - here you will visualise the data using graphs and statistical functions to understand the underlying data - mean, median, range, distribution etc.

This Wikipedia explanation gives a good overview of Data Cleaning from a more generic perspective.

Here's what valued users are saying about ProjectPro

As a student looking to break into the field of data engineering and data science, one can get really confused as to which path to take. Very few ways to do it are Google, YouTube, etc. I was one of them too, and that's when I came across ProjectPro while watching one of the SQL videos on the...

Savvy Sahai

Data Science Intern, Capgemini

Having worked in the field of Data Science, I wanted to explore how I can implement projects in other domains, So I thought of connecting with ProjectPro. A project that helped me absorb this topic was "Credit Risk Modelling". To understand other domains, it is important to wear a thinking cap and...

Gautam Vermani

Data Consultant at Confidential

Not sure what you are looking for?

View All Projects

Why is Data Cleaning Important ?

Though most courses, blogs, online materials focus on the modelling aspect of data science, it is the data cleaning step that determines how easy your modelling is going to be. The better your data is, the less complex your learning algorithms need to be. Better structured data that provides the right input values will also determine the accuracy of your predictions.

New Projects

Data cleaning impacts efficiency of rest of your data modeling and decision making process. Hence professional data scientists treat this step as critical as the algorithm building step. Efficiency gains of data cleaning results from:

1) Lesser processing time

2) More accurate predictions

3) Simpler algorithms needed

4) Better learning ability of the model

This Harvard Business Review article conveys the criticality of data cleaning. It basically says that “If Your Data Is Bad, Your Machine Learning Tools Are Useless”.

Get Closer To Your Dream of Becoming a Data Scientist with 70+ Solved End-to-End ML Projects

Data Cleaning Tutorial Steps

The following data cleaning tutorial will walk you through the steps in data cleaning with detailed examples and reusable code snippets.

1. Randomly sample a pandas dataframe

The first step in data cleaning is to quickly get an idea of what is inside your dataset. Randomly picking a few rows to view will help you achieve that.
print(df.take(np.random.permutation(len(df))[:2]))

this command uses 3 functions df.take (), np.random.permutation() and len() to print 2 randomly selected rows from the dataframe df().

len() just measures the length of the dataframe which serves as an input to np.random.permutation(). The value 2 indicates that any 2 rows can be picked up at random by the function.

df.take() selects rows from the dataframe. If you did not want to randomly pick rows, then you can just pass the number of rows as a parameter to df.take().

Example tutorial:

Check out this data cleaning tutorial on how to randomly sample a pandas dataframe

Get Closer To Your Dream of Becoming a Data Scientist with 70+ Solved End-to-End ML Projects

2. Drop row and columns in a pandas dataframe

Deleting rows and columns that don’t fit your analysis is a critical part of data modelling. The df.drop()functions is used to drop rows or columns in a pandas dataframe.

df.drop('Salary', axis=1)
will drop a column named “salary”. Axis=1 indicates that we are referring to a column and not a row.

You can also drop columns based on conditions

df.drop[df.name != 'Fia']
will drop a row where the value of ‘name’ is not ‘Fia’

Example Tutorial:
Check out data cleaning tutorial to see an example of how to drop row and columns in a pandas datafame.

3. How to Join and Merge Pandas dataframe

In pretty much all your data science projects, your data will be present across multiple files. Sometimes you will have columns from different files to aggregate.

Python provides a very powerful library pandas that has functions to join, concatenate and merge multiple files. The following are possible with a pandas dataframe - join dataframes along rows, join dataframes along columns, merge dataframes along a column, merge with outer join, merge with inner join, merge with right join, merge with left join, merge while adding a suffx to duplicate column names, merge based on indexes etc.

df_new = pd.concat([df_a, df_b])
merges 2 dataframes df_a and df_b along the rows

df = pd.concat([df_a, df_b], axis=1)
merges 2 dataframes df_a and df_b along the columns

The pd.concat() basically just appends one dataframe to the other either along the rows or the columns.

Get confident to build end-to-end projects

Access to a curated library of 250+ end-to-end industry projects with solution code, videos and tech support.

Request a demo

The difference between a merge and a join is that a merge simply combines dataframes and includes all the rows while a join combines dataframes based on a matching key value - for example you might want to combine dataframe rows based on a employee id match. By defaulty all joins are also inner joins. An outer join will return everything an inner join returns along with all unmatched rows in the dataframe.

Example Tutorial:
Check out data cleaning tutorial to see an example of how to Join and Merge a pandas datafame.

Get More Practice, More Data Science and Machine Learning Projects, and More guidance.Fast-Track Your Career Transition with ProjectPro

4. How to deal with missing values in data cleaning

The data you inherit for analysis will come from multiple sources and would have been pulled adhoc. So this data will not be immediately ready for you to run any kind of model on. One of the most common issues you will have to deal with is missing values in the dataset. There are many reasons why values might be missing - intentional, user did not fill up, online forms broken, accidentally deleted, legacy issues etc.

Either way you will need to fix this problem. There are 3 ways to do this - either you will ignore the missing values, delete the missing value rows or fill the missing values with an approximation.

Its easiest to just drop the missing observations but you need to very careful before you do that, because the absence of a value might actually be conveying some information about the data pattern. If you decide to drop missing values -

df_no_missing = df.dropna()
will drop any rows with any value missing. Even if some values are available in a row it will still get dropped even if a single value is missing.

df_cleaned = df.dropna(how='all')
will only drop rows where all cells are NA or missing values.

To drop columns, you will have to add the ‘axis=1’ parameter to the above functions.

Example Tutorial:
Check out data cleaning tutorial to see an example of how to deal with missing values in data cleaning.

Access Data Science and Machine Learning Project Code Examples

ProjectPro

ProjectPro is the only online platform designed to help professionals gain practical, hands-on experience in big data, data engineering, data science, and machine learning related technologies. Having over 270+ reusable project templates in data science and big data with step-by-step walkthroughs,

Meet The Author