Data Cleaning Techniques in Data Mining and Machine Learning

By Badr Salah

By one popular estimate, we produce 2.5 quintillion bytes of data per day. With such a vast amount of data available, handling and processing it has become a major concern for companies.



The problem lies in real-world data, which is rarely clean and homogeneous; the data available is seldom good enough on its own to support sound business decisions. This is one of the reasons companies are willing to pay for skilled professionals with data cleaning and visualization expertise. Unclean data usually arises from human error, web scraping, or combining multiple data sources. We have to clean data before we can analyze it and build machine learning models; by some estimates, data scientists spend around 70% of their time cleaning data. Bad quality or unclean data is likely to result in inaccurate insights.

Why are Data Cleaning Techniques Important?

As you set foot in the data world, the first thing you come across is handling data, and by that, we mean cleaning it. Here are a few data quality issues you are likely to come across when working on any real-world data science or machine learning project -

  • Missing data / null values

  • Duplicate data

  • Outliers

  • Erroneous data

  • Irrelevant data

Building a machine learning model that understands the data requires a lot of cleaning and pre-processing. Machines do not learn from data the way humans do; the data must be converted into machine-readable inputs before a model can learn from it. Data cleaning plays a significant role in building a good model.

Data Cleaning Techniques in Machine Learning

Every data scientist must have a good understanding of the following data cleaning techniques in machine learning to build solid data for making better business decisions -

1. Handling Missing Data

The most common data quality issue data scientists encounter is handling missing data, which significantly affects both business analysis and statistical analysis. Missing data, or missing values, occur when no data point is stored for a particular column or feature. A dataset is often built from multiple data sources; these sources might lack actual observations, or the people entering the data might not fill in the correct values, which leads to corrupt data. Different data sources may also indicate missing values in different ways, which complicates analysis further and can significantly impact the conclusions drawn from the data.

Remember, not all NULL data is corrupt; sometimes we need to accept missing values. Whether to do so depends entirely on the dataset and the type of business problem.

In this section, we will cover general considerations for missing values, how Pandas (a Python library) chooses to represent them, and how some built-in Pandas tools help in handling them. You may encounter missing values represented as null, NaN (Not a Number), or NA.

Operating on Null Values

Pandas provides several functions for detecting, removing, and replacing null values:

  • isnull() - Generates a Boolean mask (True/False) indicating missing values.

  • notnull() - The opposite of isnull(); generates a Boolean mask indicating non-missing values.

  • dropna() - Returns a filtered version of the data with null values removed.

  • fillna() - Returns a copy of the data with missing values replaced or imputed.

Detecting Null Values

Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Both return a Boolean mask (True/False) over the data.

In the resulting Boolean mask, True marks a null entry and False marks non-null data.
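
To see these in action, here is a minimal sketch with a toy Series (the values are illustrative):

    import pandas as pd
    import numpy as np

    data = pd.Series([1, np.nan, 'hello', None])

    print(data.isnull())
    # 0    False
    # 1     True
    # 2    False
    # 3     True
    # dtype: bool

    print(data.notnull())  # the inverse mask: True where data is present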

Dropping Null Values

In addition to the masking methods, there are methods to remove missing values or fill them with data. Which of these data cleaning techniques to use depends on the business analysis: before dropping or replacing missing values, we need to analyze the data and check whether dropping them might itself create data quality issues. We can use Pandas' dropna() function to drop null data.


For a DataFrame with null values scattered across many rows and columns, we can only drop full rows or full columns. Depending on the application, we might need to drop either, so dropna() provides options for both.

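Continuing the sketch from above, a toy DataFrame (values are illustrative) shows both options:

    df = pd.DataFrame([[1,      np.nan, 2],
                       [2,      3,      5],
                       [np.nan, 4,      6]])

    print(df.dropna())        # keeps only rows with no null values (just the middle row here)
    print(df.dropna(axis=1))  # keeps only columns with no null values (just the last column here)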

Filling Null Values

Sometimes null values are not irrelevant data; if we can identify patterns in the null values present, we may be able to deduce sensible replacements and keep the data clean. Pandas provides the fillna() method, which returns a copy of the data with null values replaced.

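Continuing the sketch, a few common fill strategies:

    data = pd.Series([1, np.nan, 2, None, 3])

    print(data.fillna(0))  # replace every null with a constant value
    print(data.ffill())    # forward-fill: propagate the last valid value forward
    print(data.bfill())    # back-fill: propagate the next valid value backward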


2. Handling Duplicate Data

Data is collected either by scraping or by combining different data sources, so there is a high chance of duplicate entries, which may also arise from human error. Duplicate data skews analytical outcomes, leading to lousy business analytics. It also leads to incorrect reporting and less reliable results, and predictions made on duplicate values will surely hamper business targets.

The simple code below helps find duplicate values and return the data without any duplicate observations.


While working with a DataFrame in Pandas, we can use the pandas.DataFrame.duplicated() method to find duplicate observations or entries. It outputs a Series of Boolean values (True/False) identifying whether each row is a duplicate or unique.

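A minimal sketch with a toy DataFrame in which the first and last rows hold the same data (the names and ages are illustrative):

    import pandas as pd

    df = pd.DataFrame({
        'name': ['Alice', 'Bob', 'Carol', 'Dan', 'Eve', 'Frank', 'Alice'],
        'age':  [25,      32,    41,      29,    35,    27,      25],
    })

    print(df.duplicated())  # only row 6 is flagged True: it repeats row 0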

From the output above, we can see that row 0 and row 6 contain the same data. To avoid data redundancy and retain only valid values, the code snippet below removes the duplicates.


The parameter keep='first' keeps the first entry and removes all other duplicates.

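Continuing the sketch from above:

    # keep the first occurrence (row 0) and drop all later duplicates (row 6)
    clean_df = df.drop_duplicates(keep='first')
    print(clean_df)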

3. Dealing with Outliers

Outliers are data entries whose values deviate significantly from the rest of the data. Outliers can be found in almost every real-world dataset, and dealing with them is one of the many data cleaning techniques. Outliers also affect data accuracy and business analytics. In most machine learning algorithms, predominantly linear regression models, outliers need to be dealt with, or else the variance of the model can turn out very high, which further leads to false conclusions by the model.

Detecting Outliers

The two most common practices for detecting outliers are:

  1. Normal Distribution

  2. Box Plots

Normal Distribution

Also known as the bell curve, the normal distribution helps us visualize how a particular feature is distributed. The following shows what a normal distribution looks like -

[Figure: Normal distribution (bell curve) with mean µ and standard deviation σ marked]

µ represents the Mean, and σ represents the Standard Deviation.

According to the normal distribution, 68.2% of the data should lie within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three. Data beyond the third standard deviation is usually considered an outlier.
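
As a minimal sketch (assuming a roughly normally distributed numeric feature; the synthetic data below is illustrative), this rule can be applied by flagging values whose z-score exceeds 3:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    values = pd.Series(np.append(rng.normal(50, 5, 1000), [95.0, 2.0]))  # two planted outliers

    z_scores = (values - values.mean()) / values.std()
    print(values[z_scores.abs() > 3])  # flags the entries beyond 3 standard deviations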

Box-Plot

A box plot is a visualization technique used to represent the distribution of data.

[Figure: Box plot showing the box, whiskers, and outlier points]

A box plot represents a five-number summary:

  1. Minimum: the lowest data point of a feature (excluding outliers).

  2. Maximum: the highest data point of a feature (excluding outliers).

  3. Median: the middle data point of a feature.

  4. Lower Quartile (Q1): the value below which 25% of the data lies.

  5. Upper Quartile (Q3): the value below which 75% of the data lies.

A box plot has two main parts, a box and a set of whiskers, as shown in the figure above. The IQR (interquartile range) is the distance between the upper and lower quartiles.

How to Deal with Outliers using Python Pandas

There are two primary ways of dealing with Outliers:

  1. Trimming

  2. Creating a Threshold

Trimming

This data cleaning technique eliminates outlier values from the dataset, completely discarding values that deviate significantly from the rest of the data. With a box plot, any value lying more than 1.5 × IQR beyond the quartiles is considered an outlier and removed from the feature.
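
A minimal sketch of IQR-based trimming, reusing the values Series from the sketch above:

    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    trimmed = values[(values >= lower) & (values <= upper)]  # entries outside the fences are dropped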

Creating a Threshold

This method is most useful when the original data has very few data points and dropping any of them could lead to false conclusions, which is a poor business strategy. Hence, using this data cleaning technique, we convert all the values that fall beyond 1.5 × IQR to a set threshold value, which helps us keep quality data.
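
A minimal sketch of thresholding (often called capping or winsorizing), using the same IQR bounds as above:

    capped = values.clip(lower=lower, upper=upper)  # out-of-range values are set to the nearest bound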


4. Erroneous Data

One of the most common problems in data cleaning is dealing with incorrect or corrupt data and converting it into clean data. This data cleaning technique requires considerable domain knowledge to make appropriate data transformations. It also goes without saying that if our data contains errors, such as those introduced by humans, the data cleaning processes discussed above will bear no good results.

A few examples of this type of error include:

  1. Spelling mistakes.

  2. A missing '@' in an email address.

  3. Inconsistent formatting, such as date formats.

  4. Financial records kept in different currency denominations.

  5. The presence of more than one language due to multiple data sources.

  6. Incorrect data types.


Data science practitioners typically keep pre-defined functions ready to validate data and ensure dirty data is caught through basic validation. Monitoring errors of these kinds leads to more efficient business practices and more valuable outcomes.
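
As a minimal sketch (the helper names here are illustrative, not a standard API), such validation functions might look like this:

    import re
    from datetime import datetime

    def is_valid_email(email):
        # basic heuristic: text@text.text (not full RFC-compliant validation)
        return re.match(r'^[^@\s]+@[^@\s]+\.[^@\s]+$', email) is not None

    def is_valid_date(value, fmt='%d-%m-%Y'):
        # checks that a date string parses under the expected format
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            return False

    print(is_valid_email('user.example.com'))  # False: missing '@'
    print(is_valid_date('17-28-1994'))         # False: there is no 28th month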


5. Removal of Irrelevant Data

Usually, while creating a dataset, we scrape data from various sources and combine datasets, a step also known as data collection. To solve a particular business problem, we might need various features from this stored data, but using all the features might not help the business problem at hand.

Analyzing data that carries no weight for a business problem is of no value; hence, removing irrelevant data is common practice in data cleansing, and deciphering the relevancy of data and extracting clean data becomes an important step in the data cleaning process.

Examples of Irrelevant Data

  1. Suppose we have features such as the Age of a person and their Birth Year; keeping both features would mean keeping the same information twice, which in turn creates duplicate data and hampers the data cleaning process.

  2. Suppose we are clustering our customers based on their purchase history; features such as Email Address, Mobile Number, and Nationality would be useless.

  3. Similarly, if we are dealing with Sales of a product in the United States, then keeping data from Bangladesh would seem irrelevant.

Note: Only if you are 100% sure that a feature is irrelevant should you use this data cleaning method; otherwise, use statistics to find out its relevance and act accordingly.
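
A minimal sketch of dropping irrelevant features (the column names are hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        'purchase_total': [120.5, 80.0, 310.2],
        'email_address':  ['a@x.com', 'b@y.com', 'c@z.com'],
        'mobile_number':  ['111-1111', '222-2222', '333-3333'],
    })

    # drop features known to be irrelevant to the business problem
    # (e.g., for clustering customers by purchase history)
    df = df.drop(columns=['email_address', 'mobile_number'])

    # for numeric features, df.corr(numeric_only=True) can help judge relevance statistically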

Data Cleaning Process in Data Mining

Data mining is a process used by big companies to turn raw data into useful information, such as discovering trends and patterns. Nowadays, social media companies use the data mining process heavily, mining personal information to influence preferences. This process captures a user's interests and likes and breaks down patterns to provide more useful services.

The data mining process uses models such as Clustering, Decision Trees, and Neural Networks to convert a massive collection of data into useful output.

By studying the process steps mentioned below, we can analyze huge datasets and convert them into meaningful insights.


Data Cleaning Techniques in Data Mining

  • Supervising Errors

  • Normalizing the Mining Process

  • Verifying Data Precision

  • Handling of Outliers/Duplicates/Dirty Data

  • Monitoring Data Flow

Supervising Errors

Keep track of where most mistakes occur and ensure they do not repeat; this helps keep corrupt data out of the data flow. This step is essential when we need to get data from various sources, as each source might use different formatting. Before scraped data is used, we need to monitor its quality, security, and consistency to ensure the resulting data is useful.

Normalizing The Mining Process

Normalize the point at which data is collected and inserted into the database, to reduce duplication and provide an efficient data flow. Before combining data from various sources, we need to normalize the data so that data integration takes place without errors. This step uses data cleaning tools to ensure accumulated data is standardized and formatted as required.

Verifying Data Precision

Analyzing data and investing in proper tools to clean and update it in real time helps with data accuracy and predictions. Data cleansing is especially important during ETL (extract, transform, and load) for creating reports and business analytics.

Handling of Outliers/ Duplicate/ Dirty Data

Look for duplicates and outliers using appropriate data cleaning tools to save time while analyzing the data. Handling duplicates and outliers once, during cleaning, avoids repeatedly reworking the same data source. Data transformation is required before using the data, and data cleansing tools help clean it using built-in transformation techniques.

Monitoring Data Flow

Once data is prepared via data integration and migration, create a pipeline for the data to flow from one cleaning process to the next. Once the data is cleaned, ensure that the bad data in the source is replaced with the clean data.

Data Cleaning Techniques in Data Warehouse

A data warehouse is a repository for storing data and finding key insights from it to make better business decisions. It is a data management system that facilitates and supports business intelligence. It is subject-oriented and integrates several data sources into a consistent format. It is also time-variant, meaning data carries a timestamp that tracks when it entered the system.

The data warehouse needs consistent, accurate, and deduplicated data to pass on to the data analytics process. Here are a few examples of data cleaning in data warehouses:

  1. Schema-Related Issues

  2. Instance-Related Issues

Schema-Related Issues

A schema is a logical representation of a data table. Here we define the data types of the features available in the data warehouse.

For example, if the schema defines the feature AGE as an integer and an entry for Age is recorded as Twenty-five, the system will automatically throw an error.

Another example: Birthdate = 17.28.1994 raises a non-permissible-value error because the entered value is outside the valid range - there is no 28th month.

Issues like two employees having the same ID come under Schema-Related issues.
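
A minimal sketch (toy values mirroring the examples above) of catching schema violations with Pandas type coercion:

    import pandas as pd

    ages = pd.Series(['25', 'Twenty-five', '41'])
    numeric_ages = pd.to_numeric(ages, errors='coerce')  # non-numeric entries become NaN
    print(ages[numeric_ages.isnull()])                   # flags 'Twenty-five'

    # an impossible date (there is no 28th month) becomes NaT instead of crashing the pipeline
    print(pd.to_datetime('17.28.1994', format='%d.%m.%Y', errors='coerce'))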

Instance-Related Issues

These issues may arise from misspellings during data entry or from incorrect references. With multiple sources, it can also mean that the same feature is represented differently in different sources.

Examples:

  • Misspellings: state_name = Bangaloree.

  • Incorrect references: emp_id = 22, while the actual id was 12.
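
A minimal sketch (hypothetical column and correction map) of fixing known misspellings:

    import pandas as pd

    df = pd.DataFrame({'state_name': ['Bangalore', 'Bangaloree', 'Mumbai']})

    # map known misspellings to their canonical values
    corrections = {'Bangaloree': 'Bangalore'}
    df['state_name'] = df['state_name'].replace(corrections)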

As we can see, if these issues are not fixed, queries over this data will return incorrect information, potentially leading to business decisions based on flawed data.

Data Cleaning Tools

Here is a list of online data cleaning tools to help you clean messy data with ease -

  • OpenRefine

  • Drake

  • Data Cleaner

  • Winpure

  • Cloudindigo


5 Data Cleaning Project Ideas to Help You Master the Art of Data Cleaning

Now that we have gone through various data cleaning techniques in machine learning and data mining, it’s time to get your hands dirty with these beginner-friendly data cleaning project ideas for practice -

  1. Titanic Dataset - Clean this messy data to find the probability of a person surviving the Titanic.

  2. New York City Airbnb Open Data - This Dataset describes the listing activity and metrics in NYC; use the cleaning techniques or methods mentioned above to find out more about hosts and geographical availability.

  3. Hotel Booking Demand - Find the best time to book your hotels or the optimal time of stay to get the best prices.

  4. Taxi Trajectory Data - Clean this dataset of taxi trips over time and uncover interesting facts about the taxi service across a whole year.

  5. Trending YouTube Video Stats - Use Data Cleaning techniques to determine what factors affect a YouTube video's popularity.

These projects will help you practice various data cleansing methods and improve your data science knowledge.

Data cleansing techniques might seem time-consuming to any data science practitioner, but they are an integral part of the data science project workflow. Obtaining high-quality data is essential in data mining and machine learning. Ignoring missing values, duplicate data, improper data entry, and flawed data collection from multichannel sources leads to poor business strategy. Once you have cleaned your data, you will need the right tools for analyzing it and presenting the results in your reports.

 


About the Author

Badr Salah

A computer science graduate with over four years of writing experience in various fields. His passion for technology and knack for clear communication enables him to simplify complex topics for readers. Fun fact: Badr has a mixed-breed dog named
