What is Data Validation in Python?

A simple guide to what data validation is and how to implement data validation in Python using Pandas.

Data Validation in Python is a crucial aspect of ensuring data accuracy and reliability. Python offers a variety of libraries, such as Pandas, to automate data validation processes, making it essential for fields like data science, business analytics, and software development. This article dives into the concept of data validation, explores data validation techniques in Python, and provides a step-by-step guide to automating data validation using Python and Pandas.


What is Data Validation in Python?

Data validation in Python is the process of ensuring that data is accurate, consistent, and reliable. It involves verifying that data meets predefined criteria and rules to maintain its quality, preventing erroneous or unreliable information from entering the system. Python offers numerous libraries and tools, such as Pandas, NumPy, and regular expressions, to automate data validation processes. By implementing data validation, Python developers can identify and handle errors, anomalies, and inconsistencies in datasets, enhancing the overall integrity and trustworthiness of the data. This is essential for data-driven decision-making, analysis, and reporting in applications such as data science, business analytics, and software development.

Data Validation Techniques in Python

Data validation techniques in Python encompass a variety of methods to ensure the quality and reliability of data. These techniques include:

  • Type Checking: Verifying data types (integers, strings, etc.) to ensure compatibility.

  • Range Validation: Ensuring data falls within specified numeric ranges.

  • Format Validation: Validating data format using regular expressions or predefined patterns.

  • Missing Values Check: Confirming mandatory data fields are not empty.

  • Cross-field Validation: Checking relationships between fields to detect inconsistencies.

  • Database Constraints: Implementing database constraints like primary keys, foreign keys, and unique indexes.

  • Custom Validation Functions: Creating custom functions for specific validation requirements.

These techniques are used to validate data in Python, and they help maintain data integrity and enhance its usability across diverse Python applications.
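A few of these techniques can be sketched in plain Python. The helper names and the email pattern below are illustrative, not from any standard library:

```python
import re

def check_type(value, expected_type):
    """Type checking: verify a value has the expected type."""
    return isinstance(value, expected_type)

def check_range(value, low, high):
    """Range validation: ensure a numeric value falls within [low, high]."""
    return low <= value <= high

# A deliberately simple email pattern, for demonstration only
EMAIL_PATTERN = re.compile(r'^[\w.+-]+@[\w-]+\.[\w.]+$')

def check_email_format(value):
    """Format validation: match a value against a regular expression."""
    return bool(EMAIL_PATTERN.match(value))

print(check_type(42, int))                     # True
print(check_range(150, 0, 120))                # False (out of range)
print(check_email_format('user@example.com'))  # True
```

Custom validation functions like these can then be combined and applied column by column with Pandas, as the walkthrough below shows.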

How to Do Data Validation in Python with Pandas?

Let us explore a step-by-step guide to performing data validation using Python.

Step 1: Import the module

In this step, we import Pandas, the Python library we will use for data validation.

import pandas as pd

Step 2: Prepare the dataset

The sample dataset (a supermarket sales CSV) looks as follows:

Step 3: Validate the CSV file

To check whether the data frame is empty, use the code below:

def read_file():
    df = pd.read_csv(filename)
    if df.empty:
        print('CSV file is empty')
    else:
        print('CSV file is not empty')
    return df

Pass the file name as the argument, as below:

filename ='C:\\Users\\nfinity\\Downloads\\Data sets\\supermarket_sales.csv'

Call the function as below:

df = read_file()

Output of the above code:

Using the Pandas library, we can determine each column's data type by iterating over the dtypes:

import pandas as pd
df = pd.read_csv('supermarket_sales.csv', nrows=2)
for dtype in df.dtypes.items():
    print(dtype)

Alternatively, we can inspect the data types directly:

df.dtypes

Output of the above lines:

Step 4: Processing the matched columns

In this step, we process only the columns that match between the validation data and the input data, arranging them by column name as below.

import pandas as pd
data = pd.read_csv('C:\\Users\\nfinity\\Downloads\\Data sets\\supermarket_sales.csv')
df = data[sorted(data.columns)]
validation = df
validation['chk'] = validation['Invoice ID'].apply(lambda x: True if x in df else False)
validation = validation[validation['chk'] == True].reset_index()
df

Output of the above code:

Step 5: Check Data Types and Convert the Date Column

In this step, we check the columns' data types and convert any date-like object columns to datetime, as in the code below:

for col in df.columns:
    if df[col].dtype == 'object':
        try:
            df[col] = pd.to_datetime(df[col])
        except ValueError:
            pass
print(df.dtypes)

Output of the above code:

A specific column can also be converted explicitly:

renamed_data['buy_date'] = pd.to_datetime(renamed_data['buy_date'])
renamed_data['buy_date'].head()

Output of the above code:

Step 6: Check for Missing Values

Here we validate the data by checking for missing values. The code below loops over the columns and reports whether each column has any missing values:

for col in df.columns:
    miss = df[col].isnull().sum()
    if miss > 0:
        print("{} has {} missing value(s)".format(col, miss))
    else:
        print("{} has NO missing value!".format(col))

Output of the above code:

How to automate Data Validation?

Let's go through a step-by-step guide to automating data validation in Python, with a practical example using the Pandas library:

Step 1: Define Data Validation Rules

Suppose you have a dataset of sales transactions and want to validate it. Your rules could include checking for non-negative sales amounts, ensuring proper date formatting, and verifying that product IDs are in the correct format.
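One way to capture such rules is a plain dictionary, so they are easy to review and extend. The column names and the product-ID pattern here are assumptions for illustration:

```python
import re

# Hypothetical rule set for a sales dataset
VALIDATION_RULES = {
    'sales_amount': {'type': float, 'min': 0},          # non-negative amounts
    'date': {'format': '%Y-%m-%d'},                     # ISO-style dates
    'product_id': {'pattern': re.compile(r'^P\d{4}$')}, # e.g. 'P0042'
}

for column, rule in VALIDATION_RULES.items():
    print(column, rule)
```

Keeping the rules in one data structure separates *what* is validated from *how*, which simplifies the later steps.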

Step 2: Choose the Right Tool

Python, along with the Pandas library, is an excellent choice for data validation. It provides powerful tools for data manipulation and validation.

Step 3: Load Your Data

Load a sample dataset using Pandas:

import pandas as pd

# Load the dataset, say a sales dataset
df = pd.read_csv('sales_data.csv')

Step 4: Implement Data Validation Checks

Write code to perform data validation checks. For example,

To validate sales amounts:

# Check for non-negative sales amounts
invalid_sales = df[df['sales_amount'] < 0]
if not invalid_sales.empty:
    raise ValueError('Invalid sales amounts found.')

To validate date formatting:

# Check date formatting: errors='coerce' marks unparseable dates as NaT
invalid_dates = df[pd.to_datetime(df['date'], errors='coerce').isna()]
if not invalid_dates.empty:
    raise ValueError('Invalid date formatting found.')

You can similarly implement checks for other rules.

Step 5: Automate the Process

Create functions or scripts to encapsulate these data validation checks. This makes it easy to run them with a single command.
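For example, the checks from Step 4 could be wrapped in one function that collects every failure at once. This is a sketch; `sales_amount` and `date` are the column names assumed earlier:

```python
import pandas as pd

def validate_sales_data(df):
    """Run all validation checks and return a list of error messages."""
    errors = []
    # Rule 1: non-negative sales amounts
    if (df['sales_amount'] < 0).any():
        errors.append('Invalid sales amounts found.')
    # Rule 2: every date must parse; errors='coerce' turns bad dates into NaT
    if pd.to_datetime(df['date'], errors='coerce').isna().any():
        errors.append('Invalid date formatting found.')
    return errors

# Example run on a small in-memory dataset
sample = pd.DataFrame({'sales_amount': [10.0, -5.0],
                       'date': ['2023-01-01', 'not-a-date']})
print(validate_sales_data(sample))
```

Returning a list rather than raising on the first failure lets one run surface every problem in the dataset.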

Step 6: Handle Validation Failures

Define actions for handling validation failures. For example, raise exceptions with descriptive error messages, log errors, or attempt data cleaning and correction.
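A minimal sketch of such handling, assuming the checks produce a list of error messages, might log each failure and then raise:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('data_validation')

def handle_validation_result(errors):
    """Log each failure and raise if any check failed."""
    if errors:
        for msg in errors:
            logger.error(msg)
        raise ValueError('{} validation check(s) failed.'.format(len(errors)))
    logger.info('All validation checks passed.')

try:
    handle_validation_result(['Invalid sales amounts found.'])
except ValueError as exc:
    print(exc)  # 1 validation check(s) failed.
```

Logging before raising preserves the full list of failures even if the exception is caught and summarized upstream.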

Step 7: Schedule Regular Validation

Use a scheduling tool or data validation framework in Python (e.g., Pandera or Great Expectations) to run your data validation script at regular intervals.

Step 8: Monitor and Report

Implement a monitoring system to generate reports on validation results. This can be as simple as sending email notifications when errors occur.
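Reporting can start as simply as appending a timestamped pass/fail summary to a log file. This is a sketch; email or dashboard delivery would be environment-specific:

```python
from datetime import datetime

def report_validation(errors, log_path='validation_report.log'):
    """Append a timestamped pass/fail summary to a report file."""
    stamp = datetime.now().isoformat(timespec='seconds')
    status = 'FAILED' if errors else 'PASSED'
    line = '{} validation {}: {} error(s)\n'.format(stamp, status, len(errors))
    with open(log_path, 'a') as f:
        f.write(line)
    return line

print(report_validation([]))
```

The returned summary line can later be fed to whatever notification channel the team uses.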

Step 9: Fine-Tune and Update Rules

As the dataset evolves, regularly review and update your validation rules to adapt to changes and maintain data quality.

Step 10: Document the Process

Document the entire automated data validation process, including rules, tools, and schedules, for future reference and for sharing with team members.

In this example, automating data validation ensures that your sales dataset adheres to specific rules, such as non-negative sales amounts and proper date formatting, reducing the risk of data quality issues in your analyses.

Master Data Validation in Python with ProjectPro!

Data validation is a fundamental component of data-related tasks in various industries. To enhance your skills in Python and data validation, consider enrolling in ProjectPro's comprehensive courses. Whether you are a beginner or an experienced professional, ProjectPro offers a wide range of courses, from Python basics to advanced data analytics. So, take a deep dive into the world of data science and big data projects offered by ProjectPro and give the much-needed boost to your career.

