How to perform data validation using python by processing only matched columns?

How to perform data validation using python by processing only matched columns?

How to perform data validation using python by processing only matched columns?

This recipe helps you perform data validation using python by processing only matched columns


Recipe Objective

In most of the big data scenarios , Data validation is checking the accuracy and quality of source data before using, importing or otherwise processing data. Different types of validation can be performed depending on destination constraints or objectives. Data validation is a form of data cleansing. In most of the production environments , data validation is a key step in data pipelines.

System requirements :

  • Install the python module as follows if the below modules are not found:
  • pip install pandas
  • The below codes can be run in Jupyter notebook, or any python console

Step 1: Import the module

In this scenario we are going to use pandas numpy and random libraries import the libraries as below :

import pandas as pd

Step 2 :Prepare the dataset

Data of Output looks as follows:

Step 3: Validate the data frame

To validate the data frame is empty or not using below code as follows :

def read_file(): df = pd.read_csv(filename) if(df.empty): print ('CSV file is empty') else: print ('CSV file is not empty') return df

Pass the file name as the argument as below :

filename ='C:\\Users\\nfinity\\Downloads\\Data sets\\supermarket_sales.csv'

Call the function as belows is

df = read_file()

Output of the above code :

Using pandas library to determine the csv data datatype by iterating the rows :

import pandas as pd df = pd.read_csv(supermarket_sales.csv', nrows=2) for dtype in df.dtypes.iteritems(): print(dtype)

Or also we can easily know the data types by using below code :


Output of the above lines :

Step 4: Processing the matched columns

Here in this scenario we are going to processing only matched columns between validation and input data arrange the columns based on the column name as below.

import pandas as pd data = pd.read_csv('C:\\Users\\nfinity\\Downloads\\Data sets\\supermarket_sales.csv') df = df[sorted(data)] validation = df validation['chk'] = validation['Invoice ID'].apply(lambda x: True if x in df else False) validation = validation[validation['chk'] == True].reset_index() df

Output of the above code :

Step 5: Check Data Type convert as Date column

Here in this scenario we are going to check the columns data types and and convert the date column as below code:

for col in df.columns: if df[col].dtype == 'object': try: df[col] = pd.to_datetime(df[col]) except ValueError: pass print(df.dtypes)

Output of the above code:

renamed_data['buy_date'] = pd.to_datetime(renamed_data['buy_date']) renamed_data['buy_date'].head()

Output of the above code :

Step 6: validate data to check missing values

Here we are going to validating the data to checking the missing values, below code will loop the data column values and check if the columns has any missing value is as follow below

for col in df.columns: miss = df[col].isnull().sum() if miss>0: print("{} has {} missing value(s)".format(col,miss)) else: print("{} has NO missing value!".format(col))

Output of the above code :

Relevant Projects

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project-Get a handle on using Python with Spark through this hands-on data processing spark python tutorial.

Hadoop Project for Beginners-SQL Analytics with Hive
In this hadoop project, learn about the features in Hive that allow us to perform analytical queries over large datasets.

Analyse Yelp Dataset with Spark & Parquet Format on Azure Databricks
In this Databricks Azure project, you will use Spark & Parquet file formats to analyse the Yelp reviews dataset. As part of this you will deploy Azure data factory, data pipelines and visualise the analysis.

Hadoop Project-Analysis of Yelp Dataset using Hadoop Hive
The goal of this hadoop project is to apply some data engineering principles to Yelp Dataset in the areas of processing, storage, and retrieval.

Analysing Big Data with Twitter Sentiments using Spark Streaming
In this big data spark project, we will do Twitter sentiment analysis using spark streaming on the incoming streaming data.

Event Data Analysis using AWS ELK Stack
This Elasticsearch example deploys the AWS ELK stack to analyse streaming event data. Tools used include Nifi, PySpark, Elasticsearch, Logstash and Kibana for visualisation.

Data Warehouse Design for E-commerce Environments
In this hive project, you will design a data warehouse for e-commerce environments.

Implementing Slow Changing Dimensions in a Data Warehouse using Hive and Spark
Hive Project- Understand the various types of SCDs and implement these slowly changing dimesnsion in Hadoop Hive and Spark.

Web Server Log Processing using Hadoop
In this hadoop project, you will be using a sample application log file from an application server to a demonstrated scaled-down server log processing pipeline.

Explore features of Spark SQL in practice on Spark 2.0
The goal of this spark project for students is to explore the features of Spark SQL in practice on the latest version of Spark i.e. Spark 2.0.