How to perform data validation using python by sorting columns on name?

How to perform data validation using python by sorting columns on name?

How to perform data validation using python by sorting columns on name?

This recipe helps you perform data validation using python by sorting columns on name


Recipe Objective

In most of the big data scenarios , Sorting/ordering is a process of arranging items systematically, and has two common, yet distinct meanings: ordering: arranging items in a sequence ordered by some criterion; categorizing: grouping items with similar properties. Ordering of the field names is extremely required before storing the data in the required order , so that it can be further processed by creating schema on top of it and doing further processing.

System requirements :

  • Install the python module as follows if the below modules are not found:
  • pip install pandas
  • The below codes can be run in Jupyter notebook, or any python console

Step 1: Import the module

In this scenario we are going to use pandas numpy and random libraries import the libraries as below :

import pandas as pd

Step 2 :Prepare the dataset

Here we are using the stock sales related excel dataset to from the local Please download from here Click Here to Download

Data of Output looks as follows:

Reading the excel file using pandas library as below and storing data as data frame and print 5 line of data.

df = pd.read_excel('sales_data.xlsx') df.head()

Output of the above code :

To print only the column names type df.columns

Step 3: Add underscore instead of Space

Here in the code below we are going to add underscore instead of space using the replace method and to convert the column names into uppercase upper method.

for col in df.columns: df.rename(columns={col:col.upper().replace(" ","_")},inplace=True) print(df.columns)

Output of the above code :

Step 4: Arrange or sorting the columns

Here we are going to sort and arrange the columns based on the column name, so it means that we arrange the columns in alphabetical order.

df = df[sorted(df)] validation = df validation['chk'] = validation['ACCOUNT_NUMBER'].apply(lambda x: True if x in df else False) validation = validation[validation['chk'] == True].reset_index() df

Output of the above code :

Create a new dataframe and rename the column names as below

renamed_data = df.rename(columns = {'ACCOUNT_NUMBER': 'acnt_num','DATE':'buy_date','EXT_PRICE':'ext_price','NAME':'acc_holder_name', 'QUANTITY':'quantity','SKU':'stock_keep_unit', 'UNIT_PRICE':'unit_price'}) renamed_data.columns renamed_data

Output of the above code :

Step 5: Convert string to Date Data type and Select the row between dates

Here we have a column buy_date is string data type so we are going to convert the string data type date data type and we will select the row in between dates as follows below

Convert the string to Date datatype as below code:

renamed_data['buy_date'] = pd.to_datetime(renamed_data['buy_date']) renamed_data['buy_date'].head()

Output of the above code :

Convert the string to Date datatype as below code:

start_date = pd.to_datetime('2014-01-08 17:46:40') end_date = pd.to_datetime('2014-01-23 20:40:21') renamed_data.loc[(renamed_data['buy_date'] > start_date) & (renamed_data['buy_date'] < end_date)]

Output of the above code :

Relevant Projects

Build a big data pipeline with AWS Quicksight, Druid, and Hive
Use the dataset on aviation for analytics to simulate a complex real-world big data pipeline based on messaging with AWS Quicksight, Druid, NiFi, Kafka, and Hive.

Tough engineering choices with large datasets in Hive Part - 1
Explore hive usage efficiently in this hadoop hive project using various file formats such as JSON, CSV, ORC, AVRO and compare their relative performances

Real-Time Log Processing in Kafka for Streaming Architecture
The goal of this apache kafka project is to process log entries from applications in real-time using Kafka for the streaming architecture in a microservice sense.

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project-Get a handle on using Python with Spark through this hands-on data processing spark python tutorial.

Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis
In this PySpark project, you will simulate a complex real-world data pipeline based on messaging. This project is deployed using the following tech stack - NiFi, PySpark, Hive, HDFS, Kafka, Airflow, Tableau and AWS QuickSight.

Explore features of Spark SQL in practice on Spark 2.0
The goal of this spark project for students is to explore the features of Spark SQL in practice on the latest version of Spark i.e. Spark 2.0.

Analysing Big Data with Twitter Sentiments using Spark Streaming
In this big data spark project, we will do Twitter sentiment analysis using spark streaming on the incoming streaming data.

Spark Project-Analysis and Visualization on Yelp Dataset
The goal of this Spark project is to analyze business reviews from Yelp dataset and ingest the final output of data processing in Elastic Search.Also, use the visualisation tool in the ELK stack to visualize various kinds of ad-hoc reports from the data.

Analyse Yelp Dataset with Spark & Parquet Format on Azure Databricks
In this Databricks Azure project, you will use Spark & Parquet file formats to analyse the Yelp reviews dataset. As part of this you will deploy Azure data factory, data pipelines and visualise the analysis.

Tough engineering choices with large datasets in Hive Part - 2
This is in continuation of the previous Hive project "Tough engineering choices with large datasets in Hive Part - 1", where we will work on processing big data sets using Hive.