How to perform data validation using python by sorting columns on name?

How to perform data validation using python by sorting columns on name?

How to perform data validation using python by sorting columns on name?

This recipe helps you perform data validation using python by sorting columns on name


Recipe Objective

In most of the big data scenarios , Sorting/ordering is a process of arranging items systematically, and has two common, yet distinct meanings: ordering: arranging items in a sequence ordered by some criterion; categorizing: grouping items with similar properties. Ordering of the field names is extremely required before storing the data in the required order , so that it can be further processed by creating schema on top of it and doing further processing.

System requirements :

  • Install the python module as follows if the below modules are not found:
  • pip install pandas
  • The below codes can be run in Jupyter notebook, or any python console

Step 1: Import the module

In this scenario we are going to use pandas numpy and random libraries import the libraries as below :

import pandas as pd

Step 2 :Prepare the dataset

Here we are using the stock sales related excel dataset to from the local Please download from here Click Here to Download

Data of Output looks as follows:

Reading the excel file using pandas library as below and storing data as data frame and print 5 line of data.

df = pd.read_excel('sales_data.xlsx') df.head()

Output of the above code :

To print only the column names type df.columns

Step 3: Add underscore instead of Space

Here in the code below we are going to add underscore instead of space using the replace method and to convert the column names into uppercase upper method.

for col in df.columns: df.rename(columns={col:col.upper().replace(" ","_")},inplace=True) print(df.columns)

Output of the above code :

Step 4: Arrange or sorting the columns

Here we are going to sort and arrange the columns based on the column name, so it means that we arrange the columns in alphabetical order.

df = df[sorted(df)] validation = df validation['chk'] = validation['ACCOUNT_NUMBER'].apply(lambda x: True if x in df else False) validation = validation[validation['chk'] == True].reset_index() df

Output of the above code :

Create a new dataframe and rename the column names as below

renamed_data = df.rename(columns = {'ACCOUNT_NUMBER': 'acnt_num','DATE':'buy_date','EXT_PRICE':'ext_price','NAME':'acc_holder_name', 'QUANTITY':'quantity','SKU':'stock_keep_unit', 'UNIT_PRICE':'unit_price'}) renamed_data.columns renamed_data

Output of the above code :

Step 5: Convert string to Date Data type and Select the row between dates

Here we have a column buy_date is string data type so we are going to convert the string data type date data type and we will select the row in between dates as follows below

Convert the string to Date datatype as below code:

renamed_data['buy_date'] = pd.to_datetime(renamed_data['buy_date']) renamed_data['buy_date'].head()

Output of the above code :

Convert the string to Date datatype as below code:

start_date = pd.to_datetime('2014-01-08 17:46:40') end_date = pd.to_datetime('2014-01-23 20:40:21') renamed_data.loc[(renamed_data['buy_date'] > start_date) & (renamed_data['buy_date'] < end_date)]

Output of the above code :

Relevant Projects

Spark Project-Analysis and Visualization on Yelp Dataset
The goal of this Spark project is to analyze business reviews from Yelp dataset and ingest the final output of data processing in Elastic Search.Also, use the visualisation tool in the ELK stack to visualize various kinds of ad-hoc reports from the data.

Explore features of Spark SQL in practice on Spark 2.0
The goal of this spark project for students is to explore the features of Spark SQL in practice on the latest version of Spark i.e. Spark 2.0.

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project-Get a handle on using Python with Spark through this hands-on data processing spark python tutorial.

Yelp Data Processing using Spark and Hive Part 2
In this spark project, we will continue building the data warehouse from the previous project Yelp Data Processing Using Spark And Hive Part 1 and will do further data processing to develop diverse data products.

Web Server Log Processing using Hadoop
In this hadoop project, you will be using a sample application log file from an application server to a demonstrated scaled-down server log processing pipeline.

Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis
In this PySpark project, you will simulate a complex real-world data pipeline based on messaging. This project is deployed using the following tech stack - NiFi, PySpark, Hive, HDFS, Kafka, Airflow, Tableau and AWS QuickSight.

Analyse Yelp Dataset with Spark & Parquet Format on Azure Databricks
In this Databricks Azure project, you will use Spark & Parquet file formats to analyse the Yelp reviews dataset. As part of this you will deploy Azure data factory, data pipelines and visualise the analysis.

Implementing Slow Changing Dimensions in a Data Warehouse using Hive and Spark
Hive Project- Understand the various types of SCDs and implement these slowly changing dimesnsion in Hadoop Hive and Spark.

Real-Time Log Processing using Spark Streaming Architecture
In this Spark project, we are going to bring processing to the speed layer of the lambda architecture which opens up capabilities to monitor application real time performance, measure real time comfort with applications and real time alert in case of security

Hadoop Project for Beginners-SQL Analytics with Hive
In this hadoop project, learn about the features in Hive that allow us to perform analytical queries over large datasets.