How to fix the date issues in the data using pandas and write the corrected data to csv?

This recipe helps you fix the date issues in the data using pandas and write the corrected data to csv

Recipe Objective

In most of the big data scenarios , there will be a requirement to use a date module for fixing the date issues like incorrect date , flipping the date format and changing the date formats and correct them based on certain criteria and flag the incorrect dates and fix them appropriately. This module is re-usable and can be used to fix most of the frequently occurring date issues in your big data pipeline.

Master the Art of Data Cleaning in Machine Learning

System requirements :

  • Installpython modules as follows: pip install pandas pip install numpy
  • The below codes can be run in Jupyter notebook , or any python console
  • In this scenario we are going to use two excel files which have consequent months of data with various date fields and we fix them accordingly feb_order_dataset , jan_order_dataset

Step 1: Import the modules

In this example we are going to use the pandas, numpy and datetime modules these modules are used for data manipulation pandas data structures and operations for manipulating numerical tables and time series.

import pandas as pd import datetime import numpy as np

Step 2: function to flip month and date

def day_month_flip(date_to_flip): return pd.to_datetime(date_to_flip.strftime('%Y-%d-%m %H:%M:%S'))

Step 3: Write conditions to check the date issues

In the code below we wrote conditions to check the Incorrect dates,Flipped dates, New date values compared to previous file's data Invalid date format And add three new columns for new date, flag for new date, status of new date to the output file which can be added to the pipeline which can use the corrected date column for further ingestion.

def check_condtn1(date_to_check, month_val): if date_to_check > month_val: if date_to_check.day <= 12: flipped_date = day_month_flip(date_to_check) print('$$', flipped_date) if flipped_date <= month_val: return [flipped_date, 1] else: return [date_to_check, 2] else: return [date_to_check, 2] else: return [date_to_check, 0] def check_condtn2(date_to_check, date_to_check_prev, month_val): if date_to_check != date_to_check_prev: if date_to_check.day <= 12: flipped_date = day_month_flip(date_to_check) if flipped_date == date_to_check_prev and flipped_date <= month_val: return [flipped_date, 1] else: return [date_to_check, 2] else: return [date_to_check, 2] else: return [date_to_check, 0]

Step 4 : Checking new date value

In this below code we are compare the old order date with new order_date

def check_new_value(a, x, y): month_last = '2020-02-28' month_val = pd.to_datetime(month_last) if a == 'New Date Value Found': if pd.to_datetime(x) > pd.to_datetime(y) or pd.to_datetime(x) > month_val: flip = day_month_flip(x) if flip <= pd.to_datetime(y) and flip <= month_val: return (flip, 'flipping solved the issue') else: return ('', 'by flipping issue not solved') else: return ('', '')

In this below code , It is a main method to takes order_dates from the excel sheets, using the above conditions , using same columns in the sheets comparing the dates with previous dates, then it checks the conditions for date issues and if the conditions are true, it takes dates into list and then creates the new columns, it will print the status of date issue in the status column.

def date_check(dates, safe_prev, month_val): print(dates,safe_prev) date_list = [] final_date = [] doc = dates[0] for i in range(1): if pd.isnull(dates[i]) or (isinstance(dates[i], datetime.date) == False): date_list.extend([dates[i], 3]) elif isinstance(dates[i], datetime.date): date_list.append(check_condtn1(dates[i], month_val)) if safe_prev.empty == False and isinstance(safe_prev.values[0][i], datetime.date): date_list.append(check_condtn2(dates[i], safe_prev.values[0][i], month_val)) elif safe_prev.empty == False and pd.isnull(safe_prev.values[0][i]) and isinstance(dates[i], datetime.date): print('catch') print([dates[i], 4]) date_list.append([dates[i], 4]) else: date_list.append([dates[i], 0]) # segregating the combined results into separate list to their respective date lisdoc_list = [date_list[i] for i in [0]] doc_list = [date_list[i] for i in [0,1]] print('date_list',date_list) codes = [date_list[i][1] for i in range(0,2)] if 2 in codes or 3 in codes: flag = 'Issue in one of the date' else: flag = 0 # Assigning the value for codes so that it can be populated in status column for lists in [doc_list]: list_code = [lists[i][1] for i in range(2)] if 2 in list_code: for cond_result in lists: if cond_result[1] == 2: final_date.extend([cond_result[0], 'Date has issue']) break elif sum(list_code) == 0: for cond_result in lists: if cond_result[1] == 0: final_date.extend([cond_result[0], 'Date is correct']) break elif 3 in list_code: for cond_result in lists: if cond_result[1] == 3: final_date.extend([cond_result[0], 'Invalid Date format']) break elif 4 in list_code: for cond_result in lists: if cond_result[1] == 4: print('second_catch') final_date.extend([cond_result[0], 'New Date Value Found']) break else: for cond_result in lists: if cond_result[1] == 1: final_date.extend([cond_result[0], 'Flipping fixed the issue']) break final_date.append(flag) return pd.Series(final_date)

Step 5 : Reading the Data from excel sheets from local /HDFS

In this code below we are reading the data from excel sheets of the latest data and old data and creating the new columns and fixed dates issues and status of them writing the file to the HDFS/ local.

def execute(): # Importing rawdata data = pd.read_excel('feb_order_data.xlsx') df = data # Reading previous month data month_last = '2020-02-28' month_last = pd.to_datetime(month_last) data_prev = pd.read_excel('jan_order_data.xlsx') date_cols = ['order_date', 'request_id', 'order_no'] # print(list(data_prev)) if data_prev.empty: data_prev = pd.DataFrame(columns=date_cols) data_comp = data_prev[date_cols] for i in date_cols: df[i] = df[i].apply(lambda x: pd.to_datetime(x, errors='ignore') if type(x) == str or type(x) == bytes else x) data_comp[i] = data_comp[i].apply(lambda x: pd.to_datetime(x, errors='ignore') if type(x) == str or type(x) == bytes else x) data_comp.replace({pd.NaT: np.nan}, inplace=True) new_cols = ['order_date_new', 'order_date_status', 'status'] print('check before') date_df = df.apply(lambda x: date_check(x[date_cols],data_comp[(data_comp['order_no'] == x[date_cols][2]) & (data_comp['request_id'] == x[date_cols][1])], month_last), axis=1) print('check after') date_df.columns = new_cols print('new_data columns', date_df.columns) df = pd.concat([df, date_df], axis=1) print('after append', list(df)) df = df.replace(r'\\n', ' ', regex=True) df = df.replace(r'\\N', ' ', regex=True) df = df.replace(r'\n', ' ', regex=True) df = df.replace(r'\r', ' ', regex=True) # writing the cleaned file to hdfs print(list(data)) df.to_excel('resultof_fixeddates.xlsx', encoding = 'utf-8') execute() The complete code is as follows : import pandas as pd import datetime import numpy as np # function to flip the month and date def day_month_flip(date_to_flip): return pd.to_datetime(date_to_flip.strftime('%Y-%d-%m %H:%M:%S')) def check_condtn1(date_to_check, month_val): if date_to_check > month_val: if date_to_check.day <= 12: flipped_date = day_month_flip(date_to_check) print('$$', flipped_date) if flipped_date <= month_val: return [flipped_date, 1] else: return [date_to_check, 2] else: return [date_to_check, 2] else: return [date_to_check, 0] def check_condtn2(date_to_check, date_to_check_prev, month_val): if date_to_check != date_to_check_prev: if date_to_check.day <= 12: flipped_date = day_month_flip(date_to_check) if flipped_date == date_to_check_prev and flipped_date <= month_val: return [flipped_date, 1] else: return [date_to_check, 2] else: return [date_to_check, 2] else: return [date_to_check, 0] def check_new_value(a, x, y): month_last = '2020-02-28' month_val = pd.to_datetime(month_last) if a == 'New Date Value Found': if pd.to_datetime(x) > pd.to_datetime(y) or pd.to_datetime(x) > month_val: flip = day_month_flip(x) if flip <= pd.to_datetime(y) and flip <= month_val: return (flip, 'flipping solved the issue') else: return ('', 'by flipping issue not solved') else: return ('', '') def date_check(dates, safe_prev, month_val): print(dates,safe_prev) date_list = [] final_date = [] doc = dates[0] for i in range(1): if pd.isnull(dates[i]) or (isinstance(dates[i], datetime.date) == False): date_list.extend([dates[i], 3]) elif isinstance(dates[i], datetime.date): date_list.append(check_condtn1(dates[i], month_val)) if safe_prev.empty == False and isinstance(safe_prev.values[0][i], datetime.date): date_list.append(check_condtn2(dates[i], safe_prev.values[0][i], month_val)) elif safe_prev.empty == False and pd.isnull(safe_prev.values[0][i]) and isinstance(dates[i], datetime.date): print('catch') print([dates[i], 4]) date_list.append([dates[i], 4]) else: date_list.append([dates[i], 0]) # segregating the combined results into separate list to their respective date lisdoc_list = [date_list[i] for i in [0]] doc_list = [date_list[i] for i in [0,1]] print('date_list',date_list) codes = [date_list[i][1] for i in range(0,2)] if 2 in codes or 3 in codes: flag = 'Issue in one of the date' else: flag = 0 # Assigning the value for codes so that it can be populated in status column for lists in [doc_list]: list_code = [lists[i][1] for i in range(2)] if 2 in list_code: for cond_result in lists: if cond_result[1] == 2: final_date.extend([cond_result[0], 'Date has issue']) break elif sum(list_code) == 0: for cond_result in lists: if cond_result[1] == 0: final_date.extend([cond_result[0], 'Date is correct']) break elif 3 in list_code: for cond_result in lists: if cond_result[1] == 3: final_date.extend([cond_result[0], 'Invalid Date format']) break elif 4 in list_code: for cond_result in lists: if cond_result[1] == 4: print('second_catch') final_date.extend([cond_result[0], 'New Date Value Found']) break else: for cond_result in lists: if cond_result[1] == 1: final_date.extend([cond_result[0], 'Flipping fixed the issue']) break final_date.append(flag) return pd.Series(final_date) def execute(): # Importing rawdata data = pd.read_excel('feb_order_data.xlsx') df = data # Reading previous month data month_last = '2020-02-28' month_last = pd.to_datetime(month_last) data_prev = pd.read_excel('jan_order_data.xlsx') date_cols = ['order_date', 'request_id', 'order_no'] # print(list(data_prev)) if data_prev.empty: data_prev = pd.DataFrame(columns=date_cols) data_comp = data_prev[date_cols] for i in date_cols: df[i] = df[i].apply(lambda x: pd.to_datetime(x, errors='ignore') if type(x) == str or type(x) == bytes else x) data_comp[i] = data_comp[i].apply(lambda x: pd.to_datetime(x, errors='ignore') if type(x) == str or type(x) == bytes else x) data_comp.replace({pd.NaT: np.nan}, inplace=True) new_cols = ['order_date_new', 'order_date_status', 'status'] print('check before') date_df = df.apply(lambda x: date_check(x[date_cols],data_comp[(data_comp['order_no'] == x[date_cols][2]) & (data_comp['request_id'] == x[date_cols][1])], month_last), axis=1) print('check after') date_df.columns = new_cols print('new_data columns', date_df.columns) df = pd.concat([df, date_df], axis=1) print('after append', list(df)) df = df.replace(r'\\n', ' ', regex=True) df = df.replace(r'\\N', ' ', regex=True) df = df.replace(r'\n', ' ', regex=True) df = df.replace(r'\r', ' ', regex=True) # writing the cleaned file to hdfs print(list(data)) df.to_excel('resultof_fixeddates.xlsx', encoding = 'utf-8') execute()

Output of the above code is an excel file which will be written to current location of execution of the code and it looks like below :

What Users are saying..

profile image

Savvy Sahai

Data Science Intern, Capgemini
linkedin profile url

As a student looking to break into the field of data engineering and data science, one can get really confused as to which path to take. Very few ways to do it are Google, YouTube, etc. I was one of... Read More

Relevant Projects

Build Classification Algorithms for Digital Transformation[Banking]
Implement a machine learning approach using various classification techniques in Python to examine the digitalisation process of bank customers.

Build CI/CD Pipeline for Machine Learning Projects using Jenkins
In this project, you will learn how to create a CI/CD pipeline for a search engine application using Jenkins.

Build a Review Classification Model using Gated Recurrent Unit
In this Machine Learning project, you will build a classification model in python to classify the reviews of an app on a scale of 1 to 5 using Gated Recurrent Unit.

Recommender System Machine Learning Project for Beginners-1
Recommender System Machine Learning Project for Beginners - Learn how to design, implement and train a rule-based recommender system in Python

Azure Text Analytics for Medical Search Engine Deployment
Microsoft Azure Project - Use Azure text analytics cognitive service to deploy a machine learning model into Azure Databricks

Learn to Build an End-to-End Machine Learning Pipeline - Part 1
In this Machine Learning Project, you will learn how to build an end-to-end machine learning pipeline for predicting truck delays, addressing a major challenge in the logistics industry.

Build a Churn Prediction Model using Ensemble Learning
Learn how to build ensemble machine learning models like Random Forest, Adaboost, and Gradient Boosting for Customer Churn Prediction using Python

Build a Multi-Class Classification Model in Python on Saturn Cloud
In this machine learning classification project, you will build a multi-class classification model in Python on Saturn Cloud to predict the license status of a business.

Multilabel Classification Project for Predicting Shipment Modes
Multilabel Classification Project to build a machine learning model that predicts the appropriate mode of transport for each shipment, using a transport dataset with 2000 unique products. The project explores and compares four different approaches to multilabel classification, including naive independent models, classifier chains, natively multilabel models, and multilabel to multiclass approaches.

Build a Graph Based Recommendation System in Python -Part 1
Python Recommender Systems Project - Learn to build a graph based recommendation system in eCommerce to recommend products.