How to read CSV data from the local system and replace and rename the columns?


This recipe shows how to read CSV data from the local system and replace and rename the columns.


Recipe Objective

In most big data scenarios, data cleaning is an integral part of the pipeline: a raw file is ingested and cleaning operations like the ones below are applied according to the use case. Without data cleaning, it is hard to standardize the structure of the data before storing it in a data lake, which in turn undermines any business decisions made on that data.

System requirements :

  • Install the required Python module if it is not already available:
  • pip install pandas
  • The below code can be run in a Jupyter notebook or any Python console

Step 1: Import the module

Import the pandas library (datetime from the standard library is imported as well, though pandas handles the date parsing below):

import pandas as pd
import datetime

Step 2 : Read the csv file

Read the CSV file from the local system, create a dataframe using pandas, and print the first 5 rows to check the data.

df = pd.read_csv('stockdata.csv')
df.head()

Output of the above code:

Step 3 : Change the date format

The code below parses the Date column from strings like "28-February-2015" into pandas datetime format (displayed as 2015-02-28).

df['Date'] = pd.to_datetime(df['Date'])
print(df['Date'])

Output of the above code:

Step 4 : Convert column names to lowercase

# convert the column names to lowercase (returns a new Index;
# assign back with df.columns = ... to apply the change)
df.columns.str.lower()

Output of the above code:

Step 5 : Replacing Empty spaces with underscore

In the code below we replace the empty spaces in the column names with underscores.

# replace spaces in all column names with underscores
# (returns a new Index; assign back to df.columns to apply)
df.columns.str.replace(' ', '_')

Output of the above code:

Step 6 : Rename the column names

Below we read the column names from the dataframe and then rename them to create a new dataframe. Print the columns to check:

df.columns
df.head()

Output of the above code:


Create a new dataframe by renaming the column names with the rename method, passing a key-value mapping, and print the first 5 rows.

df2 = df.rename(columns={'Date': 'stock_date',
                         'Open Price': 'open_price',
                         'High Price': 'high_price',
                         'Low Price': 'low_price',
                         'Close Price': 'close_price',
                         'WAP': 'weighted_avg_price',
                         'No.of Shares': 'num_of_shares',
                         'No. of Trades': 'num_of_trades',
                         'Total Turnover (Rs.)': 'tot_turnover_in_rupees',
                         'Deliverable Quantity': 'delvry_quantity',
                         '% Deli. Qty to Traded Qty': 'delvry_qty_to_traded_qty',
                         'Spread High-Low': 'spread_hi_low',
                         'Spread Close-Open': 'spread_close_open'}, inplace=False)
df2.head()

Output of the above code:

Step 7 : Check for Missing Values

To make detecting missing values easier, the pandas library provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects. They return True where the original value is null and False where it is not.
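As a minimal sketch, a null check might look like the following. The column names follow the rename step above, and the sample values are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# hypothetical sample with a missing close price (values are illustrative)
df2 = pd.DataFrame({
    'stock_date': pd.to_datetime(['2015-02-27', '2015-02-28']),
    'close_price': [1028.55, np.nan],
})

# isnull() returns a boolean DataFrame: True where a value is missing
print(df2.isnull())
```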


Output of the above code:

Another easy way is to pair isnull() with sum(), which gives the count of null values in each column.
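A quick sketch of this, again using made-up sample values on the renamed columns:

```python
import numpy as np
import pandas as pd

# hypothetical sample with two missing close prices (values are illustrative)
df2 = pd.DataFrame({
    'stock_date': pd.to_datetime(['2015-02-26', '2015-02-27', '2015-02-28']),
    'close_price': [1015.20, np.nan, np.nan],
})

# isnull().sum() counts the missing values per column
print(df2.isnull().sum())
```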


Output of the above code:

Step 8 : Filling Missing Data

The pandas library provides several methods for cleaning missing values. The fillna function can "fill in" NA values with non-null data in a couple of ways.

df2.close_price = df2.close_price.fillna(0)
print(df2.close_price.isnull().sum())

In the above code we fill the null values with 0.

Output of the above code:

Step 9 : Dropping missing data

Finally, you can delete the rows that contain missing data. We can delete all rows with any missing values as shown below.

df2 = df2.dropna()
df2.isnull().sum()

Another way: use drop to remove rows or columns by specifying label names and the corresponding axis, as shown below.

# drop requires labels; here we drop one column as an example
df2 = df2.drop(columns=['delvry_quantity'])
df2.isnull().sum()

Step 10 : Create csv file

After renaming the column names, write the formatted data into a CSV file on your local system or HDFS.

# write formatted data to csv
df2.to_csv("stock_data.csv")

Output of the above lines:
