How to read csv data from local system and replace and rename the columns?

This recipe helps you read csv data from the local system and replace and rename the columns.

Recipe Objective

In most big data scenarios, data cleaning is an integral part of the data pipeline: a raw file is ingested and cleaning operations like the ones below are applied to it according to the use case. Without data cleaning, it is hard to standardize the structure of the data before storing it in a data lake, which can have a drastic impact on business decisions based on that data.

System requirements :

  • Install the following Python module if it is not already available:
  • pip install pandas
  • The code below can be run in a Jupyter notebook or any Python console.

Step 1: Import the module

Import pandas (and datetime, which is useful when working with date columns):

import pandas as pd
import datetime

Step 2 : Read the csv file

Read the csv file from the local system, create a dataframe using pandas, and print the first 5 lines to check the data.

df = pd.read_csv('stockdata.csv')
df.head()


Step 3 : Change the date format

The code below parses the Date column from strings such as "28-February-2015" into proper pandas datetime values.

df['Date'] = pd.to_datetime(df['Date'])
print(df['Date'])
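Note that pd.to_datetime produces datetime values (displayed as 2015-02-28), not the "28-02-2015" string shown above; to get that exact day-month-year string, dt.strftime can be applied after parsing. A minimal sketch using inline toy data instead of stockdata.csv:

```python
import pandas as pd

# Toy frame standing in for the stock data's Date column
df = pd.DataFrame({'Date': ['28-February-2015', '01-March-2015']})

# Parse the verbose strings into datetimes, then format as DD-MM-YYYY strings
df['Date'] = pd.to_datetime(df['Date'], format='%d-%B-%Y')
df['Date'] = df['Date'].dt.strftime('%d-%m-%Y')
print(df['Date'].tolist())  # ['28-02-2015', '01-03-2015']
```

Passing an explicit format string to pd.to_datetime is optional but makes the parse unambiguous and faster on large files.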


Step 4 : Convert column names to lowercase

# convert the column names to lowercase (assign back to df.columns to apply the change)
df.columns.str.lower()


Step 5 : Replace spaces with underscores

In the code below, we replace the spaces in the column names with underscores.

# replace spaces in all column names with underscores
df.columns.str.replace(' ', '_')


Step 6 : Rename the column names

Below we read the column names from the dataframe, then rename them to create a new dataframe. Print the columns and the first rows to check:

df.columns
df.head()


Create a new dataframe by renaming the columns with the rename method, passing an old-name to new-name mapping, and print the first 5 lines.

df2 = df.rename(columns={
    'Date': 'stock_date',
    'Open Price': 'open_price',
    'High Price': 'high_price',
    'Low Price': 'low_price',
    'Close Price': 'close_price',
    'WAP': 'weighted_avg_price',
    'No.of Shares': 'num_of_shares',
    'No. of Trades': 'num_of_trades',
    'Total Turnover (Rs.)': 'tot_turnover_in_rupees',
    'Deliverable Quantity': 'delvry_quantity',
    '% Deli. Qty to Traded Qty': 'delvry_qty_to_traded_qty',
    'Spread High-Low': 'spread_hi_low',
    'Spread Close-Open': 'spread_close_open'
}, inplace=False)
df2.head()


Step 7 : Check for Missing Values

To make detecting missing values easier, the pandas library provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects. isnull() returns True where the original value is null and False where it isn't.
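The screenshot for this step did not survive; a minimal sketch of isnull() and notnull() on a toy column (the data is illustrative, not the actual stock file):

```python
import pandas as pd
import numpy as np

# Toy frame with one missing close price
df2 = pd.DataFrame({'close_price': [100.5, np.nan, 101.2]})

print(df2['close_price'].isnull())   # True only on the missing row
print(df2['close_price'].notnull())  # the inverse boolean mask
```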



Another easy way is to pair isnull() with sum(), which gives the count of null values in each column.
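A sketch of the isnull().sum() pattern on toy data (column names are illustrative):

```python
import pandas as pd
import numpy as np

df2 = pd.DataFrame({
    'close_price': [100.5, np.nan, 101.2],
    'open_price': [99.0, 100.1, np.nan],
})

# sum() treats True as 1, so this yields a per-column count of nulls
print(df2.isnull().sum())
```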



Step 8 : Filling Missing Data

The pandas library provides several methods for cleaning missing values. The fillna function can "fill in" NA values with non-null data in a couple of ways.

df2.close_price = df2.close_price.fillna(0)
print(df2.close_price.isnull().sum())

The code above fills the null values in close_price with 0, so the printed null count is 0.
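Besides a constant like 0, fillna-style filling can carry the last valid observation forward, which is often more sensible for price series. A toy sketch (the values are illustrative):

```python
import pandas as pd
import numpy as np

s = pd.Series([100.5, np.nan, np.nan, 101.2])

print(s.fillna(0).tolist())  # replace gaps with a constant: [100.5, 0.0, 0.0, 101.2]
print(s.ffill().tolist())    # forward-fill: carry the last valid price into the gaps
```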

Step 9 : Dropping missing data

Finally, you can delete the rows with missing data. The code below drops every row that contains any missing value.

df2 = df2.dropna()
df2.isnull().sum()
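dropna can also be scoped more precisely than "drop any row with any gap". A sketch of two common variants on toy data (column names are illustrative):

```python
import pandas as pd
import numpy as np

df2 = pd.DataFrame({
    'close_price': [100.5, np.nan, 101.2],
    'volume': [1000, 2000, np.nan],
})

# Drop rows only when close_price itself is missing
print(df2.dropna(subset=['close_price']))

# Keep only rows with at least 2 non-null values
print(df2.dropna(thresh=2))
```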

Another way: use drop to remove rows or columns by specifying label names and the corresponding axis, as shown below. Note that drop requires explicit labels; calling it with no arguments raises an error.

# for example, drop a sparsely populated column by name
df2 = df2.drop(columns=['delvry_quantity'])
df2.isnull().sum()

Step 10 : Create csv file

After renaming the columns, write the formatted data to a csv file on your local system or HDFS.

# write the formatted data to csv
df2.to_csv("stock_data.csv")

