How to read csv data from local system and replace and rename the columns?
BIG DATA RECIPES DATA CLEANING PYTHON DATA MUNGING MACHINE LEARNING RECIPES PANDAS CHEATSHEET     ALL TAGS

How to read csv data from local system and replace and rename the columns?

How to read csv data from local system and replace and rename the columns?

This recipe helps you read csv data from local system and replace and rename the columns

0

Recipe Objective

In most of the big data scenarios, data cleaning task is an integral part of a data pipeline in which the raw file is taken and most of the below cleaning operations are done on the data according to the use case. Without data cleaning being done, it's hard to standardize the structure of the data and store it into data lakes which will have a drastic impact on the business decisions on the data.

System requirements :

  • Install the python module as follows if the below modules are not found:
  • pip install requests
  • The below codes can be run in Jupyter notebook, or any python console

Step 1: Import the module

To import

import pandas as pd import datetime

Step 2 : Read the csv file

Read the csv file from the local and create a dataframe using pandas, and print the 5 lines to check the data.

df = pd.read_csv('stockdata.csv') df.head()

Output of the above code:

Step 3 : Change the date format

In this below code changing the date column format "28-February-2015" to 28-02-2015.

df['Date'] = pd.to_datetime(df['Date']) print(df['Date'])

Output of the above code:

Step 4 : Convert column names to lowercase

#To convert the column names lowercase df.columns.str.lower()

Output of the above code:

Step 5 : Replacing Empty spaces with underscore

In the code below we are going to replacing the empty spaces in the column names with underscore

# to put underscore in all columns ### Replacing Empty spaces with underscore df.columns.str.replace(' ','_')

Output of the above code:

Step 6 : Rename the column names

In the below we are reading column names from the dataframe and rename the column names to create a new dataframe. Print the columns to check :

df.columns df.head()

Output of the above code:

Output of the above code:

Create a new dataframe rename column names using rename method in key value format and print the 5 lines.

df2 = df.rename(columns = {'Date' :'stock_date', 'Open Price':'open_price', 'High Price':'high_price', 'Low Price':'low_price', 'Close Price':'close_price', 'WAP':'weighted_avg_price', 'No.of Shares':'num_of_shares', 'No. of Trades':'num_of_rades', 'Total Turnover (Rs.)':'tot_turnover_in_rupees','Deliverable Quantity':'delvry_quantity', '% Deli. Qty to Traded Qty':'delvry_qty_to_traded_qty', 'Spread High-Low':'spread_hi-low', 'Spread Close-Open':'spread_close-open'}, inplace = False) df2.head()

Output of the above code:

Step 7 : Check for Missing Values

To make detecting missing values, Pandas library provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects.True if the original value is null, False if the original value isn't null

df2.isnull()

Output of the above code:

Another way it can easily be done is by using the isnull function paired with the 'sum' function. it will give the count of null values from each column.

df2.isnull().sum()

Output of the above code:

Step 8 : Filling Missing Data

Pandas library provides different types of methods for cleaning the missing values. The fillna function can "fill in" NA values with non-null data in a couple of ways.

df2.close_price = df2.close_price.fillna(0) print(df2.close_price.isnull().sum())

Output of the above code: In the above code filling the null values with 0

Step 9 : Dropping missing data

The final thing you can do, is to delete the missing rows of data. We can delete all rows with any missing data as shown below.

df2= df2.dropna() df2.isnull().sum()

Other way: Use the drop to Remove rows or columns by specifying label names and corresponding axis data as shown below.

df2= df2.drop() df2.isnull().sum()

Step 10 : Create csv file

After renaming the column names write the formatted data into a csv file in your local or hdfs. # write formatted to csv df2.to_csv("stock_data.csv")

Output of the above lines:

Relevant Projects

Implementing Slow Changing Dimensions in a Data Warehouse using Hive and Spark
Hive Project- Understand the various types of SCDs and implement these slowly changing dimesnsion in Hadoop Hive and Spark.

Yelp Data Processing Using Spark And Hive Part 1
In this big data project, we will continue from a previous hive project "Data engineering on Yelp Datasets using Hadoop tools" and do the entire data processing using spark.

Event Data Analysis using AWS ELK Stack
This Elasticsearch example deploys the AWS ELK stack to analyse streaming event data. Tools used include Nifi, PySpark, Elasticsearch, Logstash and Kibana for visualisation.

Design a Hadoop Architecture
Learn to design Hadoop Architecture and understand how to store data using data acquisition tools in Hadoop.

Tough engineering choices with large datasets in Hive Part - 2
This is in continuation of the previous Hive project "Tough engineering choices with large datasets in Hive Part - 1", where we will work on processing big data sets using Hive.

Online Hadoop Projects -Solving small file problem in Hadoop
In this hadoop project, we are going to be continuing the series on data engineering by discussing and implementing various ways to solve the hadoop small file problem.

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project-Get a handle on using Python with Spark through this hands-on data processing spark python tutorial.

Hadoop Project for Beginners-SQL Analytics with Hive
In this hadoop project, learn about the features in Hive that allow us to perform analytical queries over large datasets.

Airline Dataset Analysis using Hadoop, Hive, Pig and Impala
Hadoop Project- Perform basic big data analysis on airline dataset using big data tools -Pig, Hive and Impala.

Real-time Auto Tracking with Spark-Redis
Spark Project - Discuss real-time monitoring of taxis in a city. The real-time data streaming will be simulated using Flume. The ingestion will be done using Spark Streaming.