
Data Analysis Workflow with R Packages

Any data analysis exercise starts with data procurement and ends with presenting results in the most intuitive way possible. Before starting the data analysis workflow in R, let’s clarify what we mean by the term data and what the typical sources of data are.

Data is simply information in any available form. Per-day sales for a retail chain are data, images you post on social media sites are data, and videos on YouTube are data. The point is that data is not always available in the form of rows and columns; most of it is available in an unstructured format.

In this article we’ll mostly focus on structured data and walk through the data analysis workflow using some of the best R packages, but we’ll also touch upon sourcing unstructured data.

Data Analysis

The data analysis workflow has 4 basic components:

1) Importing Data 

The first ingredient in your data analysis recipe is data. There are multiple ways to import data into an R session: you can read data from the web, from databases or from local files. read.csv() and read.table() are two frequently used functions for reading local files. You might also look into ODBC connections to connect to databases and use their data directly in your work.

2) Data Manipulation

Raw data is not always in good shape to work on. Often you need to subset, roll up or aggregate data, or treat missing values and outliers. There are various good R packages for specific tasks, e.g., the mice package can be used to impute missing values, the dplyr package can be used for various data manipulations and reshape2 comes in handy for changing the shape of your data.

3) Data Visualization

Whether you are working on a descriptive or prescriptive project, you will need visualization to help you out. Along with base R plots, ggplot2 and ggvis are two good R packages which are widely used by data scientists. ggvis is one of the best R packages for building interactive charts without much effort.

4) Data Reporting

The last part of the workflow is reporting your data or findings. If you need an interactive report you might choose Shiny apps, but if your objective is just to build a PDF or HTML page with static graphs, your best choice will be R Markdown.

Importing Data in R Language

Let’s get started

When you start working in R, it’s good practice to set your working directory. The working directory is simply the folder in which R reads and saves files by default. It can be set up in two ways:

1) Using UI wizard:

In RStudio, go to Session > Set Working Directory > Choose Directory, or alternatively use the Ctrl + Shift + H shortcut to select the working directory.

Using UI Wizard in R Language

2) Using R command:

Assuming you are on a Windows machine, you can use any of the following commands to set your working directory:


These commands set the working directory to the “R” folder of the user Dezyre. Note that R will not create the folder if it doesn’t exist; to be set as the working directory, the folder must already exist.
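As a rough illustration of setting the working directory from code, here is a minimal sketch. The folder used is hypothetical and is created under tempdir() only so that the example runs on any machine; it is not the path used in this article.

```r
# Hypothetical folder, built under tempdir() so the sketch runs anywhere.
# On Windows a real path can be written with forward slashes ("C:/...")
# or with doubled backslashes ("C:\\...").
wd <- file.path(tempdir(), "R")

# setwd() fails if the folder is missing, so create it first when needed.
if (!dir.exists(wd)) dir.create(wd, recursive = TRUE)

old_wd <- setwd(wd)    # setwd() invisibly returns the previous directory
current_wd <- getwd()  # confirm where we are now
setwd(old_wd)          # restore the original working directory
```

Keeping the value returned by setwd() makes it easy to switch back once you are done.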

Let’s get data from Internet

R allows you to download data directly and use it for your analysis. download.file() is a very useful function which lets you do the job.

Assume that we want to study house price index variation in India. We have two options: either download the file from data.gov.in and then read it into R, or source it directly from the web.

The following lines of code fetch the data directly from the website and save it for you.

url <- "https://data.gov.in/sites/default/files/datafile/housing_price_index_2010-11_100.csv"
download.file(url, destfile = "housing_data.csv", method = "curl")

First we saved the URL in an object and passed it to download.file(), which then saved the file housing_data.csv in the current working directory.

The following line reads the data into the current R session so that we can inspect it:

housing_data <- read.csv("housing_data.csv")

Reading Data in R Language

Let’s read the local files

You don’t always need to download data from the internet; you might already have data on your local machine which you would like to read.

In the last example, we downloaded housing_data.csv into the working directory and read it into R using the read.csv() function.

The read.table() function is used to read local files. It has the following important arguments:

  • file: the path to the file you want to read (just the file name if the file is in the working directory)
  • header: tells R whether the data contains a header row; by default it is FALSE
  • sep: how the fields are separated, e.g., CSV files are separated by commas
  • na.strings: the strings which represent missing values
  • nrows: the number of rows R should read, e.g., nrows = 10 tells R to read 10 rows only
  • skip: the number of lines R should skip at the start of the file before reading

housing_data2 <- read.table(file = "housing_data.csv", sep = ",", header = TRUE, na.strings = "NA")

The above line of code reads the downloaded data and stores it in the housing_data2 object.


Viewing Housing Data after read.csv

read.csv() is a special case of read.table() which sets sep = "," and header = TRUE by default.
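To see these arguments in action, here is a small sketch on a made-up file. The file contents and object names are invented for illustration and are written to a temporary file, so nothing here depends on the housing data.

```r
# Write a small, made-up CSV file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("city,q1,q2", "A,100,NA", "B,120,130", "C,140,150"), csv_path)

# read.table() needs sep and header spelled out ...
via_table <- read.table(csv_path, sep = ",", header = TRUE, na.strings = "NA")
# ... while read.csv() sets them by default.
via_csv <- read.csv(csv_path)

# nrows limits how many data rows are read; skip drops lines from the top.
first_two <- read.table(csv_path, sep = ",", header = TRUE, nrows = 2)
no_header <- read.table(csv_path, sep = ",", header = FALSE, skip = 1)
```

For a simple file like this one, via_table and via_csv should hold the same data; no_header gets the default column names V1, V2 and V3 because the header line was skipped.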

Loading Files in R Language

You could also use the import wizard located in the workspace pane of RStudio to load files.

Data Manipulation in R Language

Look into your data

Once you have the data in your R session, the next task is to look at a summary of it. This is done mainly to check that the imported data is of the right quality and follows the business .

Another primary task is to check the data types. For example, we expect the housing indexes to be numeric; if they are not, we’ll have to convert them.

This can be done by applying the class() function to every column of our dataset with sapply().

sapply(housing_data, class)

Usage of sapply Function

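The output shows that all our variables are numeric, as expected. Otherwise we could use the as.numeric() function to convert them from character to numeric values (beware of factor values, for which as.numeric() returns the internal level codes rather than the original numbers).

The summary() function in R shows the five-number summary (minimum, first quartile, median, third quartile and maximum) along with the mean for each numeric variable in your dataset.

Let’s see what our data looks like: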
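The factor caveat is easy to trip over, so here is a tiny sketch with made-up values showing the difference between converting a factor directly and converting it via character.

```r
# A factor stores integer codes behind its labels.
index_factor <- factor(c("150", "95", "210"))

# Direct conversion returns the level codes, not the values.
wrong <- as.numeric(index_factor)

# Converting to character first recovers the original numbers.
right <- as.numeric(as.character(index_factor))
```

Here wrong holds the position of each value among the sorted levels, while right holds 150, 95 and 210.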


summary() function usage explained

Based on the summary, our data looks good; there seem to be no missing values.

If your objective is to study the relationship between two variables or among a set of variables, you would want to look at the correlation between them.

The PerformanceAnalytics package in R can be used to study the relationships between variables. It gives you a visual picture of the association among variables.
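As a hedged sketch of this kind of check, the following uses a made-up data frame (not the housing data) to show summary() alongside an explicit count of missing values per column.

```r
# Made-up data with one missing value, for illustration only.
toy <- data.frame(
  city_a = c(100, 110, 125, 140),
  city_b = c(90, NA, 105, 118)
)

# Per-column summary; columns with missing values also report an NA count.
summary(toy)

# Explicit count of missing values in each column.
missing_per_column <- colSums(is.na(toy))
```

Counting the NA values directly is handy when a dataset has too many columns to scan the summary by eye.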
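If you only need the numbers rather than a chart, base R’s cor() is a simple stand-in. The sketch below uses made-up columns and is not the PerformanceAnalytics approach itself.

```r
# Made-up index values for three hypothetical cities.
toy <- data.frame(
  city_a = c(100, 110, 125, 140),
  city_b = c(98, 112, 121, 143),
  city_c = c(150, 140, 128, 119)
)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(toy)
round(cor_matrix, 2)
```

Values close to 1 or -1 indicate a strong linear relationship; in this toy data city_a and city_c move in opposite directions, so their correlation is negative.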

Can data be manipulated?

Now you might need to manipulate your data. Manipulation includes creating new variables, transforming data, treating missing values, treating outliers, etc.

Missing value treatment is mostly context dependent, but there are a few great R packages which help you impute missing values. mice, for example, is one of the best R packages for the task.

To move forward, let’s assume that we are interested in housing index data from 2012 onwards. This can be achieved by writing:
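To make the idea of imputation concrete, here is a deliberately simple sketch that fills missing values with the median. This is a much cruder technique than what mice does and is shown only as an illustration on made-up values.

```r
# Made-up index values with two gaps.
index_values <- c(100, NA, 120, 150, NA)

# Median of the observed values, ignoring the gaps.
fill_value <- median(index_values, na.rm = TRUE)

# Replace each missing value with that median.
imputed <- ifelse(is.na(index_values), fill_value, index_values)
```

Whether a single fill value is acceptable depends on the context; for anything beyond a quick look, a dedicated imputation package is usually the better choice.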

housing_data_2012 <- housing_data[,c(1,5:11)]

Data Manipulation in R Language

Now let’s assume we want to transpose our data (rows to columns). This can be achieved using the t() function in R:

housing_data_transpose <- as.data.frame(t(housing_data_2012[, -1]), row.names = FALSE)
# col_names is assumed to be a character vector of new column names, defined beforehand
colnames(housing_data_transpose) <- col_names

Usage of Transpose t() function in R Language

We have used a rather crude method to transpose the data; packages like reshape and reshape2 do the job more cleanly.

Next, we want to create a new variable which counts how many values have crossed a housing index of 150 in each quarter:
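For readers who want to try the transpose step in isolation, here is a self-contained sketch on a made-up table. The column names and values are assumptions for illustration; the real housing file may be laid out differently.

```r
# Made-up wide table: one row per series, one column per quarter.
wide <- data.frame(
  city = c("All India", "Chennai", "Delhi"),
  q1 = c(110, 120, 105),
  q2 = c(118, 131, 109),
  stringsAsFactors = FALSE
)

# Transpose only the numeric columns; t() returns a matrix.
long <- as.data.frame(t(wide[, -1]))

# Use the series names as column names and keep the quarter as a column.
colnames(long) <- wide$city
long <- cbind(quarter = rownames(long), long, stringsAsFactors = FALSE)
rownames(long) <- NULL
```

Leaving the name column out before calling t() matters: transposing a mix of text and numbers would turn every value into a character string.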

housing_data_transpose$count_150 <- apply(housing_data_transpose, MARGIN=1, FUN=function(x) length(which(x[c(-1,-2)]>150)))

Notice that the $ operator can also be used to create a new variable. Here we have used the apply() function to go over each row, exclude the 1st and 2nd entries (quarter and All India) and count the occurrences where the housing index has crossed 150. The apply family in R (apply, mapply, sapply, tapply, lapply, etc.) is a substitute for explicit loops.
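The same counting idea can be sketched on a small made-up table. The column names below are assumptions chosen to mirror the description above, and the sketch selects the city columns by name so that apply() only ever sees numbers.

```r
# Made-up table: one row per quarter, one column per series.
indexes <- data.frame(
  quarter = c("q1", "q2", "q3"),
  `All India` = c(140, 152, 160),
  Chennai = c(145, 155, 170),
  Delhi = c(120, 149, 151),
  check.names = FALSE
)

# Work on the city columns only, so apply() gets a purely numeric matrix.
city_cols <- setdiff(names(indexes), c("quarter", "All India"))

# Row-wise count of values above 150.
indexes$count_150 <- apply(indexes[, city_cols], MARGIN = 1,
                           FUN = function(x) sum(x > 150))

# A vectorised alternative that gives the same counts.
count_vectorised <- rowSums(indexes[, city_cols] > 150)
```

Passing a data frame that still contains a text column to apply() would coerce every value to character, which makes a comparison with 150 unreliable, so it is safer to drop such columns first.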

Usage of apply family of functions in R Language

Data Visualization in R Language

Let’s visualize housing data

One of the major tasks in an R workflow is visualizing your data. Data visualization is one of R’s strengths.

Along with base R plotting options, there are various good R packages, like ggplot2 and ggvis, to plot your data.

A boxplot gives you a summary view of numerical data. Let’s see how our housing indexes varied across quarters.


Data Visualization in R Language using Boxplot
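As a minimal sketch of a base R boxplot, the example below uses made-up values with one column per hypothetical city, so each box summarises that city’s index across quarters. It is not the exact call behind the figure above.

```r
# Made-up index values: one column per city, one row per quarter.
indexes <- data.frame(
  Chennai = c(145, 155, 170, 162),
  Delhi = c(120, 149, 151, 158),
  Mumbai = c(130, 138, 147, 171)
)

# boxplot() draws one box per column and invisibly returns the statistics.
box_stats <- boxplot(indexes,
                     main = "Housing index by city",
                     ylab = "Housing index")
```

The returned list holds the values behind each box (in box_stats$stats), which is useful when you want the numbers as well as the picture.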

Let’s see how the housing index progressed over the quarters for All India.

plot(housing_data_transpose$`All India`)

Data Visualization in R Language-Learn to do plotting

The dots in the above graph show the values for the various quarters.

plot(housing_data_transpose$`All India`, type = "l")

The call above plots the same graph as a line:

Data Visualization in R Language using Line Graph

Let’s compare the prices across quarters for Chennai using the ggplot2 package, which is one of the best R packages:

library(ggplot2)
ggplot(housing_data_transpose, aes(quarter, Chennai, fill = Chennai)) +
  geom_bar(stat = "identity", position = "dodge")

Creating a Boxplot in R Language

In the above graph, the X-axis is not in an intuitive order; it is sorted alphabetically. Work on changing the order of the data and plot the right graph using ggplot. Let us know if you have been able to get it right.
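One possible way to approach this, sketched with made-up quarter labels, is to turn the labels into a factor whose levels follow the order in which they appear in the data. The label format below is an assumption; the real file may use different names.

```r
# Made-up labels, already in chronological order.
quarters <- c("Q4 2012", "Q1 2013", "Q2 2013", "Q3 2013")

# By default factor levels are sorted alphabetically ...
default_order <- levels(factor(quarters))

# ... but explicit levels keep the order in which the labels appear.
quarter_factor <- factor(quarters, levels = unique(quarters))
```

A discrete axis in ggplot2 follows the factor levels, so mapping a factor built this way to the X-axis should give the chronological order.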

Data Reporting in R Language

Reproducible Research

Many times you need to hand your work over to a colleague so that he/she can reuse it, or you just want to produce a report for your boss to look into.

There are multiple ways to report your work in R, depending on the requirement.

If we want to save our R code so that we can reuse it directly, we can save the code in a file and just call it next time instead of writing it again from scratch.

We are going to save the following lines of code in a file named rprogram_1.R:

url <- "https://data.gov.in/sites/default/files/datafile/housing_price_index_2010-11_100.csv"
download.file(url, destfile = "housing_data.csv", method = "curl")

Next time we want to read the housing data, we need not write the above code again; we just source the file and continue from there.

source("rprogram_1.R")
housing_data_new <- read.csv("housing_data.csv")

This is very useful when you write functions or modules which are frequently used in your work.
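Here is a small sketch of that pattern. The script name and the helper function are hypothetical and exist only for this illustration; the script is written to a temporary folder so the example is self-contained.

```r
# Write a tiny script containing one hypothetical helper function.
script_path <- file.path(tempdir(), "helpers_demo.R")
writeLines(c(
  "# Hypothetical helper used only for this illustration",
  "count_above <- function(x, threshold = 150) sum(x > threshold, na.rm = TRUE)"
), script_path)

# source() runs the script, which makes the helper available in the session.
source(script_path)

result <- count_above(c(140, 155, 170, NA))
```

Keeping frequently used functions in a script like this means every new analysis can start with a single source() call.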

Sometimes you want not only the code but also the entire data to be saved. This can easily be done by saving the entire workspace and loading it later.

How to save the entire workspace in R UI Wizard

Using the save (floppy icon) button in the Environment pane, you can save the entire workspace and load it using the open button the next time you open RStudio.

If your requirement is to pass on your work in the form of a report, you can use R Markdown, which gives you a good way to present your work.
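The same thing can be done from code. The sketch below uses a made-up object and a temporary file name, both assumptions for illustration, and loads the saved workspace into a separate environment so that nothing in the current session is overwritten.

```r
# A made-up object so that the workspace has something to save.
housing_demo <- data.frame(quarter = c("q1", "q2"), index = c(110, 118))

# Save every object in the workspace to a single file.
image_path <- file.path(tempdir(), "workspace_demo.RData")
save.image(file = image_path)

# load() restores the objects and returns their names.
restored <- new.env()
loaded_names <- load(image_path, envir = restored)
```

Calling load(image_path) without the envir argument restores the objects straight into the current workspace, which is the closest match to the open button.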

Creating an R Markdown File

Let’s create a new markdown file.

After creating your R Markdown file, paste the following code into it:

---
title: "housing_data"
author: "Dezyre.com"
date: "15 May 2016"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see .
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. 

We will be using our housing_data example to produce this report.
You can embed an R code chunk like this:

```{r}
url <- "https://data.gov.in/sites/default/files/datafile/housing_price_index_2010-11_100.csv"
download.file(url, destfile = "housing_data.csv", method = "curl")
housing_data <- read.csv("housing_data.csv")
```

We have also learnt to use read.table() function to read data

```{r}
housing_data_2 <- read.table("housing_data.csv", header = TRUE, sep = ",")
```

### Manipulation

Then we manipulated data a little bit

```{r}
housing_data_2012 <- housing_data[, c(1, 5:11)]
housing_data_transpose <- as.data.frame(t(housing_data_2012[, -1]), row.names = F)
colnames(housing_data_transpose) <- col_names

# Finding number of occurrences with Housing Index more than 150
housing_data_transpose$count_150 <- apply(housing_data_transpose, MARGIN = 1, FUN = function(x) length(which(x[c(-1, -2)] > 150)))
```

### Plots
Plots are beautiful and easy to visualize or communicate through data

#Box Plot

Once you have created the R Markdown file, you can click on Knit HTML to get a shareable HTML file, which looks like this:




