Setting up the R Environment
R is a language and software environment for statistical computing. Data imported into R is usually some variation of a spreadsheet-like text file, and a plain text file is the easiest form of data to import. When starting a new R session, it is good practice to remove unused objects that may have been restored from a previous session. This command lists all the objects in the current environment:
> ls() # this will display the objects in the current environment.
To remove specific unused objects, for example Object1 and Object2, use the following command:
> rm(list = c('Object1', 'Object2')) # remove specific objects
To remove all objects from memory, use the following command:
> rm(list = ls()) # remove all objects
The reason for this routine is that R keeps objects in memory, and large files take up a lot of it. Before importing data from files, it is better to free memory held by unused objects. In your working directory, create a folder named Data, and in this folder create a file named RandomFile.txt with the following content. The column values are separated by whitespace.
Id Name
1 Raj
2 Ravi
3 Tom
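If you would rather create the folder and the file from within R than in a text editor, the following is a minimal sketch. It assumes the current working directory is the project folder; file_path is only a local variable name chosen for this illustration.

```r
# Create the Data folder under the current working directory
# (showWarnings = FALSE keeps this quiet if the folder already exists)
dir.create("Data", showWarnings = FALSE)

# Path to the sample file used in the rest of this article
file_path <- file.path("Data", "RandomFile.txt")

# Write the header line and the three records, separated by single spaces
writeLines(c("Id Name", "1 Raj", "2 Ravi", "3 Tom"), file_path)

# Confirm that the file exists and show its raw contents
file.exists(file_path)
readLines(file_path)
```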
Before importing this file into a data frame, verify the current working directory of the R environment. To view the current working directory, run:
> getwd() # get current working directory
If this working directory differs from the project folder, set the current working directory in the R environment with the following command, replacing … with the path to your project folder:
> setwd('…') # set current working directory
Importing Data from File
The principal function for reading data into R is read.table, which reads a file and creates a data frame from it. The convenience functions read.csv and read.delim call read.table with defaults suited to CSV and tab-delimited files. For a small file, you can call read.table with just the file argument; R uses the default values of the remaining arguments to load the data frame.
> data <- read.table('Data/RandomFile.txt') # create data frame object
To print the data frame created by reading RandomFile.txt, use:
> print(data) # display the data frame details
On executing the above code we get the following result:
  V1   V2
1 Id Name
2  1  Raj
3  2 Ravi
4  3  Tom
Two issues are apparent with this data frame:
- The first line has not been treated as a header
- Both columns have been read as factors by default, instead of the first column being numeric and the second character (this is the default in R versions before 4.0.0; from R 4.0.0 onward both columns are read as character, which is still wrong for the Id column)
To check the class of each column, run the code below:
> class(data[, 1]) # get the class of the first column
[1] "factor"
> class(data[, 2]) # get the class of the second column
[1] "factor"
To rectify this, inspect the arguments of the read.table function. To see the description, arguments and usage of read.table, type the following command in the R environment:
> ?read.table # display the description and usage of the function
The most important arguments of read.table are generally:
- file, the name of the file
- header, TRUE if the file has a header line and FALSE otherwise
- sep, a string defining how the fields in each line are separated
- colClasses, a character vector that indicates the class of each column
- nrows, numeric value for the maximum number of rows to read
- skip, numeric value for the number of lines to skip from the beginning of the file
- stringsAsFactors, TRUE if character columns are to be converted to factors and FALSE otherwise
Based on the above description, the modified read.table call that correctly sets the header and the classes of the columns is:
> data <- read.table('Data/RandomFile.txt', header = TRUE, colClasses = c("numeric", "character"))
We can now verify that these arguments have rectified the errors with the following commands:
> class(data[, 1]) # get the class of the first column
[1] "numeric"
> class(data[, 2]) # get the class of the second column
[1] "character"
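The nrows, skip and stringsAsFactors arguments can be tried out in the same way. The sketch below is self-contained: it writes the sample data to a temporary file instead of using Data/RandomFile.txt, and the object names first_two and no_header are chosen only for this illustration.

```r
# Write the sample data to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c("Id Name", "1 Raj", "2 Ravi", "3 Tom"), tmp)

# Read the header line plus only the first two records
first_two <- read.table(tmp, header = TRUE, nrows = 2,
                        colClasses = c("numeric", "character"))

# Skip the header line and supply the column names by hand instead;
# stringsAsFactors = FALSE keeps the Name column as character in any R version
no_header <- read.table(tmp, header = FALSE, skip = 1,
                        col.names = c("Id", "Name"),
                        stringsAsFactors = FALSE)

# Inspect the structure of both results
str(first_two)
str(no_header)
```

Because colClasses is not given in the second call, read.table infers the column types itself, so Id comes back as integer rather than numeric.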
The read.csv function is identical to read.table except for its defaults, notably a comma as the separator (sep = ",") and header = TRUE. Similarly, read.delim is identical except that its default separator is a tab (sep = "\t"), again with header = TRUE. To try read.csv, create an additional text file, RandomFile1.txt, in which the column values are separated by commas. RandomFile1.txt has the following content:
Id,Name
1,Raj
2,Ravi
3,Tom
The read.csv function would be able to load this file into a data frame with a very similar function call as the read.table.
> data <- read.csv('Data/RandomFile1.txt', header = TRUE, colClasses = c("numeric", "character"))
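To see how the three functions relate, the following sketch reads the same records with read.csv, read.delim and an equivalent read.table call. It uses temporary files rather than the files in the Data folder, and all object names are illustrative.

```r
# A comma-separated and a tab-separated copy of the sample data
csv_file <- tempfile(fileext = ".txt")
writeLines(c("Id,Name", "1,Raj", "2,Ravi", "3,Tom"), csv_file)
tab_file <- tempfile(fileext = ".txt")
writeLines(c("Id\tName", "1\tRaj", "2\tRavi", "3\tTom"), tab_file)

# read.csv and read.delim already assume a header line and the right separator
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)
from_tab <- read.delim(tab_file, stringsAsFactors = FALSE)

# The same result from read.table, with header and separator spelled out
from_table <- read.table(csv_file, header = TRUE, sep = ",",
                         stringsAsFactors = FALSE)

# Compare the three data frames
all.equal(from_csv, from_tab)
all.equal(from_csv, from_table)
```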
Performance Issues with the read.table function in the utils Package:
Since R loads the whole data frame into memory, a few tips help to improve performance:
- Remove comment lines from the text file and set comment.char = "" in the read.table call
- Set colClasses to the expected type of each column
- Set the nrows argument
As a rough rule of thumb, R may need up to twice the size of the file in memory to load the data frame. For example, if RandomFile.txt has a file size of 100 MB, one can expect memory utilisation of between 80 MB and 200 MB, depending on the column types. The read.table, read.csv and read.delim functions are designed to create data frames whose columns may have different classes, so they are not the right tool for reading large matrices. They can use a surprisingly large amount of memory when reading large files.
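The three tips can be combined as in the sketch below. It is one possible approach, not the only one: a small sample of rows is read first so that R can infer the column classes, and those classes are then passed to the full read. The sample file is written to a temporary location, and the names sample_rows, col_classes, n_rows and full are illustrative.

```r
# Write the sample data to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c("Id Name", "1 Raj", "2 Ravi", "3 Tom"), tmp)

# Step 1: read a small sample and let R infer the column classes
sample_rows <- read.table(tmp, header = TRUE, nrows = 2,
                          stringsAsFactors = FALSE)
col_classes <- sapply(sample_rows, class)

# Step 2: the number of data rows; known here, and for a large file
# a mild over-estimate is acceptable
n_rows <- 3

# Step 3: read the full file with colClasses, nrows and comment.char set
full <- read.table(tmp, header = TRUE,
                   colClasses = col_classes,
                   nrows = n_rows,
                   comment.char = "")

# Report how much memory the resulting data frame occupies
print(object.size(full), units = "Kb")
```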
The readr package:
Hadley Wickham and the RStudio team created the readr package, which provides replacements for the read.table family of functions. Compared with the utils functions, readr offers additional functionality and greater speed: in reported tests, its functions have been 10 to 30 times faster. It also shows a helpful progress bar that indicates the percentage complete. The code to call the read_table function from the readr package is as follows (read_csv is called in the same way):
> library("readr") # load the readr package in the environment
> data <- read_table('Data/RandomFile.txt') # create data frame object
> class(data[[1]]) # get the class of the first column; [[ ]] extracts the column itself
[1] "numeric"
> class(data[[2]]) # get the class of the second column
[1] "character"
The code above shows that, without any additional arguments, read_table correctly identifies the header and the types of the columns. The call is also much faster than read.table. The read_csv and read_delim functions are very similar to read_table, the difference being the separator: read_csv expects columns separated by "," while read_delim takes the separator through its delim argument, for example delim = "\t" for tab-separated files.
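When the guessed column types are not the ones you want, readr lets you state them through the col_types argument. The sketch below assumes the readr package is installed; it uses temporary files and illustrative object names, and it requests an integer Id column instead of the guessed numeric one.

```r
library(readr)

# A comma-separated and a space-separated copy of the sample data
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Id,Name", "1,Raj", "2,Ravi", "3,Tom"), csv_file)
space_file <- tempfile(fileext = ".txt")
writeLines(c("Id Name", "1 Raj", "2 Ravi", "3 Tom"), space_file)

# Column specification shared by both calls
spec <- cols(Id = col_integer(), Name = col_character())

# read_csv assumes a comma separator
data_csv <- read_csv(csv_file, col_types = spec)

# read_delim takes the separator through the delim argument
data_delim <- read_delim(space_file, delim = " ", col_types = spec)

# Check the class of each column
sapply(data_csv, class)
sapply(data_delim, class)
```

Supplying col_types also stops readr from printing the column specification it would otherwise guess.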
The data.table package
Matt Dowle created the data.table package with the goal of reducing programming and compute time. The package allows fast data manipulation on large datasets: operations such as selection, grouping, chaining and setting keys are extremely fast compared with their data frame equivalents. The data.table class has its own syntax, which differs from that of data.frame, but it inherits from the data.frame class. A data.table object can therefore be passed to any package that accepts only data.frame, and that package can use data.frame syntax on it.
The below code demonstrates how the existing data frame object is converted into a data.table object.
> library("data.table") # load the data.table package in the environment
> DT <- data.table(data) # create a data.table object from a data.frame
To get a summary of the data.table objects in memory, call the following command:
> tables() # list the data.table objects in memory
     NAME NROW NCOL MB COLS    KEY
[1,] DT      3    2  1 Id,Name
Total: 1MB
To view the column types of the data.table object use the following command:
> sapply(DT, class) # apply the class function to each column
         Id        Name
  "integer" "character"
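The selection, grouping, chaining and key operations mentioned earlier can be sketched as follows. The example assumes the data.table package is installed. The scores table and its Score column are made up for illustration and are not part of RandomFile.txt.

```r
library(data.table)

# Illustrative data only: Score is a hypothetical column
scores <- data.table(Id = 1:4,
                     Name = c("Raj", "Ravi", "Tom", "Raj"),
                     Score = c(10, 20, 30, 40))

# Selection: keep the rows where Score is above 15
selected <- scores[Score > 15]

# Grouping: total score per name
by_name <- scores[, .(Total = sum(Score)), by = Name]

# Chaining: group, sort by descending total, then take the first row
top <- scores[, .(Total = sum(Score)), by = Name][order(-Total)][1]

# Setting a key: sorts the table by Id and enables fast keyed lookups
setkey(scores, Id)
third <- scores[.(3L)]

print(selected)
print(by_name)
print(top)
print(third)
```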
The fread function in the data.table package is similar to read.table but much faster and more convenient. It returns an object of class data.table by default, and a data.frame when the data.table argument is set to FALSE. The fread function accepts only regular delimited files, that is, files in which each row has the same number of columns. The following code demonstrates how to use fread to import a text file:
> DT <- fread("Data/RandomFile.txt") # load the file into a data.table object
Calling tables() and sapply(DT, class) on this DT object gives results similar to those shown above.
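The sketch below shows the data.table argument described above, along with explicit column classes. It assumes the data.table package is installed and reads from a temporary file rather than Data/RandomFile.txt; the object names are illustrative.

```r
library(data.table)

# Write the sample data to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c("Id Name", "1 Raj", "2 Ravi", "3 Tom"), tmp)

# Default behaviour: the result is a data.table
dt <- fread(tmp, header = TRUE, sep = " ")

# data.table = FALSE returns a plain data.frame instead
df <- fread(tmp, header = TRUE, sep = " ", data.table = FALSE)

# colClasses overrides the detected column types
dt_typed <- fread(tmp, header = TRUE, sep = " ",
                  colClasses = c(Id = "numeric", Name = "character"))

class(dt)
class(df)
sapply(dt_typed, class)
```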