Importing Data from Flat Files in R

Setting up the R Environment

R is a language and software environment for statistical computing. Data imported into R usually arrives as some variation of a spreadsheet-like text file, which is also the easiest form of data to import. When starting a new session in R, it is a good idea to remove unused objects that may have been carried over from the previous R session. This command lists all the objects in the current environment:

> ls()    # this will display the objects in the current environment.

To remove specific unused objects, here Object1 and Object2, use the following command:

> rm(list = c('Object1', 'Object2'))    # remove specific objects

To remove all objects from memory, use the following command:

> rm(list = ls())    # remove all objects

The reason to follow this routine is that R keeps objects in memory, and large files take up a lot of it. Before importing data from files, it is better to free the memory held by unused objects. In your working directory, create a folder named Data, and in this folder create a file named RandomFile.txt with the following content. The column values are separated by whitespace.

Id   Name
1   Raj
2   Ravi
3   Tom
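
The folder and file can also be created from within R rather than by hand. The following is a minimal sketch, assuming the current working directory is the project folder; it uses only base R functions and reproduces the four lines shown above.

```r
# Create the Data folder in the working directory if it is not there yet.
dir.create("Data", showWarnings = FALSE)

# Write the four lines of RandomFile.txt; columns are separated by spaces.
writeLines(c("Id   Name", "1   Raj", "2   Ravi", "3   Tom"),
           file.path("Data", "RandomFile.txt"))

# Confirm that the file exists and show its raw contents.
file.exists("Data/RandomFile.txt")
readLines("Data/RandomFile.txt")
```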

Before importing this file into a data frame object, verify the current working directory of the R environment. To view the current working directory, run:

> getwd()    # get current working directory

If this working directory differs from the project folder, set the working directory with the following command:

> setwd('')    # set current working directory; put the project folder path between the quotes

Importing Data from File

The principal function for reading data into R is read.table, which reads a file and creates a data frame from it. Convenience functions such as read.csv and read.delim call read.table with default arguments suited to CSV and tab-delimited files. For a small file, you can call read.table with only the file argument; R uses the default values for the remaining arguments and loads the data frame.

> data <- read.table('Data/RandomFile.txt')    # create data frame object

To print the data frame created by reading RandomFile.txt, use:

> print(data)    # display the data frame details

On executing the above code we get the following result:

  V1   V2
1 Id Name
2  1  Raj
3  2 Ravi
4  3  Tom

Two issues are apparent in this data frame:

The first line has not been treated as a header
The columns have been given the default class, factor, instead of numeric for the first column and character for the second

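To check the class of each column, run the code below. The output shown is from an R version before 4.0.0; from R 4.0.0 onward, stringsAsFactors defaults to FALSE and both calls report "character" instead.

> class(data[, 1])    # get the class of the first column
[1] "factor"

> class(data[, 2])    # get the class of the second column
[1] "factor"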


To rectify these issues we need to inspect the arguments of the read.table function. To see the description, arguments and usage of read.table, type the following command in the R environment:

> ?read.table    # display the description and usage of the function

The most important arguments of read.table are generally:

file, the name of the file
header, TRUE if the file has a header line and FALSE otherwise
sep, a string defining how the fields in each line are separated
colClasses, a character vector that indicates the class of each column
nrows, a numeric value for the maximum number of rows to read
skip, a numeric value for the number of lines to skip from the beginning
stringsAsFactors, TRUE if character columns are to be converted to factors and FALSE otherwise
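
The skip and nrows arguments from the list above are not used elsewhere in this tutorial, so here is a small sketch of them. It assumes a hypothetical variant of the sample file with one extra title line at the top, supplied inline through the text argument of read.table so that no file is needed.

```r
# Hypothetical input: one title line, then the header, then three data rows.
lines <- c("Sample export", "Id   Name", "1   Raj", "2   Ravi", "3   Tom")

first_two <- read.table(text = lines,
                        skip = 1,                  # ignore the title line
                        header = TRUE,             # next line holds the column names
                        nrows = 2,                 # read only the first two data rows
                        stringsAsFactors = FALSE)  # keep Name as character

print(first_two)
```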

Based on the above description, the modified call to read.table that correctly sets the header and the classes of the columns is:

> data <- read.table('Data/RandomFile.txt'
                  , header = TRUE
                  , colClasses = c("numeric", "character"))

We can now verify that the proper arguments have rectified the issues with the following commands:

> class(data[, 1])    # get the class of the first column
[1] "numeric"

> class(data[, 2])    # get the class of the second column
[1] "character"
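
An alternative to spelling out colClasses is to let read.table infer the types and only switch off the factor conversion. The sketch below assumes the same four lines as RandomFile.txt, passed inline through the text argument. Note that with inference the Id column comes back as integer rather than numeric.

```r
# The same four lines as RandomFile.txt, supplied inline through `text`.
lines <- c("Id   Name", "1   Raj", "2   Ravi", "3   Tom")

# header = TRUE uses the first line for column names;
# stringsAsFactors = FALSE keeps text columns as character vectors.
df <- read.table(text = lines, header = TRUE, stringsAsFactors = FALSE)

str(df)             # compact summary of column names and types
sapply(df, class)   # class of every column at once
```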


The read.csv function behaves like read.table except that its defaults suit CSV files: the default separator is a comma (sep = ",") and header defaults to TRUE. Similarly, read.delim uses a tab as its default separator (sep = "\t"). To try read.csv, create an additional text file, RandomFile1.txt, in which the column values are separated by commas. RandomFile1.txt has the following content:

Id,Name
1,Raj
2,Ravi
3,Tom

The read.csv function can load this file into a data frame with a call very similar to the read.table one:

> data <- read.csv('Data/RandomFile1.txt'
                  , header = TRUE
                  , colClasses = c("numeric", "character"))
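
To make the relationship between the three functions concrete, the sketch below reads the same records with read.csv, read.delim and read.table. The data is supplied inline through the text argument; the tab-separated version is an assumed variant of the sample file, not one created earlier in this tutorial.

```r
csv_lines <- c("Id,Name", "1,Raj", "2,Ravi", "3,Tom")
tsv_lines <- c("Id\tName", "1\tRaj", "2\tRavi", "3\tTom")

# read.csv: comma separator, header = TRUE by default.
from_csv <- read.csv(text = csv_lines,
                     colClasses = c("numeric", "character"))

# read.delim: tab separator, header = TRUE by default.
from_tsv <- read.delim(text = tsv_lines,
                       colClasses = c("numeric", "character"))

# read.table needs the separator and the header spelled out to match.
from_table <- read.table(text = csv_lines, sep = ",", header = TRUE,
                         colClasses = c("numeric", "character"))

# Compare the three results.
all.equal(from_csv, from_tsv)
all.equal(from_csv, from_table)
```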


Performance Issues with the read.table function in the Utils Package:

Since R loads the whole data frame into memory, a few settings help improve performance:

Remove comment lines from the text file and set comment.char = "" in the function call
Set colClasses to the expected type of each column
Set the nrows argument
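
The three tips above can be combined in a single call. The following is a sketch only: it writes a small temporary copy of the sample data so that it runs on its own, and it counts lines with readLines, which is an assumption made for brevity and would itself be costly on a very large file.

```r
# Write a small sample file so the example runs on its own.
path <- tempfile(fileext = ".txt")
writeLines(c("Id   Name", "1   Raj", "2   Ravi", "3   Tom"), path)

# Count the lines once so that nrows can be supplied up front.
n_lines <- length(readLines(path))

data <- read.table(path,
                   header = TRUE,
                   colClasses = c("numeric", "character"),  # no type guessing
                   nrows = n_lines - 1,                     # data rows, excluding the header
                   comment.char = "")                       # no scanning for comments

str(data)
```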

As a rough rule, R needs up to about twice the size of the file in memory to load the data frame. For example, if RandomFile.txt has a file size of 100 MB, one can expect memory utilisation of between 80 MB and 200 MB, depending on the column types. The read.table, read.csv and read.delim functions are designed to create data frames, whose columns may have different classes, so they are not the right tool for reading large matrices. These functions use a surprisingly large amount of memory when reading large files.
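
One way to check the memory footprint on your own data is to compare the file size on disk with the size of the loaded object. The sketch below uses a generated data set (the column names and row count are arbitrary assumptions) and the base R functions file.size and object.size. It measures only the finished data frame, not the temporary memory used while reading.

```r
# Generate a sample file with 100,000 rows in a temporary location.
path <- tempfile(fileext = ".csv")
sample_df <- data.frame(Id = 1:100000, Value = runif(100000))
write.csv(sample_df, path, row.names = FALSE)

loaded <- read.csv(path)

file.size(path)                             # bytes on disk
object.size(loaded)                         # bytes held by the data frame
format(object.size(loaded), units = "MB")   # the same figure in megabytes
```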

The readr package:

Hadley Wickham and the RStudio team created the readr package, which provides replacements for the read.table family of functions in R. The readr package offers additional functionality and greater speed than the equivalent functions in the utils package. In the tests reported for it, readr was 10 to 30 times faster than the utils functions. It also shows a helpful progress bar indicating the percentage complete. The code to call the read_table function from the readr package is as follows (read_csv is called the same way):

> library("readr")    # load the readr package in the environment
> data <- read_table('Data/RandomFile.txt')    # create data frame object (a tibble)
> class(data[[1]])    # get the class of the first column
[1] "numeric"

> class(data[[2]])    # get the class of the second column
[1] "character"

As the code above shows, read_table correctly identified the header and the column types without any additional arguments. In addition, the call is much faster than read.table. The read_csv and read_delim functions are very similar to read_table, the difference being the separator: read_csv expects columns separated by ",", while read_delim takes the separator through its delim argument, for example delim = "\t" for tab-separated files.
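
When the guessed types are not the ones you want, readr lets you state them through the col_types argument. The sketch below assumes the readr package is installed and writes a temporary comma-separated copy of the sample data; it asks for an integer Id column instead of the default double.

```r
library(readr)

# Temporary CSV copy of the sample data so the example runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Id,Name", "1,Raj", "2,Ravi", "3,Tom"), csv_path)

# Declare the column types explicitly instead of relying on guessing.
data <- read_csv(csv_path,
                 col_types = cols(Id = col_integer(),
                                  Name = col_character()))

class(data$Id)     # class of the Id column
class(data$Name)   # class of the Name column
```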


The data.table package

Matt Dowle created the data.table package with the goal of reducing programming and compute time. The data.table package allows fast data manipulation on large datasets. It offers a number of operations, such as selection, grouping, chaining and setting keys, which are extremely fast compared to their data frame equivalents. The syntax of the data.table class differs from that of the data.frame class, but data.table inherits from data.frame. As a result, a data.table object can be passed to any package that only accepts a data.frame, and that package can use data.frame syntax on the data.table object.
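
The operations named above look roughly as follows. This is an illustrative sketch, assuming the data.table package is installed; the four-row table is made-up sample data, with the name Raj repeated so that grouping has something to count.

```r
library(data.table)

# Made-up sample data for illustration.
DT <- data.table(Id = c(1, 2, 3, 4),
                 Name = c("Raj", "Ravi", "Tom", "Raj"))

DT[Id > 1]                 # selection: rows where Id exceeds 1
DT[, .N, by = Name]        # grouping: number of rows per Name
DT[Id > 1][order(Name)]    # chaining: filter, then sort the result
setkey(DT, Name)           # set a key; DT is sorted by Name in place
DT["Raj"]                  # keyed lookup on the Name column

is.data.frame(DT)          # data.table inherits from data.frame
nrow(as.data.frame(DT))    # conversion back to a plain data.frame
```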

The below code demonstrates how the existing data frame object is converted into a data.table object.

> library("data.table")    # load the data.table package in the environment
> DT <- data.table(data)    # create a data.table object from a data.frame

To get a summary on the data.table object call the following command:

> tables()    # list the data.table objects in memory
     NAME NROW NCOL MB COLS    KEY
[1,] DT      3    2  1 Id,Name
Total: 1MB

To view the column types of the data.table object, use the following command:

> sapply(DT, class)    # apply the class function to each column
         Id        Name
  "integer" "character"

The fread function in the data.table package is similar to read.table but much faster and more convenient. It returns an object of class data.table by default, and a data.frame when the data.table argument is set to FALSE. The fread function accepts only regular delimited files, that is, files in which each row has the same number of columns. The following code demonstrates how to use fread to import a text file.

> DT <- fread("Data/RandomFile.txt")    # read the file into a data.table object

Calling the tables and sapply functions on this DT object produces output similar to that shown above.
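
The effect of the data.table argument can be seen by reading the same file twice. The sketch below assumes the data.table package is installed and uses a temporary copy of the sample data with single spaces between the columns.

```r
library(data.table)

# Temporary copy of the sample data, single-space separated.
path <- tempfile(fileext = ".txt")
writeLines(c("Id Name", "1 Raj", "2 Ravi", "3 Tom"), path)

DT <- fread(path)                       # returns a data.table
DF <- fread(path, data.table = FALSE)   # returns a plain data.frame

class(DT)           # classes of the default result
class(DF)           # classes when data.table = FALSE
sapply(DT, class)   # column types detected by fread
```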
