Importing Data from the Web
R is a versatile platform for importing data from the web, whether as a downloadable file on a webpage or as a table in an HTML document. Consider a scenario in which a website continually updates a dataset that matters to you. Instead of downloading and saving that file as a .csv every time, you can run a single command and get the latest version on your local system. The example below uses data from the National Centre of Environmental Information, available at this link (http://www.ospo.noaa.gov/data/land/bbep2/biomass_burning.txt). The data contains blended polar geo biomass burning emissions records. In R, the read.csv function can import this .txt file directly from the internet; its argument is the URL of the data.
> file <- "http://www.ospo.noaa.gov/data/land/bbep2/biomass_burning.txt?filename=LocalFile.txt.gz&dir=C:/R/Data"
> Biomass_Burning_Data <- read.csv(file, header = TRUE)
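Because read.csv accepts either a URL or a local path, the same call can be tried offline before pointing it at a live source. The sketch below is illustrative only: the sample rows, the column names, and the helper name import_delimited are made up, and a temporary file stands in for the remote .txt file. It passes sep explicitly, since a .txt file is not guaranteed to be comma-separated.

```r
# Hypothetical helper: read a delimited text file from a URL or a local path.
import_delimited <- function(src, sep = ",") {
  read.csv(src, header = TRUE, sep = sep, stringsAsFactors = FALSE)
}

# A temporary file with made-up rows stands in for the remote file.
sample_file <- tempfile(fileext = ".txt")
writeLines(c("region,year,emissions",
             "North,2015,12.5",
             "South,2015,8.1"),
           sample_file)

sample_data <- import_delimited(sample_file)
str(sample_data)  # check column names and types before any analysis
```

Replacing sample_file with a URL should behave the same way, provided the remote file really uses the separator passed in sep.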
This was an example of how to import data from a .txt file on the internet into R. Sometimes, however, the data sits in HTML tables on a website. If you wish to download those tables and analyse them, R can read through the HTML document and import the tables you want. This method of importing data from the web is called web scraping.
The example below extracts the data on Oil Production Output by Countries (http://www.globalfirepower.com/oil-production-by-country.asp). We use the XML library to import the data into R; the which = 2 argument asks readHTMLTable for the second table in the document. Data extracted from HTML tables must be cleansed (redundant columns and stray characters removed) before it can be used. If the data is to be imported from a local file, replace the URL with the filename.
> library(XML)
> url <- "http://www.globalfirepower.com/oil-production-by-country.asp"
> oil_production_data <- readHTMLTable(url, which = 2)
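The cleansing step mentioned above might look like the following minimal sketch in base R. The data frame here is made up to imitate a scraped table (the column names Rank, Country, Output, and Flag are assumptions, not the actual columns of the page), so the exact patterns will need adjusting for real data.

```r
# Made-up rows imitating a scraped table; every column arrives as text.
scraped <- data.frame(
  Rank    = c("1", "2", "3"),
  Country = c(" Alpha ", "Beta", "Gamma "),
  Output  = c("10,500,000 bbl", "9,200,000 bbl", "N/A"),
  Flag    = c("", "", ""),
  stringsAsFactors = FALSE
)

# Drop columns that are empty in every row (here, Flag).
is_empty <- vapply(scraped, function(col) all(col == ""), logical(1))
cleaned <- scraped[, !is_empty, drop = FALSE]

# Trim stray whitespace around the country names.
cleaned$Country <- trimws(cleaned$Country)

# Keep only digits and decimal points, then convert to numeric.
digits_only <- gsub("[^0-9.]", "", cleaned$Output)
digits_only[digits_only == ""] <- NA  # entries with no digits become NA
cleaned$Output <- as.numeric(digits_only)
cleaned$Rank <- as.integer(cleaned$Rank)

str(cleaned)
```

Converting to numeric only after stripping separators and unit labels avoids the coercion warnings that as.numeric would otherwise raise on strings such as "10,500,000 bbl".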
If the data to be imported is XML content, use the function xmlToDataFrame() with the URL of the data as its argument. The example below uses the open source data available at ARCGIS: a vegetation map for the islands of the Commonwealth of the Northern Mariana Islands.
> url <- "http://opendata.arcgis.com/datasets/aade6a582a1641078cda28eab3fda344"
> vegetation_data <- xmlToDataFrame(url)
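To see what xmlToDataFrame() expects, it can help to try it on a small inline document first. The sketch below assumes the XML package is installed; the element names and values are invented for illustration and are not taken from the ARCGIS dataset. The function is intended for shallow, record-like XML in which each child of the root is one row.

```r
library(XML)

# Made-up XML: each <record> becomes one row, each child element one column.
xml_text <- paste0(
  "<records>",
  "<record><island>A</island><cover>Forest</cover><hectares>120.5</hectares></record>",
  "<record><island>B</island><cover>Grassland</cover><hectares>42</hectares></record>",
  "</records>"
)

doc <- xmlParse(xml_text, asText = TRUE)
sample_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Values are read as text, so numeric columns need explicit conversion.
sample_df$hectares <- as.numeric(sample_df$hectares)
str(sample_df)
```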
If you wish to import the same data in JSON format instead, you will have to use the rjson package in R. We will use the same data source for this example.
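> library(rjson)
> url <- "http://opendata.arcgis.com/datasets/aade6a582a1641078cda28eab3fda344?outFormat=json"
> raw_data_json <- scan(url, "", sep = "\n")
> # scan() returns one string per line; join them into a single string before parsing
> vegetation_data <- fromJSON(paste(raw_data_json, collapse = ""))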
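fromJSON() returns nested R lists rather than a data frame, so a conversion step is usually needed before analysis. The following is a minimal sketch assuming the rjson package is installed and that the JSON is a flat array of records; the field names and values are invented, and the real ARCGIS response may be nested differently.

```r
library(rjson)

# Made-up JSON: an array of flat records with identical fields.
json_text <- '[{"island": "A", "hectares": 120.5}, {"island": "B", "hectares": 42}]'

records <- fromJSON(json_text)  # a list with one named list per record

# Turn each record into a one-row data frame, then stack the rows.
rows <- lapply(records, function(rec) as.data.frame(rec, stringsAsFactors = FALSE))
json_df <- do.call(rbind, rows)

str(json_df)
```

This row-by-row approach is simple but slow for large inputs, and it only works when every record has the same fields.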