R Hadoop – A perfect match for Big Data

Understand how R and Hadoop can be integrated for big data analytics using tools like RHadoop, RHIVE, RHIPE and Hadoop Streaming.

Last Updated: 11 Apr 2024 | BY ProjectPro

When people talk about big data analytics and Hadoop, they usually think of technologies like Pig, Hive, and Impala as the core tools for data analysis. However, if you discuss these tools with data scientists or data analysts, they say that their primary and favourite tool when working with big data sources and Hadoop is the open source statistical modelling language R. The R programming language is the preferred choice among data analysts and data scientists because of its rich ecosystem, which caters to the essential ingredients of a big data project: data preparation, analysis and correlation tasks.

R and Hadoop were not natural friends, but with the advent of packages like RHadoop, RHIVE, and RHIPE, the two seemingly different technologies complement each other for big data analytics and visualization. Hadoop is the go-to big data technology for storing large quantities of data at economical cost, and R is the go-to data science tool for statistical data analysis and visualization. Combined, R and Hadoop make a powerful data crunching tool for serious big data analytics for business.

Big Data Analytics with R and Hadoop

Most Hadoop users often pose this question: “What is the best way to integrate R and Hadoop for big data analytics?” The answer depends on factors such as the size of the dataset, skills, budget, and governance limitations. This post summarizes the various ways to use R and Hadoop together to perform big data analytics with scalability, stability and speed.

Why use R on Hadoop?

Analytical Power of R + Storage and Processing Power of Hadoop = Ideal Solution for Big Data Analytics

R is an amazing data science programming tool for running statistical analyses and models and translating the results into colourful graphics. R is one of the most preferred programming tools for statisticians, data scientists, data analysts and data architects, but it falls short when working with large datasets. One major drawback of R is that all objects are loaded into the main memory of a single machine. Petabyte-scale datasets cannot be loaded into RAM; this is where Hadoop integrated with R is an ideal solution. To work within the in-memory, single-machine limitation of R, data scientists have to limit their analysis to a sample of the large dataset. This limitation is a major hindrance when dealing with big data: because the core R engine is not very scalable, it can process only a limited amount of data.

In contrast, distributed processing frameworks like Hadoop scale to complex operations and tasks on large datasets (in the petabyte range) but do not have strong statistical analysis capabilities. As Hadoop is a popular framework for big data processing, integrating R with Hadoop is the next logical step. Using R on Hadoop provides a highly scalable data analytics platform that can be scaled depending on the size of the dataset. Integrating Hadoop with R lets data scientists run R in parallel on a large dataset, since R's data science libraries cannot, on their own, work on a dataset that is larger than the available memory. Big data analytics with R and Hadoop on a cluster of commodity hardware also competes on cost and value with vertical scaling, that is, with moving to a single, larger machine.

R for Big Data

R finds the following applications in the field of big data:

R can be used for exploratory data analysis. The term was coined by the statistician John Tukey, and R is widely used for this kind of work. Exploratory data analysis is an approach that involves techniques such as identifying and extracting important variables from data, testing underlying assumptions, and drawing insights from datasets. R can perform both simple and complex mathematical calculations and statistical analyses on various data objects.
Data visualization is made simple with R, since it provides several built-in plotting commands that help create simple and complex graphs. The ggplot2 package provides a coherent system for building graphs and lets users add, remove or alter the components of a plot. With its graphics libraries, R makes data visualization and data representation easy and attractive, supporting many forms of graphic representation, from concise charts to interactive graphics. It is said to be one of the most versatile tools for data visualization.
In the finance and banking sectors, R is used for fraud detection. It is also used to help reduce customer churn through analysis of customer data. The results of data analysis performed in R can inform future business decisions.
In bioinformatics, R is used to analyze genetic sequences and identify patterns in genomes. R is also used in drug discovery and finds applications in computational neuroscience.
Analysts in social media companies use R to identify potential customers for targeted online advertising. Developers in social media companies use R for behavior and sentiment analysis, to build recommendation engines and keep customers engaged.
R can handle structured and unstructured data and can be integrated with multiple data storage formats. R provides a variety of database interfaces, including ones for Oracle, Open Database Connectivity (ODBC), and MySQL (RMySQL). There is also an extensive library of tools for data manipulation and wrangling.
R integrates with data processing technologies such as Apache Hadoop and Apache Spark. Spark clusters can be used to remotely process large datasets using R. R and Hadoop work well together: Hadoop’s large-scale, distributed data processing complements R’s statistical computing abilities.

Methods of Integrating R and Hadoop Together

Data analysts or data scientists working with Hadoop might have R packages or R scripts that they use for data processing. To use these with Hadoop, they would need to rewrite them in Java or another language that implements Hadoop MapReduce. This is a burdensome process and can introduce errors. To integrate Hadoop with R, we need software that is already written for R and works with data stored in Hadoop's distributed storage. There are many solutions for performing large computations in R, but these solutions typically require that the data be loaded into memory before it is distributed to the computing nodes, which is not ideal for large datasets. Here are some commonly used methods to integrate Hadoop with R and make the best use of R's analytical capabilities on large datasets:

1) RHadoop – Install R on Workstations and Connect to Data in Hadoop

The most commonly used open source analytics solution for integrating R with Hadoop is RHadoop. Developed by Revolution Analytics, RHadoop lets users directly ingest data from HBase database subsystems and HDFS file systems. RHadoop is the ‘go-to’ solution for using R on Hadoop because of its simplicity and cost advantage. It is a collection of five packages that allow Hadoop users to manage and analyse data using R. RHadoop is compatible with open source Hadoop as well as with popular Hadoop distributions: Cloudera, Hortonworks and MapR.

rhbase – The rhbase package provides database management functions for HBase within R using a Thrift server. This package needs to be installed on the node that will run the R client. Using rhbase, data engineers and data scientists can read, write and modify data stored in HBase tables from within R.
rhdfs – The rhdfs package provides R programmers with connectivity to the Hadoop Distributed File System so that they can read, write or modify the data stored in HDFS.
plyrmr – This package supports data manipulation operations on large datasets managed by Hadoop. plyrmr (plyr for MapReduce) provides data manipulation operations like those in popular packages such as reshape2 and plyr. It depends on Hadoop MapReduce to perform operations but abstracts away most of the MapReduce details.
ravro – This package lets users read and write Avro files from local and HDFS file systems.
rmr2 (Execute R inside Hadoop MapReduce) – Using this package, R programmers can perform statistical analysis on the data stored in a Hadoop cluster. Integrating R with Hadoop through rmr2 can be somewhat tedious, but many R programmers find it much easier than writing Java-based Hadoop mappers and reducers. It also eliminates data movement and helps parallelize computation to handle large datasets.

2) RHIPE – Execute R inside Hadoop MapReduce

RHIPE (“R and Hadoop Integrated Programming Environment”) is an R library that allows users to run Hadoop MapReduce jobs from within R. R programmers just have to write R map and R reduce functions, and the RHIPE library transfers them and invokes the corresponding Hadoop Map and Reduce tasks. RHIPE uses a protocol buffer encoding scheme to transfer the map and reduce inputs. The advantage of RHIPE over other parallel R packages is that it integrates well with Hadoop and distributes data across a cluster of machines using HDFS, which provides fault tolerance and optimizes processor usage.

The RHIPE setup consists of three main parts: a remote computer, one or more R-session Unix servers and a Unix cluster running Hadoop. The Unix server and the Hadoop Unix cluster will be running R and RHIPE. Developers can work on the remote computer and log in to one of the R-session servers. This may be referred to as the home base where all the programming of RHIPE commands is done. R commands that a developer writes for division, analytic methods or recombination meant for the Hadoop cluster get passed along by the RHIPE commands.

The R-session servers can be separate from the Hadoop cluster servers or can be part of the Hadoop cluster. If an R-session server is on the Hadoop cluster, some precautions are needed in the Hadoop configuration to protect the R session programming, so that the RHIPE Hadoop jobs do not end up competing with the R sessions. One such step is to mount a file server on the cluster that contains all the files associated with the R session, including the .RData file and files that are read or written by R. Even with these precautions, it is not possible to fully guarantee that the RHIPE Hadoop jobs will not compete with the R sessions, so the safest bet is to keep the R-session servers separate. RStudio is very commonly used in the R community and can be installed on one of the R-session servers.

Remote computers have to be maintained by the users. The remote computer is essentially just a communication device and can run any operating system. SSH is the standard protocol that remote computers use to communicate with the R-session servers and the Hadoop cluster; it is primarily used for logging in to a remote machine to execute commands and transfer files. In this setup, SSH supports both the R session command-line window, with its input and output, and a separate window showing graphics.

The Hadoop cluster is responsible for carrying out the data analysis. The R commands are given to RHIPE, which passes them along to Hadoop, and Hadoop writes the outputs to HDFS. When the analysis is performed on large and complex data, the outputs of a recombination method often form a relatively small dataset, or the output may need further processing. In some cases, the outputs are small enough to be analyzed on the remote computer. RHIPE supports writing the outputs from HDFS to the R global environment of the R session.

The outputs then become a dataset in the .RData file.

In Hadoop, the two primary computational operations are Map and Reduce. Map performs the analytic method computation by running parallel computations on subsets of the main job without any communication among the subsets. Reduce takes the outputs from the Map computations and runs the recombination computations. Division can be carried out by Map and Reduce and can be a part of the reading of the data into R at the beginning of data analysis. Using Map and Reduce in Hadoop also involves using key-value pairs.

Consider this example:

The R code instructs a Map operation to put a key on each subset output. This results in a key-value pair, where the value is the output. Each output may be associated with a unique key; all outputs may be the same key, or one key can be associated with multiple outputs. When the Reduce operation is to be performed, it will assemble the key-value pairs based on keys, which results in groups. The R recombination code then gets applied to the values of each group independently, so the code runs on the different groups in parallel. In this manner, the recombination method is provided with substantial flexibility.
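The key-based grouping described above can be tried out on a single machine. The following is a minimal sketch in shell and awk rather than R, under stated assumptions: the input format (`<subset-id> <measurement>`), the function names `map_step`, `shuffle_step` and `reduce_step`, and the choice of a mean as the recombination method are all hypothetical and only serve to illustrate the flow of key-value pairs.

```shell
# Map: put a key on each subset output. Input lines are assumed to look
# like "<subset-id> <measurement>"; output is "key<TAB>value".
map_step() {
  awk '{ printf "%s\t%s\n", $1, $2 }'
}

# Shuffle: sorting by key brings all pairs with the same key together,
# which is what forms the groups handed to Reduce.
shuffle_step() {
  LC_ALL=C sort -k1,1
}

# Reduce: apply the recombination (here a mean) to each group independently.
reduce_step() {
  awk -F '\t' '
    NR > 1 && $1 != key { printf "%s\t%.2f\n", key, sum / n; sum = 0; n = 0 }
    { key = $1; sum += $2; n++ }
    END { if (n > 0) printf "%s\t%.2f\n", key, sum / n }
  '
}

# Keys "a" and "b" each have two outputs, key "c" has one.
printf 'a 1\nb 10\na 3\nb 20\nc 5\n' | map_step | shuffle_step | reduce_step
```

Here one key is associated with several outputs (a and b) or with a single output (c), and the reduce step sees each group in isolation, which is why the groups could be processed in parallel on a cluster.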

3) R and Hadoop Streaming

The Hadoop Streaming API allows users to run Hadoop MapReduce jobs with any executable script that reads data from standard input and writes data to standard output as the mapper or reducer. Thus, R scripts can be used in the map or reduce phases. This method of integrating R and Hadoop does not require any client-side integration, because streaming jobs are launched through the Hadoop command line. Submitted MapReduce jobs pass their data through UNIX standard streams, with serialization ensuring Java-compliant input to Hadoop irrespective of the language of the script provided by the programmer.

The following syntax can be used to run MapReduce code written in R for data processing using the Hadoop MapReduce framework.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input InputDirLocation \
-output OutputDirLocation \
-mapper /bin/cat \
-reducer /usr/bin/wc

Where:

InputDirLocation - location of input directory for map function

OutputDirLocation - location of output directory for reduce function

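/bin/cat - the mapper executable; here a placeholder, to be replaced with the R script for the map function

/usr/bin/wc - the reducer executable; here a placeholder, to be replaced with the R script for the reduce function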

Hadoop Streaming works in the following manner:

The mapper and reducer executables are scripts that read their input from stdin line by line and write their output to stdout.
Hadoop Streaming creates a MapReduce job, submits it to the cluster, and monitors its progress until it completes.
Each mapper task launches the R script specified for the mapper as a separate process when the mapper is initialized.
The mapper task converts its input key-value pairs into lines and feeds these lines to the standard input of the process. It then collects the line-oriented output from the standard output of the process and converts each line back into a key-value pair. These key-value pairs form the output of the mapper.
Each reducer task launches the R script specified for the reducer as a separate process when the reducer is initialized.
The reducer task converts its input key-value pairs into lines and feeds these lines to the standard input of the process. As with the mapper, it collects the line-oriented output from the standard output of the process and converts each line into a key-value pair, which forms the output of the reducer.
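Because the mapper and reducer are just executables connected through standard streams, the data flow can be approximated locally before a job is submitted. The sketch below is an assumption-laden illustration, not part of Hadoop: `run_streaming_local` is a hypothetical helper, a plain `sort` stands in for the shuffle, and `cat` and `wc -l` are stand-ins for the mapper and reducer. A real cluster run may differ in details such as how keys and values are delimited.

```shell
# Hypothetical helper: pipe stdin through a mapper command, a sort that
# stands in for the shuffle, and a reducer command. Each command runs as a
# separate process that reads lines on stdin and writes lines to stdout.
run_streaming_local() {
  mapper_cmd=$1
  reducer_cmd=$2
  sh -c "$mapper_cmd" | LC_ALL=C sort | sh -c "$reducer_cmd"
}

# Stand-ins: "cat" as an identity mapper and "wc -l" as a reducer that
# counts the records it receives. For an R job the commands would be
# something like "Rscript mapper.R" and "Rscript reducer.R"
# (both file names are assumptions).
printf 'first record\nsecond record\nthird record\n' |
  run_streaming_local "cat" "wc -l"
# prints the record count, 3 (some wc implementations pad it with spaces)
```

Running the same mapper and reducer commands locally on a small sample is a cheap way to catch script errors before they surface as failed tasks on the cluster.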

4) RHIVE – Install R on Workstations and Connect to Data in Hadoop

If you want to launch Hive queries from the R interface, RHIVE is the go-to package; it includes functions for retrieving metadata such as database names, column names, and table names from Apache Hive. RHIVE brings the rich statistical libraries and algorithms available in R to the data stored in Hadoop by extending HiveQL with R functions. RHIVE functions allow users to apply R statistical learning models to data in a Hadoop cluster that has been catalogued using Apache Hive. The advantage of using RHIVE for Hadoop and R integration is that it parallelizes operations and avoids data movement, because data operations are pushed down into Hadoop.

5) ORCH – Oracle R Connector for Hadoop

ORCH can be used on Oracle Big Data Appliance or on non-Oracle Hadoop clusters. Mappers and reducers are written in R, and MapReduce jobs are executed from the R environment through a high-level interface. With ORCH, R programmers do not have to learn a new programming language like Java or get into the details of the Hadoop environment, such as the cluster hardware or software. ORCH also allows users to test MapReduce programs locally, through the same function call, before they are deployed to the Hadoop cluster.

The number of open source options for performing big data analytics with R and Hadoop keeps expanding, but for simple Hadoop MapReduce jobs, R with Hadoop Streaming still proves to be the best solution. The combination of R and Hadoop is a must-have toolkit for professionals working with big data who need fast, predictive analytics combined with performance, scalability and flexibility.

Most Hadoop users claim that the advantage of using R is its exhaustive list of data science libraries for statistics and data visualization. However, these libraries are non-distributed in nature, which makes data retrieval a time-consuming affair. This is an in-built limitation of R, but if we set it aside, R and Hadoop together can make big data analytics an ecstasy!
