Is Predictive Modelling easier with R or with Python?

Is Predictive Modelling in Data Science easier with R or with Python? For many data scientists, this is the most confusing question when it comes to choosing one language over the other.

Should I learn R or Python? Is R more accurate than Python? Will I get enough support if I use Python? These are the related questions that haunt a data scientist when selecting tools to build data products.

There is no direct answer to the question; it depends on several factors, e.g., what your objective is and whether you need visualizations.

Python vs R for Predictive Modelling

In the rest of this post, we will touch on the main points that should help you make a better decision when choosing between R and Python for predictive modelling.

Which language, R or Python, has the stronger community?

First, we will look at the help you can expect if you get stuck somewhere. It is well known that the R community is well established and actively develops, improves and answers questions about ‘Predictive Modelling’ or any other statistical technique.

We did a small exercise: we searched for the following two strings in Google:

  1. Linear regression in R
  2. Linear regression in Python

Linear Regression in R

Linear Regression in Python

We can clearly see that, for ‘Linear Regression’, a technique used for predictive modelling, the Python community has contributed only 1.5% of what the R community has contributed.



Which language is built for statistics?

R was built primarily to help statisticians and data scientists run complex statistical algorithms, while Python evolved as a general-purpose programming language.

In practice, when it comes to Predictive Analytics or Machine Learning, both languages have good packages available.

Any analytics project related to Predictive Analytics is done in two phases:

  1. Model Building
  2. Real time prediction

As R was built for data scientists and statisticians, it beats Python in the first phase. The growth of production systems, however, went hand in hand with the evolution of Python, so Python integrates easily with production code written in other languages such as Java or C++.
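The two phases can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the training data is made up, the helper names fit_simple_ols and predict are hypothetical, and JSON stands in for whatever format a real production system would use to receive the fitted model.

```python
import json


def fit_simple_ols(xs, ys):
    """Phase 1 (model building): closed-form least squares for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return {"intercept": intercept, "slope": slope}


def predict(model, x):
    """Phase 2 (real-time prediction): score one new observation."""
    return model["intercept"] + model["slope"] * x


# Hypothetical training data, made up for illustration (y = 1 + 2x exactly).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

model = fit_simple_ols(train_x, train_y)

# Hand the fitted model over to the production side as plain JSON.
payload = json.dumps(model)
loaded_model = json.loads(payload)

prediction = predict(loaded_model, 5.0)
print(prediction)
```

Keeping the fitted model as plain data is one way to let the prediction step live in a different process, or even a different language, from the model-building step.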

Is R able to handle Big Data sets?

When R was developed, the concept of Big Data had not matured to the level it is at today. Data scientists and statisticians were able to handle their data and run Predictive Analytics using R, which stores data in the computer’s RAM.

This is one of the major drawbacks of R: it performs only in-memory computations. R has evolved over time, however. Server versions of R are now available, so you can install R on a server and run your machine learning algorithms or any other statistical analysis there.

Apart from the option of server installation, both R and Python can connect to Hadoop HDFS and do parallel computing.

What about visualization?

You might be wondering why we have covered everything from support to complexity to production but haven’t commented on a basic ingredient of data science, i.e. Data Visualization.

Data Visualization is indeed the first step, needed even before running the first iteration of your model. There are many cases where graphs tell a story better than a machine learning algorithm.

As of today, Python cannot compete with R when it comes to data visualization. ggplot2 is the tool you will most often find behind statistical data visualizations in R.

What if I want to examine my model thoroughly?

If you value “Model Interpretability” over “Accuracy of prediction” alone, then Python will likely disappoint you.

R assumes that your objective is “Statistical Learning” and tries to make it easier for you to understand and diagnose the predictive model you have built.

Scikit-learn is the most widely used Python package for machine learning. It helps you tune your model or switch between different models, but it is hard to diagnose your model with Scikit-learn.

Hey! You forgot the learning curve!

Most people find it difficult to code in R; the general opinion is that Python code is easier to read because it looks more or less like English. Hence, the learning curve of R is generally considered steeper than that of Python.

Python is easier to pick up for people with a programming background in other languages like Java, Fortran, C++ etc.

To summarize the topics discussed above:

Python vs R for Data Science

Let’s look at an example of Predictive Analytics in both languages, Python and R.

If you have reached this part of the article, we have a small surprise for you. We’ll use a linear regression example to understand the differences between the two languages when it comes to the actual work of coding.

Before we go there, let me ask you a question. What is the most commonly used dataset when it comes to explaining statistics using R? The winner is the “iris” dataset, which ships with the R installation.

The iris dataset comprises the following variables:

  • Sepal.Length
  • Sepal.Width
  • Petal.Length
  • Petal.Width
  • Species

Find below a sample of the dataset:

Iris Dataset Sample

As you might be aware, linear regression is used to estimate a continuous dependent variable using a set of independent variables.

In this example, let’s assume that we need to estimate “Petal.Width” using the remaining three numeric variables. Basically, we are looking to establish a relationship of the following form:

Petal.Width = intercept + B1*Sepal.Length + B2*Sepal.Width + B3*Petal.Length
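To make the shape of this equation concrete, here is a minimal sketch that recovers an intercept and three coefficients with NumPy’s least-squares solver. The measurements and the “true” coefficients are made up for illustration; they are not the iris data and not the results of the model built later in this post.

```python
import numpy as np

# Hypothetical measurements, made up for illustration (not the iris data).
# Columns: Sepal.Length, Sepal.Width, Petal.Length
X = np.array([
    [5.0, 3.4, 1.5],
    [4.8, 3.0, 1.4],
    [6.9, 3.1, 4.9],
    [6.1, 2.8, 4.7],
    [6.5, 3.0, 5.8],
    [7.2, 3.6, 6.1],
])

# Assumed coefficients, used only to generate a noise-free target.
true_intercept = -0.2
true_betas = np.array([-0.2, 0.2, 0.5])
y = true_intercept + X @ true_betas

# Prepend a column of ones so the first fitted coefficient is the intercept.
design = np.column_stack([np.ones(len(X)), X])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)

intercept, b1, b2, b3 = coefficients
print(intercept, b1, b2, b3)
```

Because the target was generated without noise, the fitted values should match the assumed coefficients up to floating-point error; on real data they would only be estimates.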

Let’s get started:

Step 1: Get your environment ready

Before building any Predictive Model using R, Python or any other language for that matter, we have to get our tools ready.

For a carpenter, the tools might be a chisel, a hammer etc., but for a Data Scientist the tools are statistical packages, plotting packages etc. Let’s see how both languages handle this.

R Language

For Linear Regression, Logistic Regression and some of the other basic algorithms, R comes pre-loaded with the required packages. But if you need to install a new package for your analysis:

install.packages("package_name")

library(package_name)

That’s it. Now you can directly use the functions defined within the package.

Python Language

If you want to build a predictive model using Python, you will have to import packages for almost everything you want to do. For our example, i.e. running Predictive Analytics using OLS (ordinary least squares), we need the following packages:

import matplotlib.pyplot as plt

import numpy as np

import pandas as pd

from sklearn import datasets, linear_model

Step 2: Reading Data into your environment

Assuming that you have the data in *.csv format on your local system, we now have to load the data into R and Python. Importing data works almost the same way in both languages.

R Language

R has a very good built-in function, “read.csv”, which can be used to import datasets into the R environment.

model_data <- read.csv("file.path/filename.csv")

Python Language

In Python we need to use the “pandas” library to read the file.

model_data = pd.read_csv('file.path/filename.csv')
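For readers who want to try the pandas call without a file on disk, here is a small self-contained sketch. It assumes a CSV laid out like the iris data and inlines a few illustrative rows as text; the variable name csv_text is ours.

```python
import io

import pandas as pd

# A few illustrative rows in the layout of the iris data, inlined as text
# so that the example does not depend on a file on disk.
csv_text = """Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv accepts a file path or any file-like object.
model_data = pd.read_csv(io.StringIO(csv_text))

print(model_data.shape)
print(model_data.dtypes)
```

With a real file, the only change would be passing the path string instead of the StringIO object.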

Step 3: Let’s slice the data

Before starting any modelling exercise or any Data Science task, we should first look into the data: What does the data look like? How are my variables distributed? Are there any missing values?

R Language

The summary function of R is pretty handy for getting a first glance at what your data is made of. It also helps us answer the questions raised above.

summary(dataset_name)  # gives the summary of the data directly

Let’s see how it works on our “iris” data.

Iris Dataset Summary in R Language

The above summary tells us a lot, e.g., the “iris” dataset comprises 5 variables; Species is a categorical variable; there are no missing values in the data, etc.

Python Language

Like R, Python has a function to get the summary statistics for each variable.


Working with Iris Dataset in Python Language

So what did you observe (apart from the font beauty of Python)?

You can see that Python doesn’t give a summary for categorical or qualitative variables. By default, the pandas “describe” function works only on the numerical columns.
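If you do want the categorical column covered, pandas can be asked for it explicitly. The sketch below assumes a small illustrative data frame in the iris layout (the rows are for demonstration only) and compares the default describe() with describe(include="all"), followed by two other quick checks.

```python
import pandas as pd

# Illustrative rows in the layout of the iris data.
iris_sample = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 7.0, 6.3],
    "Sepal.Width": [3.5, 3.0, 3.2, 3.3],
    "Petal.Length": [1.4, 1.4, 4.7, 6.0],
    "Petal.Width": [0.2, 0.2, 1.4, 2.5],
    "Species": ["setosa", "setosa", "versicolor", "virginica"],
})

# Default behaviour: numeric columns only.
numeric_summary = iris_sample.describe()

# include="all" adds count / unique / top / freq for non-numeric columns.
full_summary = iris_sample.describe(include="all")

# Frequency table for the categorical column, similar in spirit to what
# R's summary() prints for a factor.
species_counts = iris_sample["Species"].value_counts()

# Number of missing values per column.
missing_per_column = iris_sample.isna().sum()

print(full_summary)
print(species_counts)
print(missing_per_column)
```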

Step 4: Visualize Data

The next, and very important, task is to see what the relationship is between your dependent and independent variables. Both R and Python have pretty good functions for understanding these relationships.

Over time, statisticians across the world have developed very useful packages dedicated to identifying the relationships between variables.
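As a numeric companion to the plots, a correlation matrix gives a quick first read on these relationships. The following is a rough sketch with pandas on a few illustrative rows in the iris layout; it is not a substitute for actually plotting the data.

```python
import pandas as pd

# Illustrative rows in the layout of the iris data.
iris_sample = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Sepal.Width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "Petal.Length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "Petal.Width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "Species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Keep only the numeric columns; Species is categorical.
numeric_columns = iris_sample.select_dtypes(include="number")

# Pairwise Pearson correlations between the numeric variables.
correlations = numeric_columns.corr()

# Correlation of each candidate predictor with the dependent variable.
with_petal_width = correlations["Petal.Width"].drop("Petal.Width")

print(correlations.round(2))
print(with_petal_width.sort_values(ascending=False))
```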

Step 5: Let’s Build our Model finally!

We have reached the stage where we’ll build our linear regression model in both languages and interpret the results.

R Language

R comes preloaded with the basic needs of Data Science, e.g., Linear Regression and Logistic Regression. We’ll be using the pre-loaded function lm() to run our linear regression model:

fit <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length, data = iris)

That’s it: you have successfully built your first Predictive Model using R.

Let’s see what it looks like. To see what got built, use the summary() function on “fit”:


Summary Function in R Language

The summary gives us a detailed look at the different variables, their beta coefficients, significance levels etc.

Python Language

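In order to build our model in Python, we’ll be using the statsmodels package. Here iris is assumed to be a pandas DataFrame with the same column names as in R. Because those names contain dots, they are wrapped in Q() so that the formula parser treats them as column names.

import statsmodels.formula.api as smf

lm = smf.ols(formula='Q("Petal.Width") ~ Q("Sepal.Length") + Q("Sepal.Width") + Q("Petal.Length")', data=iris).fit()

The method of building a predictive model in Python is very similar to R, without many changes.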

Summary Function in Python Language
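To get a feel for what such summary tables report, the core quantities (coefficients, standard errors, t-values and R-squared) can be computed by hand with NumPy. This is a sketch on synthetic data generated under assumed coefficients; it is not the iris model, and it does not reproduce the full output of either summary function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 observations, 3 predictors, assumed coefficients
# plus a little Gaussian noise. All values are made up for illustration.
n_obs = 50
X = rng.normal(size=(n_obs, 3))
y = 1.0 + X @ np.array([0.5, -0.3, 2.0]) + rng.normal(scale=0.1, size=n_obs)

# Design matrix with an intercept column.
design = np.column_stack([np.ones(n_obs), X])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)

# Residual variance, using n - p degrees of freedom.
residuals = y - design @ coefficients
dof = n_obs - design.shape[1]
sigma_squared = residuals @ residuals / dof

# Standard errors come from the diagonal of sigma^2 * (X'X)^-1.
covariance = sigma_squared * np.linalg.inv(design.T @ design)
standard_errors = np.sqrt(np.diag(covariance))
t_values = coefficients / standard_errors

# R-squared: share of the variance in y explained by the model.
ss_total = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - (residuals @ residuals) / ss_total

print(coefficients)
print(standard_errors)
print(t_values)
print(r_squared)
```

A large absolute t-value relative to its standard error is what the summary tables flag as a significant coefficient.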

Final Takeaway:

To summarize our post, we can say that

  • The R community for predictive modelling is much stronger than the Python community
  • R was built specifically for statistics and Data Science
  • Python can easily be integrated with other languages
  • There is no clear difference between the two languages that answers the question, “Which language is easier for Predictive Modelling?”

Remember, if you know

  • which Predictive Model you are going to build,
  • what your objective is,
  • and you have a good command of Maths,

then no language is easier than the other! Rather, a language is just a tool to assist you in your Data Science journey.





