Is predictive modelling in data science easier with R or with Python? This is one of the most confusing questions data scientists face when choosing between the two.
Should I learn R or Python? Is R more accurate than Python? Will I get enough support if I use Python? These are related questions that haunt a data scientist when selecting tools to build data products.
There is no direct answer to the question; it depends on several factors, such as your objective and whether you need visualizations.
In the rest of the post, we will touch on most of the points that will help you make a better decision when choosing between R and Python for predictive modelling.
First, we will look at the help you can get if you are stuck somewhere. The R community is well established and active in developing, improving and answering questions about ‘Predictive Modelling’ or any other statistical technique.
We did a small exercise - we searched the following two strings in Google:
Judging by the search results, the Python community has contributed only about 1.5% of what the R community has contributed for ‘Linear Regression’, which is a technique used for predictive modelling.
R was primarily built to help data scientists run complex data science algorithms, while Python evolved as a general-purpose programming language.
In practice, both languages have good packages for predictive analytics and machine learning.
Any analytics project related to Predictive Analytics is done in two phases:
Because R was built for data scientists and statisticians, it beats Python in the first phase. Python, however, evolved alongside modern production systems, so it integrates easily with production code written in other languages such as Java or C++.
When R was developed, the concept of Big Data had not matured to the level it is at today. Data scientists and statisticians were able to handle their data and run predictive analytics using R, which stores data in the computer’s RAM.
One of R’s major drawbacks is that it performs only in-memory computation. R has evolved over time, though: there are now server versions that let you install R on a server and run your machine learning algorithms or any other statistical analysis there.
Apart from the option of server installation, both R and Python can connect to Hadoop HDFS and do parallel computing.
You might be wondering why we have covered everything from support to complexity to production but have not yet commented on a basic ingredient of data science: data visualization.
Data visualization is indeed the first step, needed even before running the first iteration of your model. There are various examples where graphs tell a story better than a machine learning algorithm.
As of today, Python cannot compete with R when it comes to data visualization. R’s ggplot2 is the best tool you will find for statistical data visualization.
If you value “model interpretability” over plain “accuracy of prediction”, Python is likely to disappoint you.
R assumes that your objective is “statistical learning” and tries to make it easier for you to understand and diagnose the predictive model you have built.
Scikit-learn is the most widely used Python package for machine learning. It helps you tune your model or switch between different models, but diagnosing a model with it is hard.
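To make the point concrete, here is a minimal sketch of what a fitted scikit-learn linear regression gives you. It assumes the copy of iris bundled with scikit-learn and the same target used later in this post (petal width); the variable names are ours, not from the original article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

# Load the iris measurements bundled with scikit-learn (no download needed).
iris = load_iris()
X = iris.data[:, :3]  # sepal length, sepal width, petal length
y = iris.data[:, 3]   # petal width

model = LinearRegression().fit(X, y)

# What the fitted estimator reports directly:
intercept = model.intercept_   # fitted intercept
coefficients = model.coef_     # one slope per predictor
r_squared = model.score(X, y)  # R^2 on the training data

print("intercept:", intercept)
print("coefficients:", coefficients)
print("R^2:", r_squared)

# Standard errors, t-statistics and p-values are not part of this estimator;
# they have to be computed separately or obtained from another package.
```

You get coefficients and a fit score, which is enough for prediction, but not the diagnostic table that R prints for the same model.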
Most people find it difficult to code in R; the general opinion is that Python code is easier to read because it looks more or less like English. Hence, R’s learning curve is generally considered steeper than Python’s.
Python is easier to adopt for people with a programming background in other languages such as Java, Fortran or C++.
To summarize the topics discussed above:
If you have reached this part of the article, we have a small surprise for you. We’ll use a linear regression example to understand the differences between the two languages when it comes to the actual work of coding.
Before we go there, let me ask you a question. What is the most commonly used dataset for explaining statistics with R? The winner is the “iris” dataset, which ships with the R installation.
The iris dataset comprises the following variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species.
Find below a sample of the dataset:
As you might be aware, linear regression is used to estimate a continuous dependent variable from a set of independent variables.
In this example, let’s assume that we need to estimate “Petal.Width” using the remaining 3 numeric variables. Basically, we are looking to establish a relationship of the following form:
Petal.Width = intercept + B1*Sepal.Length + B2*Sepal.Width + B3*Petal.Length
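As a rough illustration of what “establishing the relationship” means, the sketch below estimates the intercept and B1 to B3 by ordinary least squares with NumPy, then plugs one flower back into the equation. It assumes the copy of iris bundled with scikit-learn; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

# Columns: sepal length, sepal width, petal length, petal width.
data = load_iris().data
sepal_length, sepal_width, petal_length, petal_width = data.T

# Design matrix: a column of ones for the intercept, then the three predictors.
design = np.column_stack(
    [np.ones(len(data)), sepal_length, sepal_width, petal_length]
)

# Least-squares solution for [intercept, B1, B2, B3].
beta, *_ = np.linalg.lstsq(design, petal_width, rcond=None)
intercept, b1, b2, b3 = beta

# Apply the fitted equation to the first flower in the dataset.
first = data[0]
predicted = intercept + b1 * first[0] + b2 * first[1] + b3 * first[2]

print("intercept, B1, B2, B3:", beta)
print("predicted vs. actual Petal.Width:", predicted, first[3])
```

The model-building functions used later in the post do this estimation for you and add the diagnostics on top.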
Let’s get started:
Before building any predictive model in R, Python or any other language, we have to get our tools ready.
For a carpenter the tools might be a chisel, a hammer and so on; for a data scientist they are statistical packages, plotting packages and the like. Let’s see how both languages handle them.
For linear regression, logistic regression and other basic algorithms, R comes with the necessary packages pre-loaded. But if you need to install a new package for your analysis:
That’s it. Now you can directly use functions defined within the package.
If you want to build a predictive model using Python, you will have to import packages for almost everything you want to do. For our example, i.e. running predictive analytics with OLS, we need the following packages:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
Assuming that you have the data in *.csv format on your local system, we now have to load it into R and Python. Importing data is almost the same in both languages.
R has a handy pre-loaded function, “read.csv”, which can be used to import datasets into the R environment.
model_data <- read.csv("file.path/filename.csv")
In Python we need to use the “pandas” library to read the file.
model_data = pd.read_csv('file.path/filename.csv')
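For a version you can run without having a file at hand, the sketch below first writes a tiny CSV to a temporary folder and then reads it back with pandas. The three rows and the file name are stand-ins for your own data, laid out like iris.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for "filename.csv": a few rows in the iris layout.
csv_text = (
    "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
    "6.3,3.3,6.0,2.5,virginica\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "filename.csv")
    with open(csv_path, "w") as handle:
        handle.write(csv_text)
    # read_csv infers column names from the header row and types from the values.
    model_data = pd.read_csv(csv_path)

print(model_data.shape)
print(model_data.dtypes)
```

Numeric columns come back as floats and the Species column as text, which matters for the summary step that follows.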
Before starting any modelling exercise or data science task, we should first look at the data. What does the data look like? How are the variables distributed? Are there any missing values?
R’s summary function is handy for getting a first look at what your data is made of, and it helps answer the questions raised above.
summary(dataset_name)  # gives a summary of the data directly
Let’s see how it works on our “iris” data:
The above summary tells us a lot, e.g. the “iris” dataset comprises 5 variables, Species is a categorical variable, and there are no missing values in the data.
Like R, Python has a function to get summary statistics for each variable.
So what did you observe (apart from the font beauty of Python)?
You can see that Python doesn’t give a summary for categorical or qualitative variables. By default, the pandas “describe” function works only on numeric columns.
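The default can be changed. The sketch below uses a few illustrative rows in the iris layout (not the full dataset) to show the default numeric-only summary, the `include="all"` option that also covers categorical columns, and a quick missing-value count.

```python
import pandas as pd

# A handful of illustrative rows in the iris layout, not the full dataset.
model_data = pd.DataFrame(
    {
        "Sepal.Length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
        "Sepal.Width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
        "Petal.Length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
        "Petal.Width": [0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
        "Species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
    }
)

# Default behaviour: only the numeric columns are summarized.
numeric_summary = model_data.describe()

# include="all" adds count / unique / top / freq for non-numeric columns.
full_summary = model_data.describe(include="all")

# Per-category counts, close to what R's summary() shows for a factor.
species_counts = model_data["Species"].value_counts()

# Number of missing values in each column.
missing_per_column = model_data.isna().sum()

print(numeric_summary)
print(full_summary)
print(species_counts)
print(missing_per_column)
```

Together these calls cover roughly the same ground as R’s summary(), at the cost of a few extra lines.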
The next, very important task is to see what the relationship is between your dependent and independent variables. Both R and Python have good functions for understanding these relationships.
Over time, statisticians across the world have developed packages dedicated to identifying relationships between variables, and these are very useful.
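One simple way to look at these relationships in Python, offered here as an assumption rather than as the tool the original post used, is a correlation matrix from pandas. The sketch relies on the copy of iris bundled with scikit-learn and applies R-style column names by hand.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Build a DataFrame with R-style column names from scikit-learn's iris copy.
columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]
iris = pd.DataFrame(load_iris().data, columns=columns)

# Pairwise Pearson correlations between all numeric variables.
correlations = iris.corr()

# Rank the predictors by their correlation with the dependent variable.
with_target = (
    correlations["Petal.Width"]
    .drop("Petal.Width")
    .sort_values(ascending=False)
)

print(correlations.round(2))
print(with_target.round(2))
```

Petal.Length comes out as the predictor most strongly correlated with Petal.Width, which is worth keeping in mind when reading the regression coefficients later.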
We have reached the stage where we’ll build our linear regression model in both languages and interpret the results.
R comes preloaded with the basic needs of data science, e.g. linear regression and logistic regression. We’ll use the pre-loaded function lm() to run our linear regression model.
That’s it: you have successfully built your first predictive model using R.
Let’s see what it looks like:
To see what got built, use the summary() function on “fit”.
The summary gives us a detailed look at the different variables, their beta coefficients, significance levels etc.
To build our model in Python we’ll use the statsmodels package.
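import statsmodels.formula.api as smf
# Q("...") quotes column names containing dots, which the formula parser would otherwise read as attribute access
lm = smf.ols(formula='Q("Petal.Width") ~ Q("Sepal.Length") + Q("Sepal.Width") + Q("Petal.Length")', data=iris).fit()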
The method for building your predictive model in Python is very similar to R’s, without many changes.
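Here is a self-contained sketch of the Python side, assuming the copy of iris bundled with scikit-learn. One caveat: the statsmodels formula parser evaluates names as Python expressions, so a column called Petal.Width is read as attribute access. This sketch sidesteps that by renaming the columns with underscores, which is our choice rather than something the original post does.

```python
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.datasets import load_iris

# Underscores instead of dots so the names are valid in a formula.
columns = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]
iris = pd.DataFrame(load_iris().data, columns=columns)

# Same model as the R example: petal width explained by the other three.
fit = smf.ols(
    formula="Petal_Width ~ Sepal_Length + Sepal_Width + Petal_Length",
    data=iris,
).fit()

# R-style report: coefficients, standard errors, t-values, p-values, R^2.
print(fit.summary())

# The same quantities are available programmatically.
coefficients = fit.params
p_values = fit.pvalues
r_squared = fit.rsquared
```

Unlike scikit-learn, statsmodels reports significance levels alongside the coefficients, so its output is the closest Python counterpart to R’s summary(fit).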
If we want to summarize our post, we can say that
Remember, if you know –