How to split data into train set and test set in R?

This recipe helps you split data into train set and test set in R


Recipe Objective

In supervised learning algorithms such as Linear Regression, Logistic Regression and Decision Trees, it is crucial to split the data into training and testing sets. We first train the model on the observations in the training dataset and then use this model to make predictions on the testing dataset.

The purpose of splitting is to guard against overfitting, i.e. a model that pays attention to minor details or noise and therefore only optimizes accuracy on the training dataset. In the end, we need a model that performs well on unseen data, so we set the test data aside and use it only at the very end to evaluate the trained model's performance.
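To make the overfitting point concrete, here is a minimal sketch in base R. The synthetic data, the seed and the two polynomial degrees are assumptions chosen purely for illustration and are not part of this recipe's dataset. It fits a simple and a very flexible model on the same training rows and compares their errors on the training rows and on the held-out rows.

```r
# Synthetic data: a smooth signal plus noise (illustrative assumption)
set.seed(42)
n <- 100
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.5)
df <- data.frame(x = x, y = y)

# Hold out 30% of the rows as a test set using base R sampling
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_demo <- df[train_idx, ]
test_demo  <- df[-train_idx, ]

# Root mean squared error helper
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# A simple model and a very flexible one, both fitted on the training rows only
fit_simple  <- lm(y ~ poly(x, 3), data = train_demo)
fit_complex <- lm(y ~ poly(x, 15), data = train_demo)

# Error on the rows used for fitting versus the rows kept aside
errors <- data.frame(
  model      = c("degree 3", "degree 15"),
  train_rmse = c(rmse(train_demo$y, predict(fit_simple, train_demo)),
                 rmse(train_demo$y, predict(fit_complex, train_demo))),
  test_rmse  = c(rmse(test_demo$y, predict(fit_simple, test_demo)),
                 rmse(test_demo$y, predict(fit_complex, test_demo)))
)
print(errors)
```

The flexible model cannot have a higher training error than the simpler one, because it contains the simpler model as a special case. Whether that advantage carries over to unseen rows is exactly what the test_rmse column is there to reveal.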

In this recipe, you will learn how to split the data into training and testing datasets.

STEP 1: Reading the data and importing required packages

We use the "caTools" package, which provides the "sample.split()" function needed to split the dataset.

# installing and loading caTools package
install.packages("caTools")
library(caTools)

# creating a dataframe after reading the data from a csv file
data_1 = read.csv("R_232_Data_1.csv")
head(data_1)
Cost	Weight	Weight1	Length	Height	Width
242	23.2	25.4	30.0	11.5200	4.0200
290	24.0	26.3	31.2	12.4800	4.3056
340	23.9	26.5	31.1	12.3778	4.6961
363	26.3	29.0	33.5	12.7300	4.4555
430	26.5	29.0	34.0	12.4440	5.1340
450	26.8	29.7	34.7	13.6024	4.9274

STEP 2: Splitting the dataset into Train and test data

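We use the sample.split() function to do so. It returns a logical vector with one element per row (TRUE marks a training row), which we then use to subset the data frame.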

Syntax: sample.split(Y = , SplitRatio = )


  1. Y = target variable
  2. SplitRatio = number of training observations divided by the total number of observations. For example, SplitRatio for a 70%:30% (Train:Test) split is 0.7. The observations are chosen randomly.
ind = sample.split(Y = data_1$Cost, SplitRatio = 0.7)
# subsetting into Train data
train = data_1[ind,]
# subsetting into Test data
test = data_1[!ind,]
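
Because the rows are chosen randomly, every run produces a different split unless the random seed is fixed. The following is a minimal sketch of a repeatable split, assuming caTools is installed. The data frame toy and the seed values are made up for illustration and only stand in for data_1.

```r
library(caTools)

# Hypothetical stand-in for data_1: 100 rows with a numeric target (made-up values)
set.seed(1)
toy <- data.frame(Cost = runif(100, 200, 500),
                  Weight = runif(100, 20, 30))

# Fixing the seed right before the split makes it repeatable
set.seed(123)
ind_a <- sample.split(Y = toy$Cost, SplitRatio = 0.7)
set.seed(123)
ind_b <- sample.split(Y = toy$Cost, SplitRatio = 0.7)

# Same seed, same logical vector
print(identical(ind_a, ind_b))

# Share of rows flagged TRUE (training rows); should be close to SplitRatio
print(mean(ind_a))

# TRUE rows go to the training set, FALSE rows to the test set
train_toy <- toy[ind_a, ]
test_toy  <- toy[!ind_a, ]
```

Calling set.seed() once before sample.split() in the recipe's own code would have the same effect: anyone rerunning the script gets the same train and test rows.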

Now, we check the dimensions (rows, columns) of the train and test data to confirm that the split worked:

dim(train)
dim(test)
111	6
48	6
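
Once the split is verified, the train-then-predict workflow described in the recipe objective can be sketched as follows. This is only an illustration under stated assumptions: caTools is installed, the built-in mtcars dataset stands in for data_1 (the CSV file is not available here), and the linear model formula and seed are arbitrary choices.

```r
library(caTools)

# mtcars ships with R; it stands in for data_1 here (an assumption)
set.seed(2024)
ind_cars <- sample.split(Y = mtcars$mpg, SplitRatio = 0.7)
train_cars <- mtcars[ind_cars, ]
test_cars  <- mtcars[!ind_cars, ]

# Train on the training rows only
model <- lm(mpg ~ wt + hp, data = train_cars)

# Predict for the rows the model has not seen
pred <- predict(model, newdata = test_cars)

# Put actual and predicted values side by side
results <- data.frame(actual = test_cars$mpg, predicted = pred)
print(head(results))

# Root mean squared error on the test set
test_rmse <- sqrt(mean((results$actual - results$predicted)^2))
print(test_rmse)
```

The test rows are touched only in the predict() call and the error calculation, never during fitting, which is what keeps the reported error an honest estimate for unseen data.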
