How to split data into train set and test set in R?

This recipe helps you split data into train set and test set in R

Recipe Objective

In supervised learning algorithms such as Linear Regression, Logistic Regression and Decision Trees, it is crucial to split the data into training and testing sets. We first train the model on the observations in the training dataset and then use that model to make predictions on the testing dataset.

The purpose of splitting is to detect and guard against overfitting, i.e. the model fitting minor details or noise that only improve accuracy on the training dataset. In the end, we need a model that performs well on unseen data, so we hold the test data back and use it only at the very end to evaluate the trained model's performance.
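As a rough illustration of why the held-out data matters, the sketch below fits a simple and a very flexible model to the same training rows and reports the error on both sets. Everything in it is an assumption made for illustration: the data are simulated, and the names `sim`, `rmse`, `simple_fit` and `flexible_fit` are hypothetical and not part of the recipe's dataset.

```r
# Hypothetical simulated data, not the recipe's dataset
set.seed(1)
n <- 60
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 4)  # linear signal plus noise
sim <- data.frame(x = x, y = y)

# Hold out 20 of the 60 rows for testing
train_idx <- sample(seq_len(n), size = 40)
train_sim <- sim[train_idx, ]
test_sim <- sim[-train_idx, ]

# Root mean squared error of a fitted model on a given data frame
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

simple_fit <- lm(y ~ x, data = train_sim)             # straight line
flexible_fit <- lm(y ~ poly(x, 10), data = train_sim) # degree-10 polynomial

# Compare error on the rows used for fitting with error on held-out rows
data.frame(
  model = c("linear", "degree-10 polynomial"),
  train_rmse = c(rmse(simple_fit, train_sim), rmse(flexible_fit, train_sim)),
  test_rmse = c(rmse(simple_fit, test_sim), rmse(flexible_fit, test_sim))
)
```

The flexible model always fits the training rows at least as well as the straight line, because the straight line is a special case of it. Only the test column shows whether that extra flexibility carries over to unseen rows.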

In this recipe, you will learn how to split the data into training and testing datasets.

STEP 1: Reading the data and importing required packages

We use the "caTools" package, which provides the function "sample.split()", to split the dataset.

# installing and loading the caTools package
install.packages("caTools")
library(caTools)

# creating a dataframe after reading the data from a csv file
data_1 = read.csv("R_232_Data_1.csv")
head(data_1)

Cost	Weight	Weight1	Length	Height	Width
242	23.2	25.4	30.0	11.5200	4.0200
290	24.0	26.3	31.2	12.4800	4.3056
340	23.9	26.5	31.1	12.3778	4.6961
363	26.3	29.0	33.5	12.7300	4.4555
430	26.5	29.0	34.0	12.4440	5.1340
450	26.8	29.7	34.7	13.6024	4.9274

STEP 2: Splitting the dataset into Train and test data

We use the sample.split() function to create a logical index and then subset the data frame with it.

Syntax: sample.split(Y = , SplitRatio = )

Where:

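  1. Y = target variable (a vector with one value per observation)
  2. SplitRatio = fraction of observations assigned to the training set, i.e. the number of training observations divided by the total number of observations. For example, SplitRatio for a 70%:30% (Train:Test) split is 0.7. The observations are chosen randomly.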
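To make the two arguments concrete, here is a minimal sketch on a made-up data frame. The data frame `toy` and its columns are hypothetical, and `set.seed()` is added only so that the random split is reproducible. `sample.split()` returns a logical vector with one entry per element of `Y`, where `TRUE` marks a training row.

```r
library(caTools)

# Hypothetical toy data: 100 rows with an imbalanced binary target
toy <- data.frame(
  x = seq_len(100),
  y = rep(c("no", "yes"), times = c(60, 40))
)

set.seed(42)  # fix the random seed so the split is reproducible
is_train <- sample.split(Y = toy$y, SplitRatio = 0.7)

class(is_train)  # a logical vector, one entry per row of toy
sum(is_train)    # number of rows marked TRUE (training rows)

# Class counts on each side of the split
table(toy$y[is_train])
table(toy$y[!is_train])
```

With a categorical target like this one, `sample.split()` splits each label separately, so both sets keep roughly the original 60:40 class balance.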

ind = sample.split(Y = data_1$Cost, SplitRatio = 0.7)

# subsetting into Train data
train = data_1[ind,]

# subsetting into Test data
test = data_1[!ind,]

Now, we check the dimensions of the train and test data to verify that the split worked.

dim(train)
dim(test)

111	6
48	6
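The same kind of 70:30 split can also be sketched in base R with `sample()`, without any extra package. This is an alternative technique, not the recipe's method. The data frame `demo` below is a hypothetical stand-in with 159 rows (the 111 + 48 rows reported above), and the names `n_train`, `train_rows`, `train_demo` and `test_demo` are illustrative.

```r
# Hypothetical stand-in for data_1: 159 rows of simulated values
set.seed(123)  # fix the random seed so the split is reproducible
demo <- data.frame(
  Cost = runif(159, min = 100, max = 1000),
  Weight = runif(159, min = 10, max = 60)
)

# Number of training rows for a 70:30 split
n_train <- floor(0.7 * nrow(demo))

# Draw the training row indices at random, without replacement
train_rows <- sample(seq_len(nrow(demo)), size = n_train)

train_demo <- demo[train_rows, ]
test_demo <- demo[-train_rows, ]

# Check the sizes and confirm that no row landed in both sets
dim(train_demo)
dim(test_demo)
length(intersect(rownames(train_demo), rownames(test_demo)))
```

Unlike `sample.split()`, this approach ignores the target variable entirely, so it does not try to preserve class proportions when the target is categorical.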
