How to split data into train set and test set in R?

This recipe helps you split data into train set and test set in R
Last Updated: 22 Dec 2022


Recipe Objective

In supervised learning algorithms such as Linear Regression, Logistic Regression and Decision Trees, it is crucial to split the data into training and testing sets. We first train the model on the observations in the training dataset and then use the trained model to make predictions on the testing dataset.

The purpose of splitting is to guard against overfitting, i.e. a model that pays attention to minor details or noise in the training data and therefore only optimizes accuracy on the training dataset. In the end, we need a model that performs well on unseen data, so we hold back the test data and use it only at the very end to evaluate the trained model's performance.
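To make the overfitting point concrete, here is a minimal sketch in base R. It is not part of the original recipe: the synthetic data, the helper rmse(), and all variable names are assumptions chosen for illustration. It fits a simple and a very flexible model on the same training rows and reports the error on both the training rows and the held-out rows.

```r
# Illustrative sketch only: synthetic data, hypothetical names.
set.seed(1)                          # make the random numbers reproducible
x <- runif(60, 0, 10)
y <- 2 * x + rnorm(60, sd = 3)       # true relationship is linear plus noise
dat <- data.frame(x = x, y = y)

# Hold back 20 of the 60 rows as unseen test data
train_rows <- sample(nrow(dat), 40)
train_d <- dat[train_rows, ]
test_d <- dat[-train_rows, ]

# Root mean squared error of a fitted model on a given data frame
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

simple_fit <- lm(y ~ x, data = train_d)              # straight line
flexible_fit <- lm(y ~ poly(x, 10), data = train_d)  # degree-10 polynomial

# Compare the error on seen (train) and unseen (test) rows for each model
c(train = rmse(simple_fit, train_d), test = rmse(simple_fit, test_d))
c(train = rmse(flexible_fit, train_d), test = rmse(flexible_fit, test_d))
```

The flexible model can never have a higher training error than the straight line, because it contains the straight line as a special case. Its error on the held-out rows is usually the more honest indication of how it will behave on new data.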

In this recipe, you will learn how to split the data into training and testing datasets.

STEP 1: Reading the data and importing required packages

We use the "caTools" package, which provides the function "sample.split()" that we need to split the dataset.

# installing and loading the caTools package
install.packages("caTools")
library(caTools)

# creating a dataframe after reading the data from a csv file
data_1 = read.csv("R_232_Data_1.csv")
head(data_1)

Cost	Weight	Weight1	Length	Height	Width
242	23.2	25.4	30.0	11.5200	4.0200
290	24.0	26.3	31.2	12.4800	4.3056
340	23.9	26.5	31.1	12.3778	4.6961
363	26.3	29.0	33.5	12.7300	4.4555
430	26.5	29.0	34.0	12.4440	5.1340
450	26.8	29.7	34.7	13.6024	4.9274

STEP 2: Splitting the dataset into Train and test data

We use the sample.split() function to create a logical index and then subset the data frame with that index.

Syntax: sample.split(Y = , SplitRatio = )

Where:

Y = target variable
SplitRatio = number of training observations divided by the total number of observations. For example, the SplitRatio for a 70%:30% (Train:Test) split is 0.7. The observations are chosen randomly.
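The idea behind a SplitRatio of 0.7 can be sketched in base R without any package. This is an illustration of the concept, not the caTools implementation (sample.split() additionally tries to preserve the relative ratios of the labels in Y). The built-in mtcars dataset and the names n_train, ind, train_demo and test_demo are assumptions used only for this example.

```r
# Base R sketch of a random 70:30 split (not the caTools implementation)
set.seed(42)                        # fix the seed so the same rows are chosen each run
n <- nrow(mtcars)                   # mtcars is a built-in dataset with 32 rows
n_train <- floor(0.7 * n)           # number of training rows

ind <- rep(FALSE, n)                # logical index, one entry per row
ind[sample(n, n_train)] <- TRUE     # mark randomly chosen rows as training rows

train_demo <- mtcars[ind, ]         # rows where ind is TRUE
test_demo <- mtcars[!ind, ]         # all remaining rows

dim(train_demo)
dim(test_demo)
```

Because the rows are drawn at random, calling set.seed() with a fixed value before the split is what makes the result repeatable from one run to the next.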

ind = sample.split(Y = data_1$Cost, SplitRatio = 0.7)

# subsetting into Train data
train = data_1[ind,]

# subsetting into Test data
test = data_1[!ind,]

Now, we check the dimensions of the train and test data to confirm that the split worked.

dim(train)
dim(test)

111	6
48	6
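As a quick sanity check, the row counts in this output can be compared with the requested ratio. The short sketch below only redoes that arithmetic; the variable names are assumptions introduced for illustration.

```r
# Row counts reported by dim(train) and dim(test)
n_train <- 111
n_test <- 48

n_total <- n_train + n_test         # 159 rows in the full dataset
train_share <- n_train / n_total    # fraction of rows that ended up in train
round(train_share, 3)
```

The training share comes out at about 0.698, which is as close to the requested 0.7 as a whole number of rows out of 159 allows.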



