How does Linear Discriminant Analysis work in R?

Last Updated: 16 Jun 2022


Recipe Objective

Linear Discriminant Analysis (LDA) is a classifier with a linear decision boundary, obtained by fitting class-conditional densities to the data and applying Bayes' rule. It fits a Gaussian density to each class, assuming that all classes share the same covariance matrix; with p > 1 predictors, as here, these are multivariate Gaussians. Besides solving classification problems, LDA can also be used for dimensionality reduction, since it projects the data onto at most K - 1 discriminant directions for K classes.
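To make the description above concrete, here is a minimal sketch of the LDA decision rule computed by hand in base R. It assumes R's built-in iris data frame rather than the recipe's CSV, and the names (pooled, scores, manual_pred) are illustrative only. Each class k gets a linear score x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + log(pi_k), where S is the shared (pooled) covariance matrix, and an observation is assigned to the class with the highest score.

```r
# Manual LDA scores on R's built-in iris data (an assumption, not the recipe's CSV)
X <- as.matrix(iris[, 1:4])
y <- iris$Species
classes <- levels(y)
n <- nrow(X)
K <- length(classes)

# Class priors (observed proportions) and class mean vectors
priors <- sapply(classes, function(k) mean(y == k))
means <- t(sapply(classes, function(k) colMeans(X[y == k, , drop = FALSE])))

# Pooled within-class covariance: the single matrix shared by all classes
pooled <- Reduce(`+`, lapply(classes, function(k) {
  Xk <- X[y == k, , drop = FALSE]
  (nrow(Xk) - 1) * cov(Xk)
})) / (n - K)
pooled_inv <- solve(pooled)

# Linear discriminant score for every observation and class
scores <- sapply(classes, function(k) {
  mu <- means[k, ]
  as.vector(X %*% pooled_inv %*% mu) -
    0.5 * drop(t(mu) %*% pooled_inv %*% mu) +
    log(priors[[k]])
})

# Assign each observation to the class with the largest score
manual_pred <- factor(classes[max.col(scores, ties.method = "first")], levels = classes)
manual_accuracy <- mean(manual_pred == y)
print(manual_accuracy)
```

Because every score is linear in x, the boundary between any two classes (where their scores are equal) is a hyperplane, which is where the "linear" in LDA comes from.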

In this recipe, we will go through how to carry out LDA in R for multi-class classification, using the iris dataset.

STEP 1: Importing Necessary Libraries

library(caret)     # for data splitting and model training
library(tidyverse) # for data manipulation

STEP 2: Read a csv file and explore the data

Data Description: The dataset consists of 50 samples from each of three species of iris flower (setosa, versicolor, virginica).

Independent Variables:

petal_length
petal_width
sepal_length
sepal_width

Dependent Variable:

label (Iris-setosa, Iris-versicolor, Iris-virginica)

data <- read.csv("R_298_Iris.csv")
glimpse(data)

Rows: 150
Columns: 5
$ petal_length  5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4,...
$ petal_width   3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7,...
$ sepal_length  1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5,...
$ sepal_width   0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2,...
$ label         Iris-setosa, Iris-setosa, Iris-setosa, Iris-setosa, Ir...

summary(data) # returns the statistical summary of the data columns

petal_length    petal_width     sepal_length    sepal_width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
             label   
 Iris-setosa    :50  
 Iris-versicolor:50  
 Iris-virginica :50

dim(data)

150 5

# Converting the dependent variable into factor levels
data$label = as.factor(data$label)
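If R_298_Iris.csv is not at hand, a stand-in with the same shape can be sketched from R's built-in iris data frame. This is an assumption for illustration, not the recipe's actual file. The columns below are matched by value to the glimpse() and summary() output above, where the column named petal_length starts at 5.1 and ranges from 4.3 to 7.9; in the built-in data those values belong to Sepal.Length, so the CSV's column names appear to be swapped relative to the usual iris naming.

```r
# Stand-in for R_298_Iris.csv built from the built-in iris data (an assumption).
# Columns are matched by value to the output shown in the recipe, not by name.
data <- data.frame(
  petal_length = iris$Sepal.Length,
  petal_width  = iris$Sepal.Width,
  sepal_length = iris$Petal.Length,
  sepal_width  = iris$Petal.Width,
  label        = as.factor(paste0("Iris-", iris$Species))
)

str(data)          # structure: 150 rows, 5 columns
table(data$label)  # class counts
```

The swapped names do not affect the model, since LDA only sees four numeric predictors, but they matter when interpreting which measurement drives the separation.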

STEP 3: Train Test Split

# Use createDataPartition() from the caret package to split the original dataset
# into training (80%) and testing (20%) sets, stratified on the label column
parts = createDataPartition(data$label, p = .8, list = FALSE)
train = data[parts, ]
test = data[-parts, ]

# Separate the predictors from the outcome (column 5 is label)
X_train = train[, -5]
y_train = train[, 5]
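For a factor outcome, createDataPartition() samples within each class so that the class proportions are preserved in both parts. A rough base-R sketch of that idea is below; it assumes the built-in iris data and is not caret's actual implementation. The names train_idx, train_demo and test_demo are illustrative.

```r
# Stratified 80/20 split by hand on the built-in iris data (illustrative only)
set.seed(50)
labels <- iris$Species

# Sample 80% of the row indices within each class
train_idx <- unlist(
  lapply(split(seq_along(labels), labels), function(idx) {
    sample(idx, size = round(0.8 * length(idx)))
  }),
  use.names = FALSE
)

train_demo <- iris[train_idx, ]
test_demo  <- iris[-train_idx, ]

table(train_demo$Species)  # 40 rows per class
table(test_demo$Species)   # 10 rows per class
```

With 50 rows per class, this gives 120 training rows and 30 test rows, the same sizes that appear in the model summary and confusion matrix in this recipe.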

STEP 4: Building lda model

We will use the caret package to perform this task. First, we use the trainControl() function to define the cross-validation method to be carried out. Then we train the model using the train() function.

Syntax: train(formula, data = , method = , trControl = , tuneGrid = )

where:

formula = y~x1+x2+x3+..., where y is the dependent variable and x1, x2, x3 are the independent variables
data = the data frame
method = the type of model to be built ("lda2" for Linear Discriminant Analysis)
trControl = takes the control parameters. We will use the trainControl() function here to specify the cross-validation technique.
tuneGrid = takes the tuning parameters and applies grid search CV on them
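To show what trainControl(method = "cv", number = 5) asks for, here is a hand-rolled 5-fold cross-validation sketch. It assumes the built-in iris data and fits the model with MASS::lda() directly rather than through caret. The fold assignment here is simple random rather than stratified, so it will not reproduce caret's folds exactly.

```r
library(MASS)  # provides lda()

set.seed(50)
k <- 5

# Randomly assign each row to one of k folds of equal size
folds <- sample(rep(1:k, length.out = nrow(iris)))

# For each fold: fit on the other k - 1 folds, score on the held-out fold
fold_acc <- sapply(1:k, function(i) {
  fit  <- lda(Species ~ ., data = iris[folds != i, ])
  pred <- predict(fit, newdata = iris[folds == i, ])$class
  mean(pred == iris$Species[folds == i])
})

cv_accuracy <- mean(fold_acc)  # average held-out accuracy across folds
print(fold_acc)
print(cv_accuracy)
```

The cross-validated accuracy is the mean of the per-fold accuracies, which is the kind of figure train() reports in its Accuracy column.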

# Specify the CV technique to pass into train() later;
# the number parameter is the "k" in k-fold cross-validation
train_control = trainControl(method = "cv", number = 5)

set.seed(50)

# Train a linear discriminant analysis model (method = "lda2")
model = train(x = X_train, y = y_train, method = "lda2", trControl = train_control)

# Summarise the results
print(model)

Linear Discriminant Analysis 

120 samples
  4 predictor
  3 classes: 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 96, 96, 96, 96, 96 
Resampling results across tuning parameters:

  dimen  Accuracy   Kappa 
  1      0.9750000  0.9625
  2      0.9666667  0.9500

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was dimen = 1.
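The dimen tuning parameter in the output above is the number of discriminant functions used for prediction. With three classes there are at most two. As a hedged sketch of what this parameter controls, the example below uses MASS::lda() on the built-in iris data (an assumption, not the recipe's caret model) and compares resubstitution accuracy when predicting with one or two discriminant functions.

```r
library(MASS)  # provides lda() and its predict() method

fit <- lda(Species ~ ., data = iris)

# Number of discriminant functions available: min(p, K - 1) = 2 here
n_discriminants <- ncol(fit$scaling)

# Accuracy on the fitting data when using the first d discriminant functions
acc_by_dimen <- sapply(1:n_discriminants, function(d) {
  pred <- predict(fit, newdata = iris, dimen = d)$class
  mean(pred == iris$Species)
})
names(acc_by_dimen) <- paste0("dimen_", 1:n_discriminants)

print(n_discriminants)
print(acc_by_dimen)
```

If the first discriminant function already separates the classes well, adding the second changes accuracy very little, which is consistent with the close cross-validated scores for dimen = 1 and dimen = 2 reported above.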

STEP 5: Make predictions

We use our final LDA model to predict the 'label' value on the testing data (unseen data) and then generate a confusion matrix.

# Use the model to make predictions on the test data
pred_y = predict(model, test)

# Confusion matrix
confusionMatrix(data = pred_y, reference = test$label)

Confusion Matrix and Statistics

                 Reference
Prediction        Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              10               0              0
  Iris-versicolor           0              10              0
  Iris-virginica            0               0             10

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.8843, 1)
    No Information Rate : 0.3333     
    P-Value [Acc > NIR] : 4.857e-15  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 1.0000
Specificity                      1.0000                 1.0000
Pos Pred Value                   1.0000                 1.0000
Neg Pred Value                   1.0000                 1.0000
Prevalence                       0.3333                 0.3333
Detection Rate                   0.3333                 0.3333
Detection Prevalence             0.3333                 0.3333
Balanced Accuracy                1.0000                 1.0000
                     Class: Iris-virginica
Sensitivity                         1.0000
Specificity                         1.0000
Pos Pred Value                      1.0000
Neg Pred Value                      1.0000
Prevalence                          0.3333
Detection Rate                      0.3333
Detection Prevalence                0.3333
Balanced Accuracy                   1.0000
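The overall statistics above can be recomputed from the confusion matrix with base R. The sketch below re-enters the 3 x 3 table shown in this recipe by hand. It assumes that the confidence interval and p-value come from an exact binomial test, which reproduces the printed figures; the variable names are illustrative.

```r
classes <- c("Iris-setosa", "Iris-versicolor", "Iris-virginica")

# Confusion matrix from the output above: rows = prediction, columns = reference
cm <- matrix(c(10, 0, 0,
                0, 10, 0,
                0, 0, 10),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n        <- sum(cm)
correct  <- sum(diag(cm))
accuracy <- correct / n

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Per-class sensitivity: correct predictions divided by reference totals
sensitivity <- diag(cm) / colSums(cm)

# No-information rate: share of the largest reference class
nir <- max(colSums(cm)) / n

# Exact binomial 95% CI for accuracy, and one-sided test of accuracy > NIR
ci      <- binom.test(correct, n)$conf.int
p_value <- binom.test(correct, n, p = nir, alternative = "greater")$p.value

print(c(accuracy = accuracy, kappa = cohen_kappa, nir = nir))
print(ci)
print(p_value)
```

With only 30 test rows, even a perfect score leaves a lower confidence bound of about 0.88, so the result is encouraging but not proof of a flawless classifier.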
