How to perform ANOVA in R?

This recipe helps you perform ANOVA in R
Last Updated: 06 May 2021

Recipe Objective

ANOVA, short for ANalysis Of VAriance, tests whether the means of two or more groups differ from one another. It uses an F-test to test the equality of means.

ANOVA compares between-group variability with within-group variability to test whether the population means are significantly different from each other.

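The F-statistic is the ratio of between-group variability to within-group variability. A large F means the group means are spread further apart than the variation within groups would explain by chance, which is evidence against equal means.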
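To make the ratio concrete, here is a minimal sketch that computes F by hand and compares it with the value reported by aov(). It uses R's built-in PlantGrowth dataset purely for illustration (it is not the crop data used later in this recipe), and all variable names are our own.

```r
# Illustration on R's built-in PlantGrowth data (30 plants, 3 groups);
# this is NOT the crop dataset used in the recipe.
data(PlantGrowth)
y <- PlantGrowth$weight
g <- PlantGrowth$group

k <- nlevels(g)          # number of groups
n <- length(y)           # total number of observations

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

# Between-group variability: spread of group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ms_between <- ss_between / (k - 1)

# Within-group variability: spread of observations around their own group mean
ss_within <- sum((y - group_means[as.character(g)])^2)
ms_within <- ss_within / (n - k)

f_by_hand <- ms_between / ms_within

# The same statistic as reported by aov()
f_from_aov <- summary(aov(weight ~ group, data = PlantGrowth))[[1]][["F value"]][1]

print(c(by_hand = f_by_hand, aov = f_from_aov))
```

The two numbers should agree, since aov() computes the same ratio of mean squares.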

Hypothesis testing with ANOVA includes the following:

Null Hypothesis: All group means are equal
Alternate Hypothesis: At least one group mean differs from the others
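
The same pair of hypotheses can be written in symbols. The notation is introduced here for illustration: k is the number of groups and \mu_i is the population mean of group i.

```latex
H_0:\ \mu_1 = \mu_2 = \cdots = \mu_k
\qquad\text{versus}\qquad
H_1:\ \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j
```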

In this recipe, we learn how to perform a one-way ANOVA test in R.

STEP 1: Reading the sample and stating the hypotheses

Example: A study to test the effects of 3 types of fertilizer on crop yield.

Null Hypothesis: Fertilizer type has no effect on mean crop yield
Alternate Hypothesis: At least one fertilizer type gives a different mean crop yield


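# tidyverse provides glimpse() (via dplyr)
library(tidyverse)

# read the first three columns as factors and yield as numeric
sample = read.csv("R_205_crop_sample.csv", colClasses = c("factor", "factor", "factor", "numeric"), header = TRUE)

# inspect the column types and the first few values
glimpse(sample)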

Observations: 96
Variables: 4
$ density    <fct> 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1,...
$ block      <fct> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3,...
$ fertilizer <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ yield      <dbl> 177.2287, 177.5500, 176.4085, 177.7036, 177.1255, 176.77...
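
Before running the test, it often helps to look at group sizes and group means. The CSV file is not included with this text, so the sketch below builds a simulated stand-in (crop_sim, a hypothetical name) with the same column layout as the glimpse() output. The yield values, the assumed group means, and the equal group sizes are made up for illustration.

```r
# Simulated stand-in for the crop data: same columns and size (96 rows),
# but the yield values are random, NOT the real measurements.
set.seed(42)
crop_sim <- data.frame(
  density    = factor(rep(1:2, length.out = 96)),
  block      = factor(rep(1:4, length.out = 96)),
  fertilizer = factor(rep(1:3, each = 32))
)

# Assumed (made-up) mean yield per fertilizer, plus random noise
assumed_means  <- c(176.8, 176.9, 177.4)
crop_sim$yield <- rnorm(96,
                        mean = assumed_means[as.integer(crop_sim$fertilizer)],
                        sd   = 0.6)

# Number of observations per fertilizer group
print(table(crop_sim$fertilizer))

# Mean and standard deviation of yield per fertilizer group
group_summary <- aggregate(yield ~ fertilizer, data = crop_sim,
                           FUN = function(v) c(mean = mean(v), sd = sd(v)))
print(group_summary)
```

With the real data, the same table() and aggregate() calls can be run on the data frame read from the CSV.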

STEP 2: Carrying out ANOVA test

We use the aov() function to fit the model and summary() to print the results.

Syntax: aov(y ~ X1+X2+X3+..., data = )

where:

y = dependent variable
X1,X2,X3 = independent variables
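
As a sketch of how the formula syntax extends from one independent variable to several, the example below uses R's built-in warpbreaks dataset. That dataset is an assumption chosen for illustration and is unrelated to the crop data; breaks is the dependent variable, and wool and tension are factors.

```r
# Built-in dataset: number of warp breaks per loom, by wool type and tension
data(warpbreaks)

# One independent variable: y ~ X1
fit_one <- aov(breaks ~ tension, data = warpbreaks)

# Two independent variables: y ~ X1 + X2 (additive, no interaction term)
fit_two <- aov(breaks ~ wool + tension, data = warpbreaks)

print(summary(fit_one))
print(summary(fit_two))
```

Each term on the right-hand side of the formula gets its own row in the summary() table, followed by a Residuals row.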


anova_one_way = aov(yield ~ fertilizer, data = sample)

summary(anova_one_way)

            Df Sum Sq Mean Sq F value Pr(>F)    
fertilizer   2   6.07  3.0340   7.863  7e-04 ***
Residuals   93  35.89  0.3859                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Result: The Pr(>F) column gives the p-value of the F-statistic. Here it is 7e-04, which is lower than 0.05, so we reject the null hypothesis and conclude that the mean crop yield differs for at least one fertilizer. The test does not say which fertilizers differ from each other.
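
The numbers in the summary() table can be cross-checked by hand. The sketch below copies the mean squares and degrees of freedom from the output above, recomputes F as their ratio, and obtains the p-value from the F distribution with pf(). The 0.05 threshold is the conventional significance level used in this recipe, and small differences from the printed table come from rounding of the mean squares.

```r
# Values copied from the summary() table in this recipe
ms_fertilizer <- 3.0340   # Mean Sq for fertilizer
ms_residuals  <- 0.3859   # Mean Sq for Residuals
df_fertilizer <- 2
df_residuals  <- 93

# F is the ratio of the two mean squares
f_value <- ms_fertilizer / ms_residuals

# The upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_fertilizer, df_residuals, lower.tail = FALSE)

alpha <- 0.05  # conventional significance level
cat("F =", round(f_value, 3), " p =", signif(p_value, 1), "\n")
cat("Reject the null hypothesis:", p_value < alpha, "\n")
```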
