How to perform a chi-squared test in R?

This recipe helps you perform a chi-squared test in R
Last Updated: 19 Aug 2022


Recipe Objective

The chi-square test for independence can be used and interpreted in two different ways:

Testing hypotheses about the relationship between two variables in a population, or
Testing hypotheses about differences between proportions for two or more populations.

The data used for this type of test are called observed frequencies: they simply show how many individuals from the sample fall in each cell of the matrix (the contingency table).

Hypothesis testing with chi-square involves the following hypotheses:

Null Hypothesis: The two variables are not associated
Alternative Hypothesis: The two variables are associated or related

The calculation of the chi-square statistic requires two steps:

The null hypothesis is used to construct an idealized sample distribution of expected frequencies that describes how the sample would look if the data were in perfect agreement with the null hypothesis.
A chi-square statistic is computed to measure the amount of discrepancy between the ideal sample (expected frequencies from H0) and the actual sample data (the observed frequencies = fo).

A large discrepancy results in a large value for chi-square, which indicates that the data do not fit the null hypothesis and that the null hypothesis should be rejected.
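The two steps can be written compactly as the standard Pearson formulas below. The symbol f_e (expected frequency), n (total sample size), and R and C (number of rows and columns in the table) are notation introduced here for illustration; only f_o appears in the text above.

```latex
f_e = \frac{(\text{row total}) \times (\text{column total})}{n}
\qquad
\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}
\qquad
df = (R - 1)(C - 1)
```

The sum runs over every cell of the table, so each cell contributes its squared discrepancy scaled by the frequency expected under H0.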

In this recipe, we learn how to perform a chi-square test in R.

STEP 1: Loading the required dataset and creating a data table

We will use the Cars93 data set from the "MASS" library, which contains information on 93 car models on sale in the USA in 1993.

We will carry out the chi-square test to find out whether there is a relationship between the "AirBags" and "Type" variables. To do that, we first create a contingency table of the two.

# loading the MASS library
library("MASS")

# printing a glimpse of the Cars93 data
str(Cars93)

# creating a contingency table of the two features
car_data_ <- table(Cars93$AirBags, Cars93$Type)
car_data_

'data.frame':	93 obs. of  27 variables:
 $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
 $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
 $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
 $ Min.Price         : num  12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
 $ Price             : num  15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
 $ Max.Price         : num  18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
 $ MPG.city          : int  25 18 20 19 22 22 19 16 19 16 ...
 $ MPG.highway       : int  31 25 26 26 30 31 28 25 27 25 ...
 $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
 $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
 $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
 $ EngineSize        : num  1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
 $ Horsepower        : int  140 200 172 172 208 110 170 180 170 200 ...
 $ RPM               : int  6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
 $ Rev.per.mile      : int  2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
 $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
 $ Fuel.tank.capacity: num  13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
 $ Passengers        : int  5 5 5 6 4 6 6 6 5 6 ...
 $ Length            : int  177 195 180 193 186 189 200 216 198 206 ...
 $ Wheelbase         : int  102 115 102 106 109 105 111 116 108 114 ...
 $ Width             : int  68 71 67 70 69 69 74 78 73 73 ...
 $ Turn.circle       : int  37 38 37 37 39 41 42 45 41 43 ...
 $ Rear.seat.room    : num  26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
 $ Luggage.room      : int  11 15 14 17 13 16 17 21 14 18 ...
 $ Weight            : int  2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
 $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
 $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...
                    
                     Compact Large Midsize Small Sporty Van
  Driver & Passenger       2     4       7     0      3   0
  Driver only              9     7      11     5      8   3
  None                     5     0       4    16      3   6
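Before running the test, it can help to look at the table with its totals and as proportions, since the test can also be read as a comparison of proportions across groups. The following is a minimal sketch of one way to do that; the names car_table, with_totals and col_props are hypothetical and not part of the original recipe.

```r
library(MASS)  # provides the Cars93 data set

# Same contingency table as above: AirBags (rows) by Type (columns)
car_table <- table(Cars93$AirBags, Cars93$Type)

# Add row and column totals (a "Sum" row and a "Sum" column)
with_totals <- addmargins(car_table)
with_totals

# Share of each airbag category within each car type (each column sums to 1)
col_props <- prop.table(car_table, margin = 2)
round(col_props, 2)
```

If the airbag shares look very different from one car type to the next, that is an informal hint of association; the chi-square test then checks it formally.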

STEP 2: Carrying out Chi-square test

We use the chisq.test(x) function to run the chi-square test between the two variables (AirBags and Type), where x is the contingency table containing their frequencies.

chisq.test(car_data_)

Pearson's Chi-squared test

data:  car_data_
X-squared = 33.001, df = 10, p-value = 0.0002723
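To see where these numbers come from, the statistic can be rebuilt by hand from the observed and expected frequencies. This is a sketch for illustration only, not a replacement for chisq.test(); the variable names observed, expected, chi_sq, deg_free and p_value are hypothetical.

```r
library(MASS)  # provides the Cars93 data set

# Observed frequencies: AirBags (rows) by Type (columns)
observed <- table(Cars93$AirBags, Cars93$Type)

# Expected frequencies under H0: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over all cells of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

c(statistic = chi_sq, df = deg_free, p.value = p_value)
```

The three values should agree with the X-squared, df and p-value that chisq.test() reports, because no continuity correction is applied to tables larger than 2 x 2.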

Result: The p-value of the chi-square statistic (0.0002723) is lower than 0.05, so we reject the null hypothesis at the 5% significance level and conclude that the two variables, AirBags and Type, are associated with each other.
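The decision can also be read off the object that chisq.test() returns, and the same object exposes the expected frequencies. Several cells in this table have small expected counts, in which case R may warn that the chi-squared approximation may be incorrect. The sketch below shows one possible cross-check using a simulated p-value; the names car_table, result and simulated, the seed, and the choice of B = 2000 are assumptions made for illustration.

```r
library(MASS)  # provides the Cars93 data set

car_table <- table(Cars93$AirBags, Cars93$Type)

# suppressWarnings() only hides the small-expected-count warning here
result <- suppressWarnings(chisq.test(car_table))

result$p.value < 0.05      # TRUE means H0 is rejected at the 5% level
sum(result$expected < 5)   # number of cells with an expected count below 5

# Optional cross-check: Monte Carlo p-value instead of the approximation
set.seed(1)                # arbitrary seed, for reproducibility only
simulated <- chisq.test(car_table, simulate.p.value = TRUE, B = 2000)
simulated$p.value
```

If the simulated p-value leads to the same decision as the approximate one, the conclusion does not hinge on the small expected counts.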
