How to perform Chi squared test in R?

This recipe helps you perform Chi squared test in R

Recipe Objective

The chi-square test for independence, can be used and interpreted in two different ways: ​

  1. Testing hypotheses about the relationship between two variables in a population, or
  2. Testing hypotheses about differences between proportions for two or more populations.

The data used for carrying out this type of test is called observed frequencies, simply show how many individuals from the sample are in each cell of the matrix. ​

Sentiment Analysis Project on eCommerce Product Reviews with Source Code

Hypothesis testing with Chi-square includes the following ​

  1. Null Hypothesis: The two variables are not associated
  2. Alternate Hypothesis: The two variables are associated or related

The calculation of the chi-square statistic requires two steps: ​

  1. The null hypothesis is used to construct an idealized sample distribution of expected frequencies that describes how the sample would look if the data were in perfect agreement with the null hypothesis.
  2. A chi-square statistic is computed to measure the amount of discrepancy between the ideal sample (expected frequencies from H0) and the actual sample data (the observed frequencies = fo).

A large discrepancy results in a large value for chi-square and indicates that the data do not fit the null hypothesis and the hypothesis should be rejected. ​

In this recipe, we learn how to perform Chi-square test in R. ​

STEP 1: Loading the required dataset and creating a data table

We will be using Cars93 data in the "MASS" library which contains all the information regarding the sales of different models of car in the year 1993. ​

We will carry out the chi-square test too find a relationship between "AirBags" and "Type" Variable. For that we will first create table of the same. ​

# loading the MASS library library("MASS") # printing a glimpse of cars93 data str(Cars93) # creating a table for features car_data_ = table(Cars93$AirBags, Cars93$Type) car_data_

'data.frame':	93 obs. of  27 variables:
 $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
 $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
 $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
 $ Min.Price         : num  12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
 $ Price             : num  15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
 $ Max.Price         : num  18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
 $ MPG.city          : int  25 18 20 19 22 22 19 16 19 16 ...
 $ MPG.highway       : int  31 25 26 26 30 31 28 25 27 25 ...
 $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
 $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
 $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
 $ EngineSize        : num  1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
 $ Horsepower        : int  140 200 172 172 208 110 170 180 170 200 ...
 $ RPM               : int  6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
 $ Rev.per.mile      : int  2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
 $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
 $ Fuel.tank.capacity: num  13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
 $ Passengers        : int  5 5 5 6 4 6 6 6 5 6 ...
 $ Length            : int  177 195 180 193 186 189 200 216 198 206 ...
 $ Wheelbase         : int  102 115 102 106 109 105 111 116 108 114 ...
 $ Width             : int  68 71 67 70 69 69 74 78 73 73 ...
 $ Turn.circle       : int  37 38 37 37 39 41 42 45 41 43 ...
 $ Rear.seat.room    : num  26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
 $ Luggage.room      : int  11 15 14 17 13 16 17 21 14 18 ...
 $ Weight            : int  2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
 $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
 $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...
                    
                     Compact Large Midsize Small Sporty Van
  Driver & Passenger       2     4       7     0      3   0
  Driver only              9     7      11     5      8   3
  None                     5     0       4    16      3   6

STEP 2: Carrying out Chi-square test

We use chisq.test(x) function to run the chi-square test between two variables (AirBags and Type) where x is a data table containing frequencies of the same

chisq.test(car_data_)

Pearson's Chi-squared test

data:  car_data_
X-squared = 33.001, df = 10, p-value = 0.0002723

Result: After checking the p-value of the chi-square statistic, we see that it's lower than 0.05. This means that we reject the null hypothesis i.e. The two group variables are associated with each other.

What Users are saying..

profile image

Ameeruddin Mohammed

ETL (Abintio) developer at IBM
linkedin profile url

I come from a background in Marketing and Analytics and when I developed an interest in Machine Learning algorithms, I did multiple in-class courses from reputed institutions though I got good... Read More

Relevant Projects

PyTorch Project to Build a LSTM Text Classification Model
In this PyTorch Project you will learn how to build an LSTM Text Classification model for Classifying the Reviews of an App .

MLOps Project to Build Search Relevancy Algorithm with SBERT
In this MLOps SBERT project you will learn to build and deploy an accurate and scalable search algorithm on AWS using SBERT and ANNOY to enhance search relevancy in news articles.

Learn to Build a Siamese Neural Network for Image Similarity
In this Deep Learning Project, you will learn how to build a siamese neural network with Keras and Tensorflow for Image Similarity.

Credit Card Default Prediction using Machine learning techniques
In this data science project, you will predict borrowers chance of defaulting on credit loans by building a credit score prediction model.

AWS MLOps Project for ARCH and GARCH Time Series Models
Build and deploy ARCH and GARCH time series forecasting models in Python on AWS .

Learn to Build an End-to-End Machine Learning Pipeline - Part 1
In this Machine Learning Project, you will learn how to build an end-to-end machine learning pipeline for predicting truck delays, addressing a major challenge in the logistics industry.

Build Portfolio Optimization Machine Learning Models in R
Machine Learning Project for Financial Risk Modelling and Portfolio Optimization with R- Build a machine learning model in R to develop a strategy for building a portfolio for maximized returns.

Predict Churn for a Telecom company using Logistic Regression
Machine Learning Project in R- Predict the customer churn of telecom sector and find out the key drivers that lead to churn. Learn how the logistic regression model using R can be used to identify the customer churn in telecom dataset.

Create Your First Chatbot with RASA NLU Model and Python
Learn the basic aspects of chatbot development and open source conversational AI RASA to create a simple AI powered chatbot on your own.

Mastering A/B Testing: A Practical Guide for Production
In this A/B Testing for Machine Learning Project, you will gain hands-on experience in conducting A/B tests, analyzing statistical significance, and understanding the challenges of building a solution for A/B testing in a production environment.