# How to perform Chi squared test in R?

This recipe helps you perform a chi-squared test in R.

The chi-square test for independence can be used and interpreted in two different ways:

- Testing hypotheses about the relationship between two variables in a population, or
- Testing hypotheses about differences between proportions for two or more populations.

The data used for this type of test are called observed frequencies: they simply show how many individuals from the sample fall in each cell of the matrix.

Hypothesis testing with chi-square involves the following hypotheses:

- Null Hypothesis: The two variables are not associated.
- Alternative Hypothesis: The two variables are associated or related.

The calculation of the chi-square statistic requires two steps:

- The null hypothesis is used to construct an idealized sample distribution of expected frequencies that describes how the sample would look if the data were in perfect agreement with the null hypothesis.
- A chi-square statistic is computed to measure the amount of discrepancy between the ideal sample (expected frequencies from H0) and the actual sample data (the observed frequencies = fo).
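In symbols, the two steps above are usually written as follows. This is the standard textbook form rather than anything specific to this recipe: $f_o$ is the observed frequency in a cell, while $f_e$ (expected frequency), $n$ (total sample size), and $r$ and $c$ (number of rows and columns in the table) are notation introduced here.

$$
f_e = \frac{(\text{row total}) \times (\text{column total})}{n},
\qquad
\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e},
\qquad
df = (r - 1)(c - 1)
$$

The sum runs over every cell of the table, and the resulting statistic is compared against a chi-square distribution with $df$ degrees of freedom.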

A large discrepancy results in a large value for chi-square. It indicates that the data do not fit the null hypothesis, so the null hypothesis should be rejected.
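To make the two steps concrete, here is a minimal sketch that computes the statistic by hand and cross-checks it against `chisq.test()`. The 2 x 2 table and its labels are made-up numbers used purely for illustration; they are not part of the Cars93 data.

```r
# Hypothetical 2 x 2 table of observed frequencies (made-up numbers)
observed <- matrix(c(10, 20, 30, 40), nrow = 2,
                   dimnames = list(group = c("A", "B"),
                                   outcome = c("yes", "no")))

# Step 1: expected frequencies under the null hypothesis
# (row total * column total / grand total, for every cell)
n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Step 2: chi-square statistic, degrees of freedom and p-value
chi_sq <- sum((observed - expected)^2 / expected)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Cross-check with the built-in test. correct = FALSE switches off the
# Yates continuity correction that R applies to 2 x 2 tables by default.
builtin <- chisq.test(observed, correct = FALSE)

chi_sq
builtin$statistic
```

The hand-computed value and the built-in statistic should agree. For tables larger than 2 x 2, as in the Cars93 example, no continuity correction is applied, so the `correct` argument makes no difference.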

In this recipe, we learn how to perform a chi-square test in R.

We will be using the Cars93 data from the "MASS" library, which contains 27 variables describing 93 car models on sale in the USA in 1993.

We will carry out the chi-square test to find out whether there is a relationship between the "AirBags" and "Type" variables. For that, we will first create a contingency table of the two variables.

```
# loading the MASS library
library("MASS")
# printing a glimpse of cars93 data
str(Cars93)
# creating a table for features
car_data_ = table(Cars93$AirBags, Cars93$Type)
car_data_
```

    'data.frame': 93 obs. of 27 variables:
     $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
     $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
     $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
     $ Min.Price         : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
     $ Price             : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
     $ Max.Price         : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
     $ MPG.city          : int 25 18 20 19 22 22 19 16 19 16 ...
     $ MPG.highway       : int 31 25 26 26 30 31 28 25 27 25 ...
     $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
     $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
     $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
     $ EngineSize        : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
     $ Horsepower        : int 140 200 172 172 208 110 170 180 170 200 ...
     $ RPM               : int 6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
     $ Rev.per.mile      : int 2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
     $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
     $ Fuel.tank.capacity: num 13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
     $ Passengers        : int 5 5 5 6 4 6 6 6 5 6 ...
     $ Length            : int 177 195 180 193 186 189 200 216 198 206 ...
     $ Wheelbase         : int 102 115 102 106 109 105 111 116 108 114 ...
     $ Width             : int 68 71 67 70 69 69 74 78 73 73 ...
     $ Turn.circle       : int 37 38 37 37 39 41 42 45 41 43 ...
     $ Rear.seat.room    : num 26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
     $ Luggage.room      : int 11 15 14 17 13 16 17 21 14 18 ...
     $ Weight            : int 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
     $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
     $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...

                         Compact Large Midsize Small Sporty Van
      Driver & Passenger       2     4       7     0      3   0
      Driver only              9     7      11     5      8   3
      None                     5     0       4    16      3   6
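Before running the test, it can help to look at the row and column totals, since these are what the expected frequencies are built from. The following is a small optional sketch using base R's `addmargins()` and `prop.table()`; it rebuilds the same table so that it runs on its own.

```r
library(MASS)

# Same contingency table as in the recipe
car_data_ <- table(Cars93$AirBags, Cars93$Type)

# Append row and column totals to the table
addmargins(car_data_)

# Share of each airbag category within each car type
# (margin = 2 normalises by column, so each column sums to 1)
round(prop.table(car_data_, margin = 2), 2)
```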

We use the chisq.test(x) function to run the chi-square test between the two variables (AirBags and Type), where x is a contingency table containing their observed frequencies.

```
chisq.test(car_data_)
```

        Pearson's Chi-squared test

    data:  car_data_
    X-squared = 33.001, df = 10, p-value = 0.0002723

Result: The p-value of the chi-square statistic (0.0002723) is lower than 0.05. This means that we reject the null hypothesis and conclude that the two variables, AirBags and Type, are associated with each other.
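One caveat is worth checking before relying on this result. The chi-square approximation is generally considered less reliable when expected frequencies are small (a common rule of thumb is at least 5 per cell), and R may print a warning for this table for that reason. The sketch below inspects the expected frequencies stored in the test result and, as one possible alternative, computes a Monte Carlo p-value with the `simulate.p.value` argument of `chisq.test()`. The seed and the number of replicates `B = 10000` are arbitrary choices made for illustration.

```r
library(MASS)

# Same contingency table as in the recipe
car_data_ <- table(Cars93$AirBags, Cars93$Type)

# Run the test; suppressWarnings() hides the small-expected-count warning
res <- suppressWarnings(chisq.test(car_data_))

# Expected frequencies under the null hypothesis
round(res$expected, 2)

# Number of cells whose expected count is below 5
sum(res$expected < 5)

# Alternative: Monte Carlo p-value, which does not rely on the
# chi-square approximation
set.seed(1)  # arbitrary seed, for reproducibility only
sim <- chisq.test(car_data_, simulate.p.value = TRUE, B = 10000)
sim$p.value
```

If the simulated p-value is also below 0.05, the conclusion that AirBags and Type are associated does not depend on the approximation.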
