How to handle dummy variables in R?

This recipe helps you handle dummy variables in R

Recipe Objective

In data science, most machine learning algorithms expect every input variable to be numeric. If the data contains non-numeric variables, we need to convert them before building a model.

In this recipe, we will learn how to handle string categorical variables by converting them into dummy variables.

A categorical variable takes distinct string values (categories) to which observations are assigned. These values carry no numeric meaning, so a model cannot use them directly. Hence, we convert them into dummy variables, which is similar to the one-hot encoding technique used in Python. For a categorical variable with n unique categories, dummy encoding creates (n-1) columns of 0s and 1s, where "1" indicates that the observation belongs to that category. The omitted category acts as the baseline and is identified by 0 in every dummy column.
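To make the (n-1) rule concrete, here is a minimal sketch using only base R. The `toy` data frame and its `colour` column are hypothetical and not part of the recipe's dataset. It relies on `model.matrix()`, which uses treatment contrasts by default and therefore treats the first factor level as the baseline.

```r
# Hypothetical toy data: one categorical column with n = 3 categories
toy <- data.frame(
  id = 1:4,
  colour = factor(c("red", "green", "blue", "green"),
                  levels = c("blue", "green", "red"))
)

# model.matrix() builds an intercept plus one 0/1 column per non-baseline level.
# Dropping the intercept column leaves the (n-1) dummy columns.
dummies <- model.matrix(~ colour, data = toy)[, -1, drop = FALSE]

# Combine the id column with the dummy columns
encoded <- cbind(toy["id"], dummies)
print(encoded)
```

The baseline level ("blue" here) gets no column of its own: a row with 0 in every dummy column belongs to it.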

Step 1: Loading the required library and dataset

We load the fastDummies and knitr packages, plus tidyverse for data manipulation.

# installing required packages
install.packages(c("fastDummies","knitr"))
library(fastDummies)
library(knitr)

# Data manipulation package
library(tidyverse)

# reading a dataset
customer_seg = read.csv('R_223_Mall_Customers.csv')
glimpse(customer_seg)
Observations: 200
Variables: 5
$ CustomerID              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ Gender                  Male, Male, Female, Female, Female, Female, ...
$ Age                     19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
$ Annual.Income..k..      15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
$ Spending.Score..1.100.  39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...

Step 2: Creating dummy variable

We create dummy variables for the "Gender" variable using the dummy_cols() function of the fastDummies package.

Syntax: fastDummies::dummy_cols(x, select_columns = ) ​

where: ​

  1. x = dataframe
  2. select_columns = the column (categorical variable) for which you want to create dummy variables.
# creating dummy variables
df_dummies = fastDummies::dummy_cols(customer_seg, select_columns = "Gender")

# dropping the original Gender column (2) and the Gender_Female column (6) to get (n-1) columns, similar to OneHotEncoding
new_customer_seg = df_dummies[c(-2,-6)]
glimpse(new_customer_seg)
Rows: 200
Columns: 5
$ CustomerID              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ Age                     19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
$ Annual.Income..k..      15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
$ Spending.Score..1.100.  39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...
$ Gender_Male             1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1,...

Note: In the dummy variable created (Gender_Male), 1 = Male and 0 = Female.
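Dropping columns by position works, but it breaks if the column order changes. As an alternative sketch, dummy_cols() also accepts the remove_first_dummy and remove_selected_columns arguments, which produce the (n-1) layout directly. The `toy_customers` data frame below is a hypothetical stand-in with made-up values, not the Mall Customers file. Gender is stored as a factor with explicit levels so that the dropped "first" dummy is unambiguous.

```r
library(fastDummies)

# Hypothetical stand-in for the customer data (values are made up)
toy_customers <- data.frame(
  CustomerID = 1:4,
  Gender = factor(c("Male", "Female", "Female", "Male"),
                  levels = c("Female", "Male")),
  Age = c(19, 21, 20, 23)
)

encoded <- dummy_cols(
  toy_customers,
  select_columns = "Gender",
  remove_first_dummy = TRUE,       # drop the dummy for the first level (Female)
  remove_selected_columns = TRUE   # drop the original Gender column
)

print(names(encoded))
```

With a character column instead of a factor, which category counts as "first" depends on how the package orders the values, so it is worth checking the resulting column names before modelling.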
