How to convert categorical variables into numerical variables in Python?

This recipe helps you convert categorical variables into numerical variables in Python


Recipe Objective

Machine learning models cannot work directly on categorical variables stored as strings, so we need to convert them into numerical form. We could assign a number to each category, but that is not effective when the difference between categories cannot be measured, because the numbers imply an order and a distance that do not exist. Instead, we can create a new feature for each category that holds a boolean value indicating whether the row belongs to that category. These new features are called dummy variables.
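To make the difference concrete, here is a minimal sketch comparing integer codes with dummy variables. The 'color' column is a hypothetical example and is not part of this recipe's dataset; dtype=int is passed only so that the dummy columns print as 0/1.

```python
import pandas as pd

# Hypothetical example column (an assumption, not the recipe's data)
colors = pd.Series(['red', 'green', 'blue', 'green'], name='color')

# Integer codes: pandas sorts the categories and numbers them,
# which implies an order (blue < green < red) that has no real meaning
codes = colors.astype('category').cat.codes

# Dummy variables: one indicator column per category, no implied order
dummies = pd.get_dummies(colors, dtype=int)

print(codes.tolist())
print(dummies)
```

Each row of the dummy table has exactly one 1, so no category is treated as larger or smaller than another.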

This Python source code does the following:
1. Creates a dictionary and converts it into a dataframe
2. Uses the "get_dummies" function for the encoding
3. Concatenates the encoded columns onto the original dataframe
4. Drops categorical variable column

So this is the recipe on how we can convert categorical variables into numerical variables in Python.

Step 1 - Import the library

import pandas as pd

We have only imported pandas, which is required to build the dataset.

Step 2 - Setting up the Data

We have created a dictionary and passed it through the pd.DataFrame to create a dataframe with columns 'name', 'episodes', 'gender'.

data = {'name': ['Sheldon', 'Penny', 'Amy', 'Penny', 'Raj', 'Sheldon'],
        'episodes': [42, 24, 31, 29, 37, 40],
        'gender': ['male', 'female', 'female', 'female', 'male', 'male']}
df = pd.DataFrame(data, columns = ['name','episodes', 'gender'])
print(df)

Step 3 - Making Dummy Variables and Printing the final Dataset

We can see that the 'gender' column has two categories, male and female, so we need one dummy column for each category. We pass that column to pd.get_dummies and store the result in df_gender. Finally, we add those columns to our original dataset.

df_gender = pd.get_dummies(df['gender'])
df_new = pd.concat([df, df_gender], axis=1)
print(df_new)

So the output comes as:

      name  episodes  gender
0  Sheldon        42    male
1    Penny        24  female
2      Amy        31  female
3    Penny        29  female
4      Raj        37    male
5  Sheldon        40    male

      name  episodes  gender  female  male
0  Sheldon        42    male       0     1
1    Penny        24  female       1     0
2      Amy        31  female       1     0
3    Penny        29  female       1     0
4      Raj        37    male       0     1
5  Sheldon        40    male       0     1
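The recipe objective also lists dropping the categorical column as a final step, but the printed result still contains 'gender'. A minimal sketch of that last step follows; the name df_final is introduced here for illustration. It also passes dtype=int, on the assumption that you want 0/1 columns as in the output above, since pandas 2.0 and later return True/False by default.

```python
import pandas as pd

data = {'name': ['Sheldon', 'Penny', 'Amy', 'Penny', 'Raj', 'Sheldon'],
        'episodes': [42, 24, 31, 29, 37, 40],
        'gender': ['male', 'female', 'female', 'female', 'male', 'male']}
df = pd.DataFrame(data, columns=['name', 'episodes', 'gender'])

# dtype=int requests 0/1 indicator columns instead of booleans
df_gender = pd.get_dummies(df['gender'], dtype=int)

# Append the dummy columns, then drop the original string column
df_final = pd.concat([df, df_gender], axis=1).drop(columns=['gender'])
print(df_final)
```

After this step every feature derived from 'gender' is numeric. The 'name' column is still a string and would need the same treatment before being passed to a model.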
