How to convert categorical variables into numerical variables in Python?

This recipe helps you convert categorical variables into numerical variables in Python


Recipe Objective

Machine learning models cannot work directly on categorical variables stored as strings, so we need to convert them into numerical form. We could assign a number to each category, but that is often ineffective when the categories have no measurable order or distance, because the model may treat the numbers as ranked. Instead, we can create one new feature per category holding a boolean (0/1) value that marks whether a row belongs to that category. These new features are called dummy variables, and this recipe uses them.
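
To illustrate the difference between plain integer codes and dummy variables, here is a minimal sketch. The 'color' column and its values are hypothetical and are not part of this recipe's dataset; pd.factorize is used only as one possible way to assign integer codes.

```python
import pandas as pd

# Hypothetical nominal column (assumption: not from the recipe's dataset)
colors = pd.Series(['red', 'green', 'blue', 'green'], name='color')

# Integer codes: each category gets a number in order of first appearance.
# The numbers suggest an ordering (red < green < blue) that does not exist.
codes, uniques = pd.factorize(colors)
print(list(codes))
print(list(uniques))

# Dummy variables: one 0/1 column per category, with no implied ordering.
# dtype=int forces 0/1 instead of the True/False default of newer pandas.
dummies = pd.get_dummies(colors, dtype=int)
print(dummies)
```

Each row of the dummy table has exactly one column set to 1, so no category is treated as larger or smaller than another.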

This Python source code does the following:
1. Creates a dictionary and converts it into a dataframe
2. Uses the "get_dummies" function for the encoding
3. Concatenates the encoded columns onto the original dataframe
4. Optionally drops the original categorical column (the code in this recipe keeps it)

So this is the recipe on how we can convert categorical variables into numerical variables in Python.

Step 1 - Import the library

import pandas as pd

We have imported only pandas, which is required to build and manipulate the dataset.

Step 2 - Setting up the Data

We have created a dictionary and passed it to pd.DataFrame to create a dataframe with the columns 'name', 'episodes' and 'gender'.

data = {'name': ['Sheldon', 'Penny', 'Amy', 'Penny', 'Raj', 'Sheldon'],
        'episodes': [42, 24, 31, 29, 37, 40],
        'gender': ['male', 'female', 'female', 'female', 'male', 'male']}
df = pd.DataFrame(data, columns = ['name','episodes', 'gender'])
print(df)
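
Before encoding, it can help to check which columns are non-numeric and which categories they contain. The following is a minimal sketch of one way to do that; the variable name categorical_cols is an assumption introduced here for illustration.

```python
import pandas as pd

data = {'name': ['Sheldon', 'Penny', 'Amy', 'Penny', 'Raj', 'Sheldon'],
        'episodes': [42, 24, 31, 29, 37, 40],
        'gender': ['male', 'female', 'female', 'female', 'male', 'male']}
df = pd.DataFrame(data, columns=['name', 'episodes', 'gender'])

# Columns that are not numeric are candidates for encoding
# (categorical_cols is a hypothetical helper name)
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()
print(categorical_cols)

# List the distinct categories and how often each one occurs
print(df['gender'].unique())
print(df['gender'].value_counts())
```

Here 'name' is also non-numeric, but it acts as an identifier rather than a feature, so the recipe encodes only 'gender'.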

Step 3 - Making Dummy Variables and Printing the final Dataset

We can clearly observe that the column 'gender' has two categories, male and female, so we need to create one dummy column for each of them. We pass that column to the get_dummies function and store the result in df_gender. Finally, we add the new columns to our original dataset.

df_gender = pd.get_dummies(df['gender'])
df_new = pd.concat([df, df_gender], axis=1)
print(df_new)

The two print calls (from Step 2 and Step 3) produce the output below. Note that in recent pandas versions (2.0 and later) get_dummies returns boolean True/False columns by default; passing dtype=int to get_dummies gives the 0/1 values shown here.

      name  episodes  gender
0  Sheldon        42    male
1    Penny        24  female
2      Amy        31  female
3    Penny        29  female
4      Raj        37    male
5  Sheldon        40    male

      name  episodes  gender  female  male
0  Sheldon        42    male       0     1
1    Penny        24  female       1     0
2      Amy        31  female       1     0
3    Penny        29  female       1     0
4      Raj        37    male       0     1
5  Sheldon        40    male       0     1
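
The output above still contains the original 'gender' column. A minimal sketch of the remaining step, dropping that column once the dummies are in place, could look like this. The names df_final and df_alt are assumptions introduced for illustration, and dtype=int is passed so the dummy columns hold 0/1 regardless of the pandas version.

```python
import pandas as pd

data = {'name': ['Sheldon', 'Penny', 'Amy', 'Penny', 'Raj', 'Sheldon'],
        'episodes': [42, 24, 31, 29, 37, 40],
        'gender': ['male', 'female', 'female', 'female', 'male', 'male']}
df = pd.DataFrame(data, columns=['name', 'episodes', 'gender'])

# Encode the column, attach the dummies, then drop the original strings
df_gender = pd.get_dummies(df['gender'], dtype=int)
df_final = pd.concat([df, df_gender], axis=1).drop(columns=['gender'])
print(df_final)

# Alternative: let get_dummies encode and replace the column in one call.
# The new columns are prefixed with the source column name.
df_alt = pd.get_dummies(df, columns=['gender'], dtype=int)
print(df_alt)
```

The one-call form is shorter and names the new columns 'gender_female' and 'gender_male', which keeps their origin clear when several columns are encoded.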
