How to convert Categorical variables into Numerical Variables using ColumnTransformer in python

This recipe is a short example on how to convert categorical variables into numerical variables using ColumnTransformer in python

Recipe Objective

When we talk about categorical variables, we are talking about non numerical values like Strings or Text. For example, a city name, a store name - for that matter anything that is not a number is a categorical variable. For us humans, it is very easy to interpret and understand text - but for machines it is not the case.

In Machine Learning, all the algorithms in the sklearn library cannot handle categorical variables so before we give the data to the algorithm for training and predicting - we have to convert it to numbers.

So this recipe is a short example on how to convert categorical variables into numerical variables. Let's get started.

Step 1 - Import the library - ColumnTransformer

import numpy as np import pandas as pd from sklearn.preprocessing import OneHotEncoder from sklearn.compose import ColumnTransformer

Let's pause and look at these imports. Numpy and Pandas are the usual ones. If you've looked at the older recipes , LabelEncoder and OneHotEncoder was typically used to convert categorical variables. But here we are using ColumnTransformer - why?

In the newer versions of sklearn starting from 0.20 - they have introduced ColumnTransformer. Prior to this, doing this conversion was a 2 step process. First we have to label encode the data and then one hot encode it. But now with ColumnTransformer, they have combined this - so you can do the entire conversion with just one line of code as shown in Step 3 below.

Step 2 - Setup the Data

dataset = {'research': [15000, 90000, 250000, 175000, 88000, 210000], 'marketing': [5000, 25000, 31000, 44000, 19700, 21111], 'city': ['Texas', 'Delaware', 'Florida', 'Texas', 'Delaware','Florida'], 'profit': [9999, 5555, 3333, 4444, 1111, 2222] df = pd.DataFrame(dataset)

Let us create a simple dataset and convert it to a dataframe. This sample data shows how much a company spends on research and marketing. We will add the state where the company is present. The last variable is the profit that company makes. So in a real example, we will be predicting the profit. But in this recipe, we will not be going into that.

Now our dataset is ready

Step 3 - Create an object of ColumnTransformer class

Before we do that, let's look at the important parameters that we need to pass.


1) transformer

  • name: this is just a name that we pass to the transformer
  • transformer: here we need to pass an estimator that supports fit and transform. Since we want to encode the data, we will be passing OneHotEncoder here.
  • columns: the index of columns that contain the categorical values that we want to convert

2) remainder
This parameter basically tells the transformer what to do with the remaining columns other than the ones we have mentioned above for converting. The values can be "drop" or "passthrough". The default value is drop which means only the transformed columns will be returned by the transformer and the remaining columns will be dropped. But if we want the transformer to pass them - we have to use the value "passthrough"

Now that we understand, let's create the object

columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [2])], remainder='passthrough')

  • As you can see, we have named the transformer as "encoder"
  • For the estimator, we are passing an object of OneHotEncoder
  • And for columns, we can see from our dataset that the 3rd column is the city column that we want to transform. Since the index start at 0, we are passing value 2 here
  • Lastly for remainder we are passing "passthrough" as we want the transformer to return the remaining columns

Step 4 - Convert the categorical data with one line of code

Now that we have got the ColumnTransformer constructor ready, we just have to call the fit_transform method and pass the dataset to it to do the conversion

df = np.array(columnTransformer.fit_transform(df), dtype = np.str)

Step 5 - Lets look at our dataset now

Once we run the above code snippet, we will see that all States have been converted to numbers and added to the beginning.

For example :
Texas has been represented by 0,0,1
Delaware by 1,0, 0
Florida by 0,1,0.

print(df)

[['0.0' '0.0' '1.0' '15000.0' '5000.0' '9999.0']
 ['1.0' '0.0' '0.0' '90000.0' '25000.0' '5555.0']
 ['0.0' '1.0' '0.0' '250000.0' '31000.0' '3333.0']
 ['0.0' '0.0' '1.0' '175000.0' '44000.0' '4444.0']
 ['1.0' '0.0' '0.0' '88000.0' '19700.0' '1111.0']
 ['0.0' '1.0' '0.0' '210000.0' '21111.0' '2222.0']]

What Users are saying..

profile image

Anand Kumpatla

Sr Data Scientist @ Doubleslash Software Solutions Pvt Ltd
linkedin profile url

ProjectPro is a unique platform and helps many people in the industry to solve real-life problems with a step-by-step walkthrough of projects. A platform with some fantastic resources to gain... Read More

Relevant Projects

Census Income Data Set Project-Predict Adult Census Income
Use the Adult Income dataset to predict whether income exceeds 50K yr based oncensus data.

Multi-Class Text Classification with Deep Learning using BERT
In this deep learning project, you will implement one of the most popular state of the art Transformer models, BERT for Multi-Class Text Classification

Build a Text Generator Model using Amazon SageMaker
In this Deep Learning Project, you will train a Text Generator Model on Amazon Reviews Dataset using LSTM Algorithm in PyTorch and deploy it on Amazon SageMaker.

Recommender System Machine Learning Project for Beginners-3
Content Based Recommender System Project - Building a Content-Based Product Recommender App with Streamlit

Build Regression (Linear,Ridge,Lasso) Models in NumPy Python
In this machine learning regression project, you will learn to build NumPy Regression Models (Linear Regression, Ridge Regression, Lasso Regression) from Scratch.

House Price Prediction Project using Machine Learning in Python
Use the Zillow Zestimate Dataset to build a machine learning model for house price prediction.

Build a Multi Class Image Classification Model Python using CNN
This project explains How to build a Sequential Model that can perform Multi Class Image Classification in Python using CNN

AWS MLOps Project to Deploy a Classification Model [Banking]
In this AWS MLOps project, you will learn how to deploy a classification model using Flask on AWS.

AWS MLOps Project for ARCH and GARCH Time Series Models
Build and deploy ARCH and GARCH time series forecasting models in Python on AWS .

CycleGAN Implementation for Image-To-Image Translation
In this GAN Deep Learning Project, you will learn how to build an image to image translation model in PyTorch with Cycle GAN.