# How to convert Categorical variables into Numerical Variables using ColumnTransformer?


This recipe is a short example of how to convert categorical variables into numerical variables using ColumnTransformer.

When we talk about categorical variables, we mean non-numerical values such as strings or text. For example, a city name or a store name - anything that is not a number is a categorical variable. For us humans it is easy to interpret and understand text, but that is not the case for machines.

In Machine Learning, most of the algorithms in the sklearn library cannot handle categorical variables directly, so before we give the data to an algorithm for training and predicting we have to convert it to numbers.

Let's get started.

```
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
```

Let's pause and look at these imports. Numpy and Pandas are the usual ones. If you've looked at the older recipes, LabelEncoder and OneHotEncoder were typically used to convert categorical variables. But here we are using ColumnTransformer - why?

In the newer versions of sklearn, starting from 0.20, ColumnTransformer was introduced. Prior to this, the conversion was a two-step process: first we had to label encode the data and then one hot encode it. ColumnTransformer combines these steps, so you can do the entire conversion with just one line of code as shown below.
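For reference, the older two-step approach looked roughly like this - a minimal sketch, where LabelEncoder first maps each city string to an integer and OneHotEncoder then expands that integer column (the city values here are just illustrative):

```
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

cities = np.array(['Texas', 'Delaware', 'Florida', 'Texas'])

# Step 1: label encode the strings into integers
# (categories are sorted alphabetically: Delaware=0, Florida=1, Texas=2)
labels = LabelEncoder().fit_transform(cities)
print(labels)  # [2 0 1 2]

# Step 2: one hot encode the integer column
# .toarray() converts the sparse result into a dense array
onehot = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()
print(onehot)
```

With ColumnTransformer, both steps collapse into a single transformer applied to the chosen columns.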

```
dataset = {'research': [15000, 90000, 250000, 175000, 88000, 210000],
           'marketing': [5000, 25000, 31000, 44000, 19700, 21111],
           'city': ['Texas', 'Delaware', 'Florida', 'Texas', 'Delaware', 'Florida'],
           'profit': [9999, 5555, 3333, 4444, 1111, 2222]}
df = pd.DataFrame(dataset)
```

Let us create a simple dataset and convert it to a DataFrame. This sample data shows how much a company spends on research and marketing, along with the state where the company operates. The last variable is the profit the company makes. In a real example we would be predicting the profit, but this recipe will not go into that.

Now our dataset is ready.

Before we create the ColumnTransformer object, let's look at the important parameters that we need to pass.

1) transformers

This is a list of (name, transformer, columns) tuples:

- name: just a label that we give to the transformer
- transformer: an estimator that supports fit and transform. Since we want to encode the data, we will be passing OneHotEncoder here.
- columns: the indices of the columns that contain the categorical values we want to convert

2) remainder

This parameter tells the transformer what to do with the remaining columns, i.e. the ones not mentioned above for converting. The values can be "drop" or "passthrough". The default is "drop", which means only the transformed columns are returned and the remaining columns are dropped. If we want the transformer to keep them, we have to use the value "passthrough".
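To see the difference between the two values, here is a small sketch using a hypothetical two-column DataFrame (only the shapes matter here):

```
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({'city': ['Texas', 'Delaware'], 'profit': [9999, 5555]})

# remainder='drop' (the default): only the encoded columns come back
dropped = ColumnTransformer([('encoder', OneHotEncoder(), [0])]).fit_transform(df)
print(dropped.shape)  # (2, 2) -- just the two one-hot columns

# remainder='passthrough': the untouched columns are appended at the end
kept = ColumnTransformer([('encoder', OneHotEncoder(), [0])],
                         remainder='passthrough').fit_transform(df)
print(kept.shape)  # (2, 3) -- the one-hot columns plus profit
```

So with "drop" the profit column disappears from the output, while "passthrough" carries it along after the encoded columns.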

Now that we understand the parameters, let's create the object.

```
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [2])], remainder='passthrough')
```

- As you can see, we have named the transformer as "encoder"
- For the estimator, we are passing an object of OneHotEncoder
- And for columns, we can see from our dataset that the third column is the city column that we want to transform. Since indexing starts at 0, we pass the value 2 here
- Lastly for remainder we are passing "passthrough" as we want the transformer to return the remaining columns

Now that we have the ColumnTransformer object ready, we just have to call its fit_transform method and pass the dataset to it to do the conversion.

```
df = np.array(columnTransformer.fit_transform(df), dtype=str)
```

Once we run the above code snippet, we will see that all the states have been converted to numbers, with the encoded columns placed at the beginning of the array.

For example:

- Texas is represented by 0, 0, 1
- Delaware by 1, 0, 0
- Florida by 0, 1, 0

(OneHotEncoder sorts the categories alphabetically - Delaware, Florida, Texas - which is why the columns appear in this order.)

```
print(df)
```

[['0.0' '0.0' '1.0' '15000.0' '5000.0' '9999.0']
 ['1.0' '0.0' '0.0' '90000.0' '25000.0' '5555.0']
 ['0.0' '1.0' '0.0' '250000.0' '31000.0' '3333.0']
 ['0.0' '0.0' '1.0' '175000.0' '44000.0' '4444.0']
 ['1.0' '0.0' '0.0' '88000.0' '19700.0' '1111.0']
 ['0.0' '1.0' '0.0' '210000.0' '21111.0' '2222.0']]
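If you ever want to confirm which one-hot column corresponds to which state, the fitted encoder stores the category order it used. A short self-contained sketch (re-fitting on a fresh copy of the same data):

```
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({
    'research': [15000, 90000, 250000, 175000, 88000, 210000],
    'marketing': [5000, 25000, 31000, 44000, 19700, 21111],
    'city': ['Texas', 'Delaware', 'Florida', 'Texas', 'Delaware', 'Florida'],
    'profit': [9999, 5555, 3333, 4444, 1111, 2222]})

ct = ColumnTransformer([('encoder', OneHotEncoder(), [2])],
                       remainder='passthrough')
ct.fit(df)

# The fitted OneHotEncoder keeps the category order it used,
# which maps directly onto the one-hot columns in the output
print(ct.named_transformers_['encoder'].categories_)
# [array(['Delaware', 'Florida', 'Texas'], dtype=object)]
```

This confirms that the first one-hot column is Delaware, the second Florida, and the third Texas.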
