How to convert string categorical variables into numerical variables using Label Encoder?

How to convert string categorical variables into numerical variables using Label Encoder?

How to convert string categorical variables into numerical variables using Label Encoder?

This recipe helps you convert string categorical variables into numerical variables using Label Encoder


Recipe Objective

Many a times while working on a dataset we come across many features that does not have numerical values or which contains multiple labels. These features make the data more understandable and readable for us but the Machine Learning algorithms cannot work on categorical data.

For training and predicting using Machine Learning Algorithms, we have to change categorical data into numerical data and this can be done easily by Label Encoding.

This data science python source code does the following:
1. Convert categorical features into numerical.
2. Implementation of Label Encoding function.

So this recipe is a short example on how to convert categorical variables into numerical variables using Label Encoding. Let's get started.

Step 1 - Import the library - LabelEncoder

import pandas as pd from sklearn.preprocessing import LabelEncoder

Here we have imported Pandas and LabelEncoder which will be used to convert the categorical variables into numerical variables.

Step 2 - Setup the Data

city_data = {'city_level': [1, 3, 1, 2, 2, 3, 1, 1, 2, 3], 'city_pool' : ['y','y','n','y','n','n','y','n','n','y'], 'Rating': [1, 5, 3, 4, 1, 2, 3, 5, 3, 4], 'City_port': [0, 1, 0, 1, 0, 0, 1, 1, 0, 1], 'city_temperature': ['low', 'medium', 'medium', 'high', 'low','low', 'medium', 'medium', 'high', 'low']} df = pd.DataFrame(city_data, columns = ['city_level', 'city_pool', 'Rating', 'City_port', 'city_temperature'])

Let us create a simple dataset and convert it to a dataframe. This is a dataset of city with different features in it like City_level, City_pool, Rating, City_port and City_Temperature. We have converted this dataset into a dataframe with its features as columns.

Clearly, we can see that the features City_pool and City_Temperature have non numerical values. So these two features are categorical features.

Step 3 - Create a function for LabelEncoder

We have created a function named 'Encoder'. In which we will be selecting the columns having categorical values and will perform Label Encoding.

def Encoder(df): columnsToEncode = list(df.select_dtypes(include=['category','object'])) le = LabelEncoder() for feature in columnsToEncode: try: df[feature] = le.fit_transform(df[feature]) except: print('Error encoding '+feature) return df

Now Let us try to understand each statement of the function.
Initially in the function, we have created an object 'columnsToEncode' which will make a list of columns that have of categorical values i.e. the columns having data type 'category' or 'object'. columnsToEncode = list(df.select_dtypes(include=['category','object']))

Now we have to use LabelEncoder. So let us have a look on the parameters and the attributes which we need to pass.
There is one attribute and zero parameter for LabelEncoder. The attribute is:

  • classes_ : It is the array of labels or categorical values.

So, We have created an object for LabelEncoder with no parameters. le = LabelEncoder()

We have created a loop which will iterate over the columns from the list 'columnsToEncode'. In the loop we have used try and except function which consists of 2 blocks, 'try' and 'except'. It works in a manner that first statements inside the try block will execute and if it have some error then only except block will be executed. In the try block we have used the LabelEncoder fit_transform method with the attribute df[feature] and in the except block there is a print statement. for feature in columnsToEncode: try: df[feature] = le.fit_transform(df[feature]) except: print('Error encoding '+feature)

Now we have passed our dataframe through the function. df = Encoder(df)

Step 5 - Lets look at our dataset now

Once we run the above code snippet, we will see that the categorical values in the features City_pool and City_Temperature have been converted into numberical values.

For example in City_Temperature:
low has been represented by 1, medium by 2 and high by 0.

[['1.0' '3.0' '1.0' '2.0' '2.0' '3.0' '1.0' '1.0' '2.0' '3.0']
 ['1.0' '1.0' '0.0' '1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0']
 ['1.0' '5.0' '3.0' '4.0' '1.0' '2.0' '3.0' '5.0' '3.0' '4.0']
 ['0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1.0' '0.0' '1.0']
 ['1.0' '2.0' '2.0' '0.0' '1.0' '1.0' '2.0' '2.0' '0.0' '1.0']]

Relevant Projects

Natural language processing Chatbot application using NLTK for text classification
In this NLP AI application, we build the core conversational engine for a chatbot. We use the popular NLTK text classification library to achieve this.

Zillow’s Home Value Prediction (Zestimate)
Data Science Project in R -Build a machine learning algorithm to predict the future sale prices of homes.

Data Science Project on Wine Quality Prediction in R
In this R data science project, we will explore wine dataset to assess red wine quality. The objective of this data science project is to explore which chemical properties will influence the quality of red wines.

Ecommerce product reviews - Pairwise ranking and sentiment analysis
This project analyzes a dataset containing ecommerce product reviews. The goal is to use machine learning models to perform sentiment analysis on product reviews and rank them based on relevance. Reviews play a key role in product recommendation systems.

Machine Learning project for Retail Price Optimization
In this machine learning pricing project, we implement a retail price optimization algorithm using regression trees. This is one of the first steps to building a dynamic pricing model.

Data Science Project-TalkingData AdTracking Fraud Detection
Machine Learning Project in R-Detect fraudulent click traffic for mobile app ads using R data science programming language.

Machine Learning Project to Forecast Rossmann Store Sales
In this machine learning project you will work on creating a robust prediction model of Rossmann's daily sales using store, promotion, and competitor data.

Predict Churn for a Telecom company using Logistic Regression
Machine Learning Project in R- Predict the customer churn of telecom sector and find out the key drivers that lead to churn. Learn how the logistic regression model using R can be used to identify the customer churn in telecom dataset.

Perform Time series modelling using Facebook Prophet
In this project, we are going to talk about Time Series Forecasting to predict the electricity requirement for a particular house using Prophet.

Credit Card Fraud Detection as a Classification Problem
In this data science project, we will predict the credit card fraud in the transactional dataset using some of the predictive models.