How to convert string categorical variables into numerical variables using Label Encoder?

This recipe helps you convert string categorical variables into numerical variables using Label Encoder


Recipe Objective

While working on a dataset, we often come across features that do not have numerical values or that contain multiple labels. These features make the data more understandable and readable for us, but most Machine Learning algorithms cannot work directly on non-numerical categorical data.

For training and predicting using Machine Learning algorithms, we have to change categorical data into numerical data, and this can be done easily with Label Encoding, which assigns an integer code to each distinct category.
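As a quick illustration of the idea, here is a minimal sketch that encodes a short list of strings. The list `temperatures` is made up for this example; the codes follow the sorted order of the distinct labels.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only (not the recipe's full dataset)
temperatures = ['low', 'medium', 'high', 'low']

le = LabelEncoder()
codes = le.fit_transform(temperatures)  # learn the labels and encode them in one call

# Labels are sorted before codes are assigned: high -> 0, low -> 1, medium -> 2
print(codes.tolist())  # [1, 2, 0, 1]
```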

This data science python source code does the following:
1. Convert categorical features into numerical.
2. Implementation of Label Encoding function.

So this recipe is a short example on how to convert categorical variables into numerical variables using Label Encoding. Let's get started.

Step 1 - Import the library - LabelEncoder

import pandas as pd
from sklearn.preprocessing import LabelEncoder

Here we have imported Pandas and LabelEncoder which will be used to convert the categorical variables into numerical variables.

Step 2 - Setup the Data

city_data = {
    'city_level': [1, 3, 1, 2, 2, 3, 1, 1, 2, 3],
    'city_pool': ['y', 'y', 'n', 'y', 'n', 'n', 'y', 'n', 'n', 'y'],
    'Rating': [1, 5, 3, 4, 1, 2, 3, 5, 3, 4],
    'City_port': [0, 1, 0, 1, 0, 0, 1, 1, 0, 1],
    'city_temperature': ['low', 'medium', 'medium', 'high', 'low',
                         'low', 'medium', 'medium', 'high', 'low'],
}
df = pd.DataFrame(city_data, columns=['city_level', 'city_pool', 'Rating', 'City_port', 'city_temperature'])

Here we have created a simple dataset and converted it to a dataframe. It is a dataset of cities with the features city_level, city_pool, Rating, City_port and city_temperature, which become the columns of the dataframe.

Clearly, we can see that the features city_pool and city_temperature have non-numerical values. So these two features are categorical features.
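Rather than spotting the string columns by eye, the column dtypes can be inspected. The sketch below uses a three-row subset of the recipe's data and pandas' `is_numeric_dtype` helper; the variable name `non_numeric` is only illustrative.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Three-row subset of the recipe's data, for illustration
df = pd.DataFrame({
    'city_level': [1, 3, 1],
    'city_pool': ['y', 'y', 'n'],
    'city_temperature': ['low', 'medium', 'medium'],
})

# Show the dtype pandas inferred for every column
print(df.dtypes)

# Collect the columns whose dtype is not numeric
non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
print(non_numeric)  # ['city_pool', 'city_temperature']
```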

Step 3 - Create a function for LabelEncoder

We have created a function named 'Encoder', in which we select the columns having categorical values and perform Label Encoding on them.

def Encoder(df):
    # Names of the columns with dtype 'category' or 'object'
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            # Replace the column with its integer codes
            df[feature] = le.fit_transform(df[feature])
        except:
            print('Error encoding ' + feature)
    return df

Now let us try to understand each statement of the function.
First, we have created an object 'columnsToEncode', a list of the columns that hold categorical values, i.e. the columns having data type 'category' or 'object'. Calling list() on the selected dataframe returns its column names.
columnsToEncode = list(df.select_dtypes(include=['category','object']))
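To see what this statement returns, here is a small sketch on a toy dataframe. The two string columns are given the dtypes 'object' and 'category' explicitly, as an assumption for the example, so that both of the dtypes named in the include list are exercised.

```python
import pandas as pd

# Toy frame: one numeric column, one 'object' column, one 'category' column
df = pd.DataFrame({
    'city_level': [1, 3, 1],
    'city_pool': pd.Series(['y', 'y', 'n'], dtype=object),
    'city_temperature': pd.Series(['low', 'medium', 'medium'], dtype='category'),
})

# select_dtypes returns a dataframe holding only the matching columns
selected = df.select_dtypes(include=['category', 'object'])

# Iterating over a dataframe yields its column labels, so list() gives the names
columns_to_encode = list(selected)
print(columns_to_encode)  # ['city_pool', 'city_temperature']
```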

Now we have to use LabelEncoder, so let us have a look at its parameters and attributes.
LabelEncoder takes no parameters and exposes one attribute once it has been fitted:

  • classes_ : The array of distinct labels (categorical values) seen during fitting, in sorted order. The position of a label in this array is its numerical code.
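The following sketch shows how classes_ relates labels to codes, and how inverse_transform maps codes back to labels. The input values and the `mapping` dictionary are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['y', 'n', 'y', 'n'])  # illustrative values

# classes_ holds the distinct labels in sorted order
print(le.classes_.tolist())  # ['n', 'y']

# The index of each label in classes_ is its code
mapping = {label: code for code, label in enumerate(le.classes_.tolist())}
print(mapping)  # {'n': 0, 'y': 1}

# transform encodes labels; inverse_transform recovers them from codes
encoded = le.transform(['y', 'n']).tolist()
decoded = le.inverse_transform([0, 1]).tolist()
print(encoded, decoded)  # [1, 0] ['n', 'y']
```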

So we have created a LabelEncoder object with no parameters.
le = LabelEncoder()

We have created a loop which iterates over the columns in the list 'columnsToEncode'. Inside the loop we have used a try and except statement, which consists of two blocks, 'try' and 'except'. The statements inside the try block execute first, and the except block runs only if they raise an error. In the try block we call the LabelEncoder fit_transform method with the argument df[feature] and assign the result back to the column; in the except block there is a print statement that reports the column that could not be encoded.
for feature in columnsToEncode:
    try:
        df[feature] = le.fit_transform(df[feature])
    except:
        print('Error encoding ' + feature)
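Because the loop refits the same encoder on every column, its classes_ attribute ends up describing only the last column encoded. If the label-to-code mapping of each column is needed later, one possible variation is to keep a separate encoder per column. The helper `encode_columns` below is hypothetical (not part of the recipe or of scikit-learn); it takes the column names explicitly and works on a copy of the dataframe.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_columns(df, columns):
    """Hypothetical helper: label-encode the given columns on a copy of df.

    Returns the encoded copy and a dict mapping column name -> fitted encoder.
    """
    encoded = df.copy()
    encoders = {}
    for col in columns:
        le = LabelEncoder()  # a fresh encoder for each column
        encoded[col] = le.fit_transform(encoded[col])
        encoders[col] = le
    return encoded, encoders


# Four-row subset of the recipe's data, for illustration
df = pd.DataFrame({
    'city_pool': ['y', 'n', 'y', 'n'],
    'city_temperature': ['low', 'medium', 'high', 'low'],
})

encoded, encoders = encode_columns(df, ['city_pool', 'city_temperature'])
print(encoded)

# Each column keeps its own mapping
print(encoders['city_pool'].classes_.tolist())         # ['n', 'y']
print(encoders['city_temperature'].classes_.tolist())  # ['high', 'low', 'medium']
```

Keeping the fitted encoders also makes it possible to call inverse_transform on a column later, or to apply the same mapping to new data with transform.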

Now we pass our dataframe through the function. Note that the function modifies the dataframe it receives and also returns it.
df = Encoder(df)

Step 4 - Let's look at our dataset now

Once we run the above code snippet, we will see that the categorical values in the features city_pool and city_temperature have been converted into numerical values.

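For example, in city_temperature, low has been represented by 1, medium by 2 and high by 0, because the codes follow the alphabetical order of the labels.

In the output below, each row holds the values of one column of the dataframe, in the order city_level, city_pool, Rating, City_port and city_temperature.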

[['1.0' '3.0' '1.0' '2.0' '2.0' '3.0' '1.0' '1.0' '2.0' '3.0']
 ['1.0' '1.0' '0.0' '1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0']
 ['1.0' '5.0' '3.0' '4.0' '1.0' '2.0' '3.0' '5.0' '3.0' '4.0']
 ['0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1.0' '0.0' '1.0']
 ['1.0' '2.0' '2.0' '0.0' '1.0' '1.0' '2.0' '2.0' '0.0' '1.0']]
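To reproduce these values, here is an end-to-end sketch on the recipe's dataset. It names the two string columns explicitly instead of selecting them by dtype, and it prints one column per line, so the formatting differs from the listing above (plain integers rather than quoted floats).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

city_data = {
    'city_level': [1, 3, 1, 2, 2, 3, 1, 1, 2, 3],
    'city_pool': ['y', 'y', 'n', 'y', 'n', 'n', 'y', 'n', 'n', 'y'],
    'Rating': [1, 5, 3, 4, 1, 2, 3, 5, 3, 4],
    'City_port': [0, 1, 0, 1, 0, 0, 1, 1, 0, 1],
    'city_temperature': ['low', 'medium', 'medium', 'high', 'low',
                         'low', 'medium', 'medium', 'high', 'low'],
}
df = pd.DataFrame(city_data)

# Encode the two string columns, each with its own encoder
for col in ['city_pool', 'city_temperature']:
    df[col] = LabelEncoder().fit_transform(df[col])

# Print each column as one row, in the same order as the listing above
for col in df.columns:
    print(col, df[col].tolist())
```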
