How to convert string categorical variables into numerical variables using Label Encoder?
DATA MUNGING DATA CLEANING PYTHON MACHINE LEARNING RECIPES PANDAS CHEATSHEET     ALL TAGS

How to convert string categorical variables into numerical variables using Label Encoder?

How to convert string categorical variables into numerical variables using Label Encoder?

This recipe helps you convert string categorical variables into numerical variables using Label Encoder

4

Recipe Objective

Many a times while working on a dataset we come across many features that does not have numerical values or which contains multiple labels. These features make the data more understandable and readable for us but the Machine Learning algorithms cannot work on categorical data.

For training and predicting using Machine Learning Algorithms, we have to change categorical data into numerical data and this can be done easily by Label Encoding.

This data science python source code does the following:
1. Convert categorical features into numerical.
2. Implementation of Label Encoding function.

So this recipe is a short example on how to convert categorical variables into numerical variables using Label Encoding. Let's get started.

Step 1 - Import the library - LabelEncoder

import pandas as pd from sklearn.preprocessing import LabelEncoder

Here we have imported Pandas and LabelEncoder which will be used to convert the categorical variables into numerical variables.

Step 2 - Setup the Data

city_data = {'city_level': [1, 3, 1, 2, 2, 3, 1, 1, 2, 3], 'city_pool' : ['y','y','n','y','n','n','y','n','n','y'], 'Rating': [1, 5, 3, 4, 1, 2, 3, 5, 3, 4], 'City_port': [0, 1, 0, 1, 0, 0, 1, 1, 0, 1], 'city_temperature': ['low', 'medium', 'medium', 'high', 'low','low', 'medium', 'medium', 'high', 'low']} df = pd.DataFrame(city_data, columns = ['city_level', 'city_pool', 'Rating', 'City_port', 'city_temperature'])

Let us create a simple dataset and convert it to a dataframe. This is a dataset of city with different features in it like City_level, City_pool, Rating, City_port and City_Temperature. We have converted this dataset into a dataframe with its features as columns.

Clearly, we can see that the features City_pool and City_Temperature have non numerical values. So these two features are categorical features.

Step 3 - Create a function for LabelEncoder

We have created a function named 'Encoder'. In which we will be selecting the columns having categorical values and will perform Label Encoding.

def Encoder(df): columnsToEncode = list(df.select_dtypes(include=['category','object'])) le = LabelEncoder() for feature in columnsToEncode: try: df[feature] = le.fit_transform(df[feature]) except: print('Error encoding '+feature) return df

Now Let us try to understand each statement of the function.
Initially in the function, we have created an object 'columnsToEncode' which will make a list of columns that have of categorical values i.e. the columns having data type 'category' or 'object'. columnsToEncode = list(df.select_dtypes(include=['category','object']))

Now we have to use LabelEncoder. So let us have a look on the parameters and the attributes which we need to pass.
There is one attribute and zero parameter for LabelEncoder. The attribute is:

  • classes_ : It is the array of labels or categorical values.

So, We have created an object for LabelEncoder with no parameters. le = LabelEncoder()

We have created a loop which will iterate over the columns from the list 'columnsToEncode'. In the loop we have used try and except function which consists of 2 blocks, 'try' and 'except'. It works in a manner that first statements inside the try block will execute and if it have some error then only except block will be executed. In the try block we have used the LabelEncoder fit_transform method with the attribute df[feature] and in the except block there is a print statement. for feature in columnsToEncode: try: df[feature] = le.fit_transform(df[feature]) except: print('Error encoding '+feature)

Now we have passed our dataframe through the function. df = Encoder(df)

Step 5 - Lets look at our dataset now

Once we run the above code snippet, we will see that the categorical values in the features City_pool and City_Temperature have been converted into numberical values.

For example in City_Temperature:
low has been represented by 1, medium by 2 and high by 0.

print(df)
[['1.0' '3.0' '1.0' '2.0' '2.0' '3.0' '1.0' '1.0' '2.0' '3.0']
 ['1.0' '1.0' '0.0' '1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0']
 ['1.0' '5.0' '3.0' '4.0' '1.0' '2.0' '3.0' '5.0' '3.0' '4.0']
 ['0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1.0' '0.0' '1.0']
 ['1.0' '2.0' '2.0' '0.0' '1.0' '1.0' '2.0' '2.0' '0.0' '1.0']]

Relevant Projects

Forecast Inventory demand using historical sales data in R
In this machine learning project, you will develop a machine learning model to accurately forecast inventory demand based on historical sales data.

German Credit Dataset Analysis to Classify Loan Applications
In this data science project, you will work with German credit dataset using classification techniques like Decision Tree, Neural Networks etc to classify loan applications using R.

Learn to prepare data for your next machine learning project
Text data requires special preparation before you can start using it for any machine learning project.In this ML project, you will learn about applying Machine Learning models to create classifiers and learn how to make sense of textual data.

Predict Churn for a Telecom company using Logistic Regression
Machine Learning Project in R- Predict the customer churn of telecom sector and find out the key drivers that lead to churn. Learn how the logistic regression model using R can be used to identify the customer churn in telecom dataset.

Human Activity Recognition Using Smartphones Data Set
In this deep learning project, you will build a classification system where to precisely identify human fitness activities.

Ensemble Machine Learning Project - All State Insurance Claims Severity Prediction
In this ensemble machine learning project, we will predict what kind of claims an insurance company will get. This is implemented in python using ensemble machine learning algorithms.

Customer Churn Prediction Analysis using Ensemble Techniques
In this machine learning churn project, we implement a churn prediction model in python using ensemble techniques.

Solving Multiple Classification use cases Using H2O
In this project, we are going to talk about H2O and functionality in terms of building Machine Learning models.

Predict Census Income using Deep Learning Models
In this project, we are going to work on Deep Learning using H2O to predict Census income.

Choosing the right Time Series Forecasting Methods
There are different time series forecasting methods to forecast stock price, demand etc. In this machine learning project, you will learn to determine which forecasting method to be used when and how to apply with time series forecasting example.