In machine learning projects it is often necessary to convert categorical text data into a numerical format. Categorical variables are those that take a limited number of fixed values, such as Country, Gender or Age group. These are usually stored as text. Many machine learning models, such as regression or SVM, are algebraic and need numerical input. Before these learning algorithms can be used on a dataset, its categorical columns have to be converted into numeric form.
This conversion is typically done as part of data preprocessing, alongside the exploratory data analysis (EDA) step in your machine learning project. Variables whose categories are only labeled, without any order of precedence, are referred to as nominal features. There are multiple ways to encode categorical values: replacing values, label encoding, one-hot encoding, binary encoding and backward difference encoding. The two most common are 1) label encoding and 2) one-hot encoding.
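As a minimal sketch of label encoding, the first of the two common approaches, the snippet below maps each distinct category to an integer using scikit-learn's LabelEncoder. The `countries` list is made-up illustration data, not part of the recipe.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample column of categorical text values
countries = ["USA", "India", "France", "India", "USA"]

# LabelEncoder assigns one integer per distinct value,
# numbering the values in sorted order
encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

print(list(encoder.classes_))  # distinct categories found
print(codes.tolist())          # integer code for each row
```

Note that the integers imply an ordering (here France < India < USA) that a nominal feature does not really have, which is one reason one-hot encoding is often preferred for such features. scikit-learn documents LabelEncoder as intended for target labels; OrdinalEncoder is the counterpart for input features.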
One-hot encoding in Python takes a column of categorical data and splits it into multiple columns, one for each distinct category value (for example male, female, USA). However many times a value is repeated in the original column, it is represented by a single new column. Each row then gets a 1 in the column that matches its category and a 0 in every other column.
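The idea can be sketched with pandas' `get_dummies`, one of several ways to one-hot encode a column. The `gender` column and its values are assumed sample data chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data with one categorical column
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# One new column per distinct category; 1 marks the row's category
one_hot = pd.get_dummies(df["gender"], dtype=int)

print(one_hot)
```

The original single column becomes two columns, `female` and `male`, and each row has exactly one 1.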
In the above recipe example, the column values are names of US states: Texas, Florida, Alabama, Delaware and California. First we create a MultiLabelBinarizer object. Then we fit and transform the array 'y' with the MultiLabelBinarizer object we just created.
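The two steps can be sketched as follows. The state names come from the recipe, but the exact contents of `y` here are an assumption made for illustration; only the variable name `y` and the use of MultiLabelBinarizer are taken from the description.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed sample data: each row holds one or more state labels
y = [
    ("Texas", "Florida"),
    ("California", "Alabama"),
    ("Texas", "Florida"),
    ("Delaware", "Florida"),
    ("Texas", "Alabama"),
]

# Step 1: create the binarizer object
mlb = MultiLabelBinarizer()

# Step 2: learn the set of classes and encode every row as a 0/1 vector
encoded = mlb.fit_transform(y)

print(list(mlb.classes_))  # column order of the encoded matrix
print(encoded)
```

Each column of `encoded` corresponds to one state in `mlb.classes_`, and a row has a 1 in every column whose state appears in that row.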