What is a hashing vectorizer?

This recipe explains what a hashing vectorizer is.


Recipe Objective

What is a hashing vectorizer?

A hashing vectorizer is a vectorizer that uses the hashing trick to map each token string to an integer feature index. It converts a collection of text documents into a sparse matrix holding the token occurrence counts. In scikit-learn's HashingVectorizer the counts are L2-normalized per document by default, and a sign derived from the hash is applied to each count, so the resulting matrix can contain fractional and negative values. The advantages of a hashing vectorizer are:

Because no vocabulary dictionary has to be stored in memory, it uses very little memory and scales to large data sets. Because no state is computed during fit, it can be used in a streaming or parallel pipeline. It is also fast to pickle and unpickle, since it holds no state besides its constructor parameters.
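The hashing trick described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names hash_index and hashed_counts are hypothetical, and MD5 from the standard library stands in for the signed 32-bit MurmurHash3 that scikit-learn uses, so the column indices will not match those produced by HashingVectorizer.

```python
import hashlib
import re


def hash_index(token, n_features):
    """Map a token to a column index without any stored vocabulary.

    MD5 is an assumption made for this sketch; scikit-learn uses MurmurHash3.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_features


def hashed_counts(doc, n_features=16):
    """Return a dense list of token counts for a single document."""
    row = [0] * n_features
    # Lowercase and keep tokens of two or more word characters.
    for token in re.findall(r"\b\w\w+\b", doc.lower()):
        row[hash_index(token, n_features)] += 1
    return row


docs = ["Jon is playing football.", "He loves to play football."]
rows = [hashed_counts(doc) for doc in docs]
for row in rows:
    print(row)
```

Each document is processed on its own and nothing is learned from the data, which is why this approach fits streaming or parallel use. The trade-off is that two different tokens can land in the same column, and the original token cannot be recovered from its index.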

Step 1 - Import the necessary libraries

from sklearn.feature_extraction.text import HashingVectorizer

Step 2 - Take a Sample text

Sample_text = ["Jon is playing football.", "He loves to play football.", "He is just 10 years old.", "His favorite player is Cristiano Ronaldo."]
print(Sample_text)
['Jon is playing football.', 'He loves to play football.', 'He is just 10 years old.', 'His favorite player is Cristiano Ronaldo.']

Step 3 - Save the vectorizer in a variable

My_vect = HashingVectorizer(n_features=2**4)

Step 4 - Fit the sample text into the vectorizer

Fit_text = My_vect.fit_transform(Sample_text)

Step 5 - Print the Results

print(Fit_text, '\n')
print(Fit_text.shape)
  (0, 1)	0.5
  (0, 10)	0.5
  (0, 13)	0.5
  (0, 15)	-0.5
  (1, 3)	0.5773502691896258
  (1, 7)	0.5773502691896258
  (1, 10)	0.0
  (1, 11)	0.5773502691896258
  (2, 1)	-0.4082482904638631
  (2, 3)	0.4082482904638631
  (2, 4)	-0.4082482904638631
  (2, 5)	-0.4082482904638631
  (2, 8)	-0.4082482904638631
  (2, 13)	0.4082482904638631
  (3, 0)	0.5
  (3, 2)	-0.5
  (3, 9)	-0.5
  (3, 11)	-0.5
  (3, 13)	0.0 

(4, 16)
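The fractional values in the output come from the default L2 normalization. The negative and zero entries come from the default sign alternation, where colliding tokens with opposite signs can cancel out. As a minimal sketch, assuming scikit-learn is installed, both can be switched off with the norm and alternate_sign parameters to inspect raw counts. The variable names raw_vect and raw_counts are illustrative only.

```python
from sklearn.feature_extraction.text import HashingVectorizer

Sample_text = [
    "Jon is playing football.",
    "He loves to play football.",
    "He is just 10 years old.",
    "His favorite player is Cristiano Ronaldo.",
]

# norm=None disables normalization; alternate_sign=False keeps counts non-negative.
raw_vect = HashingVectorizer(n_features=2**4, alternate_sign=False, norm=None)

# The vectorizer is stateless, so transform() can be called without fit().
raw_counts = raw_vect.transform(Sample_text)

print(raw_counts.shape)
# Total number of hashed tokens per document.
print(raw_counts.toarray().sum(axis=1))
```

With these settings each row sums to the number of tokens of two or more characters in that sentence. With only 16 features collisions are likely, so a larger n_features is usually a better choice for real data.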
