What is a hashing vectorizer?


This recipe explains what a hashing vectorizer is and how to use it.


Recipe Objective

What is a hashing vectorizer?

A hashing vectorizer is a text vectorizer that uses the hashing trick to map each token string to an integer feature index, instead of looking the token up in a stored vocabulary. It converts a collection of text documents into a sparse matrix that holds the token occurrence counts, which may additionally be normalized. The advantages of a hashing vectorizer are:

Because no vocabulary dictionary has to be stored in memory, its memory use stays low even on very large datasets. Because it holds no state learned during fit, it can be used in a streaming or parallel pipeline. For the same reason it is fast to pickle and un-pickle, since it carries nothing besides its constructor parameters.
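To make the hashing trick concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not scikit-learn's implementation: it uses MD5 from the standard library where scikit-learn uses a signed 32-bit MurmurHash3, and the helper names hash_token and hashing_vectorize are hypothetical.

```python
import hashlib
import re


def hash_token(token, n_features):
    """Map a token to (column index, sign) without any stored vocabulary.

    Assumption: MD5 stands in for the MurmurHash3 used by scikit-learn,
    so the indices will not match the library's output.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "little") % n_features
    sign = 1 if digest[4] % 2 == 0 else -1
    return index, sign


def hashing_vectorize(document, n_features=16):
    """Return a dense list of signed token counts for one document."""
    vector = [0] * n_features
    # Lowercase and keep tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", document.lower())
    for token in tokens:
        index, sign = hash_token(token, n_features)
        # Colliding tokens share a column; opposite signs can cancel out.
        vector[index] += sign
    return vector


row = hashing_vectorize("Jon is playing football.")
print(row)
```

The same token always lands in the same column, so two machines can vectorize different documents independently and still produce compatible matrices. The cost is that the mapping cannot be inverted: a column index cannot be turned back into its token.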

Step 1 - Import the necessary libraries

from sklearn.feature_extraction.text import HashingVectorizer

Step 2 - Take a Sample text

Sample_text = ["Jon is playing football.", "He loves to play football.", "He is just 10 years old.", "His favorite player is Cristiano Ronaldo."]
print(Sample_text)
['Jon is playing football.', 'He loves to play football.', 'He is just 10 years old.', 'His favorite player is Cristiano Ronaldo.']

Step 3 - Save the vectorizer in a variable

My_vect = HashingVectorizer(n_features=2**4)

Step 4 - Transform the sample text with the vectorizer

Fit_text = My_vect.fit_transform(Sample_text)
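Since the vectorizer learns nothing during fit, transform can be called directly, and documents can be processed in separate batches. The sketch below checks this on the sample sentences; the batch split and the variable names are chosen only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

sample_text = [
    "Jon is playing football.",
    "He loves to play football.",
    "He is just 10 years old.",
    "His favorite player is Cristiano Ronaldo.",
]

vect = HashingVectorizer(n_features=2**4)

# Vectorize everything at once, without calling fit first.
all_at_once = vect.transform(sample_text).toarray()

# Vectorize two batches separately, as a streaming pipeline might.
first_batch = vect.transform(sample_text[:2]).toarray()
second_batch = vect.transform(sample_text[2:]).toarray()
in_batches = np.vstack([first_batch, second_batch])

# The rows agree because each row depends only on its own document.
print(np.allclose(all_at_once, in_batches))
```

This property is what allows the vectorizer to feed estimators that train incrementally on data too large to fit in memory.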

Step 5 - Print the Results

print(Fit_text, '\n')
print(Fit_text.shape)
  (0, 1)	0.5
  (0, 10)	0.5
  (0, 13)	0.5
  (0, 15)	-0.5
  (1, 3)	0.5773502691896258
  (1, 7)	0.5773502691896258
  (1, 10)	0.0
  (1, 11)	0.5773502691896258
  (2, 1)	-0.4082482904638631
  (2, 3)	0.4082482904638631
  (2, 4)	-0.4082482904638631
  (2, 5)	-0.4082482904638631
  (2, 8)	-0.4082482904638631
  (2, 13)	0.4082482904638631
  (3, 0)	0.5
  (3, 2)	-0.5
  (3, 9)	-0.5
  (3, 11)	-0.5
  (3, 13)	0.0 

(4, 16)
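The values above are not raw counts because, by default, HashingVectorizer scales every row to unit length (norm='l2') and gives each token a sign derived from its hash (alternate_sign=True). A stored 0.0 appears where two tokens collided in the same column with opposite signs. The sketch below, which reuses the sample sentences, switches both options off to recover plain counts; the variable names are chosen only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

sample_text = [
    "Jon is playing football.",
    "He loves to play football.",
    "He is just 10 years old.",
    "His favorite player is Cristiano Ronaldo.",
]

# Default settings: signed values, each row scaled to unit L2 norm.
default_matrix = HashingVectorizer(n_features=2**4).fit_transform(sample_text)
row_norms = np.linalg.norm(default_matrix.toarray(), axis=1)
print(row_norms)

# No normalization and no sign flipping: entries are plain token counts.
count_vect = HashingVectorizer(n_features=2**4, norm=None, alternate_sign=False)
raw_counts = count_vect.fit_transform(sample_text).toarray()
tokens_per_row = raw_counts.sum(axis=1)
print(raw_counts)
print(tokens_per_row)
```

With only 16 columns, collisions are likely even for four short sentences. A larger n_features (the scikit-learn default is 2**20) makes them rare at the cost of a wider, though still sparse, matrix.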
