How to use Count Vectorizer in NLP

This recipe helps you use Count Vectorizer in NLP
Last Updated: 18 Jul 2022

Recipe Objective

How to use count vectorizer? Count Vectorizer converts a collection of text documents into vectors of term (token) counts: for each document, it counts the number of occurrences of each word that appears in it.

For example, take the sentence "I want to go to the park and play the see-saw". The word counts are:

I - 1

want - 1

to - 2

go - 1

the - 2

park - 1

and - 1

play - 1

see-saw - 1

So, from the above example we can see that it counts the occurrences of the words appearing in the text. Note that scikit-learn's CountVectorizer, with its default settings, also lowercases the text and drops single-character tokens such as "I". Let's understand this with a practical example.
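The counting in the example sentence can be reproduced with a few lines of plain Python. This is only a sketch of the idea, assuming simple whitespace splitting; it is not how scikit-learn tokenizes text, and the variable names are illustrative.

```python
from collections import Counter

sentence = "I want to go to the park and play the see-saw"

# Split on whitespace and count how often each word occurs.
word_counts = Counter(sentence.split())

for word, count in word_counts.items():
    print(f"{word} - {count}")
```

Counter keeps the words in the order they are first seen, so the printed list follows the sentence.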

Step 1 - Import necessary libraries

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Step 2 - Take Sample Data

data1 = "I'm designing a document and don't want to get bogged down in what the text actually says"
data2 = "I'm creating a template with various paragraph styles and need to see what they will look like."
data3 = "I'm trying to learn more about some feature of Microsoft Word and don't want to practice on a real document."

Step 3 - Convert Sample Data into DataFrame using pandas

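df1 = pd.DataFrame({'First_Para': [data1], 'Second_Para': [data2], 'Third_Para': [data2]})

Note that Third_Para is built from data2 rather than data3, so the second and third columns of the result are identical. Use 'Third_Para': [data3] to vectorize the third sample instead.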

Step 4 - Initialize the Vectorizer

count_vectorizer = CountVectorizer()
doc_vec = count_vectorizer.fit_transform(df1.iloc[0])

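Here we have initialized the vectorizer, then fit it on the single row of df1 and transformed its three paragraphs into a matrix of token counts, with one row per paragraph and one column per distinct token.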
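To make the fit and transform steps less opaque, here is a rough pure-Python sketch of what they do under scikit-learn's default settings: lowercase the text, keep tokens of two or more word characters, build a sorted vocabulary, and count tokens per document. The helper names (tokenize, fit_vocabulary, transform) are hypothetical and the two sample strings are made up for illustration; the real CountVectorizer returns a sparse matrix and supports many more options.

```python
import re

# Default token pattern of CountVectorizer: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and extract tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Return one row of counts per document as a dense list of lists."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:  # tokens unseen during fit are ignored
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows


docs = ["I don't want to go to the park", "The park is open"]
vocab = fit_vocabulary(docs)
matrix = transform(docs, vocab)
print(vocab)
print(matrix)
```

With these rules "I" disappears and "don't" is reduced to "don", which is the same behaviour visible in the recipe's final table.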

Step 5 - Convert the transformed Data into a DataFrame.

# On scikit-learn older than 1.0, use get_feature_names() instead
df2 = pd.DataFrame(doc_vec.toarray().transpose(), index=count_vectorizer.get_feature_names_out())

Step 6 - Change the Column names and print the result

df2.columns = df1.columns
print(df2)

           First_Para  Second_Para  Third_Para
actually            1            0           0
and                 1            1           1
bogged              1            0           0
creating            0            1           1
designing           1            0           0
document            1            0           0
don                 1            0           0
down                1            0           0
get                 1            0           0
in                  1            0           0
like                0            1           1
look                0            1           1
need                0            1           1
paragraph           0            1           1
says                1            0           0
see                 0            1           1
styles              0            1           1
template            0            1           1
text                1            0           0
the                 1            0           0
they                0            1           1
to                  1            1           1
various             0            1           1
want                1            0           0
what                1            1           1
will                0            1           1
with                0            1           1
