How to find the similarity of a query to every document in Gensim

There is a lot that can be done with an NLP model using Gensim. This recipe explains how to find the similarity of a query document to every document in a corpus using Gensim in Python.

Recipe Objective: How to find the similarity of a query document to every document in the corpus?

You can do many useful things with a model once you have trained it. For example, you can score every document in a corpus by how similar it is to a query:

#importing required libraries
from gensim import similarities
from gensim import models
import gensim
from gensim import corpora

#creating a sample corpus for demonstration purpose
txt_corpus = ["This is sample document",
"Collection of documents make a corpus",
"You can vectorize your corpus"]

#creating a set of frequent words
stoplist = set('for a of the and to in on of to are at'.split(' '))

#lowercasing each document, using white space as delimiter and filtering out the stopwords
processed_text = [[word for word in document.lower().split() if word not in stoplist] for document in txt_corpus]

#creating a dictionary
dictionary = corpora.Dictionary(processed_text)

#using doc2bow for vectorization of the entire corpus
bow_vec = [dictionary.doc2bow(text) for text in processed_text]

#training the model
tfidf_model = models.TfidfModel(bow_vec)

#indexing; num_features must equal the number of unique tokens in the dictionary (12 here)
index = similarities.SparseMatrixSimilarity(tfidf_model[bow_vec], num_features=len(dictionary))

#finding the similarity of the query (sample_document) against every document in the corpus
sample_document = 'sample corpus'.split()
sample_bow = dictionary.doc2bow(sample_document)
simi = index[tfidf_model[sample_bow]]
print(list(enumerate(simi)))

Output:
[(0, 0.4690727), (1, 0.072158165), (2, 0.062832855)]

Document 0 has a similarity score of 0.469 (about 47%), document 1 has a score of about 7%, and document 2 has a score of about 6%. We can make this more readable by sorting:

for document_number, score in sorted(enumerate(simi), key=lambda x: x[1], reverse=True):
  print(document_number, score)

Output:
0 0.4690727
1 0.072158165
2 0.062832855
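
The sorted list only shows document numbers. As a small sketch of how the ranking could be tied back to the original text, the snippet below pairs each score with its document and keeps the best matches. The scores are copied by hand from the output above rather than recomputed, and `top_n` and `results` are hypothetical names introduced only for this illustration.

```python
# Sample corpus from the recipe
txt_corpus = ["This is sample document",
              "Collection of documents make a corpus",
              "You can vectorize your corpus"]

# Similarity scores copied from the output above (not recomputed here)
simi = [0.4690727, 0.072158165, 0.062832855]

# Rank document numbers by score, highest first
ranked = sorted(enumerate(simi), key=lambda pair: pair[1], reverse=True)

# Hypothetical cutoff: keep only the two best matches
top_n = 2
results = [(txt_corpus[doc_id], score) for doc_id, score in ranked[:top_n]]

for text, score in results:
    print(f"{score:.3f}  {text}")
```

In a real pipeline, `simi` would come straight from the similarity index instead of being typed in.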

Document 0 is most similar to the sample document.
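
To see where these numbers come from, the following is a rough pure-Python sketch that recomputes the scores without Gensim. It assumes the TfidfModel defaults (raw term count multiplied by log2(N / document frequency), followed by L2 normalization) and treats the similarity as the cosine between the two normalized vectors. The helper names `tokenize`, `tfidf_vector` and `cosine` are made up for this illustration and are not part of Gensim.

```python
import math
from collections import Counter

txt_corpus = ["This is sample document",
              "Collection of documents make a corpus",
              "You can vectorize your corpus"]
stoplist = set("for a of the and to in on are at".split())

def tokenize(text):
    # Lowercase, split on whitespace, drop stopwords (same as the recipe)
    return [w for w in text.lower().split() if w not in stoplist]

docs = [tokenize(text) for text in txt_corpus]
num_docs = len(docs)

# Document frequency: in how many documents each word appears
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf_vector(tokens):
    # Words unseen in the corpus are ignored, as doc2bow would do
    counts = Counter(t for t in tokens if t in doc_freq)
    weights = {w: c * math.log2(num_docs / doc_freq[w]) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    # L2-normalize so that a dot product equals the cosine similarity
    return {w: v / norm for w, v in weights.items()} if norm else {}

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine
    return sum(v * b.get(w, 0.0) for w, v in a.items())

query = tfidf_vector("sample corpus".split())
scores = [cosine(query, tfidf_vector(doc)) for doc in docs]
print([round(s, 3) for s in scores])
```

Under these assumptions the result agrees with the Gensim output to about three decimal places. Document 0 wins because "sample" occurs only in that document and so carries a high weight, while "corpus" occurs in two of the three documents and contributes little.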
