How to find the similarity of a query to every document in Gensim

Once you have trained an NLP model with Gensim, you can use it to compare documents. This recipe explains how to find the similarity of a query document to every document in a corpus using Gensim in Python.
Last Updated: 01 Jul 2022


Recipe Objective: How to find the similarity of a query document to every document in the corpus?

Once the model is trained, you can use it to score a query against every document in the corpus. For example:

# importing required libraries
from gensim import corpora, models, similarities

# creating a sample corpus for demonstration purposes
txt_corpus = ["This is sample document",
              "Collection of documents make a corpus",
              "You can vectorize your corpus"]

# creating a set of frequent words (stopwords)
stoplist = set('for a of the and to in on are at'.split(' '))

# lowercasing each document, splitting on white space, and filtering out the stopwords
processed_text = [[word for word in document.lower().split() if word not in stoplist]
                  for document in txt_corpus]

# creating a dictionary
dictionary = corpora.Dictionary(processed_text)

# using doc2bow to vectorize the entire corpus
bow_vec = [dictionary.doc2bow(text) for text in processed_text]

# training the TF-IDF model
tfidf_model = models.TfidfModel(bow_vec)

# indexing the corpus; num_features is the vocabulary size (12 for this corpus)
index = similarities.SparseMatrixSimilarity(tfidf_model[bow_vec], num_features=len(dictionary))

# finding the similarity of our sample document against every document in the corpus
sample_document = 'sample corpus'.split()
sample_bow = dictionary.doc2bow(sample_document)
simi = index[tfidf_model[sample_bow]]
print(list(enumerate(simi)))

Output:
[(0, 0.4690727), (1, 0.072158165), (2, 0.062832855)]
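
The indices in this output are positions in txt_corpus. As a small illustration, the sketch below pairs each score with the text it belongs to. The helper name label_scores is hypothetical (it is not part of Gensim), and the scores are copied from the output above so that the snippet runs on its own.

```python
# Sample corpus from the recipe.
txt_corpus = ["This is sample document",
              "Collection of documents make a corpus",
              "You can vectorize your corpus"]

# Scores copied from the output above (stand-in for the `simi` array).
scores = [0.4690727, 0.072158165, 0.062832855]


def label_scores(documents, scores):
    """Hypothetical helper: return (index, text, score) for each document."""
    return [(i, doc, float(s)) for i, (doc, s) in enumerate(zip(documents, scores))]


for i, doc, s in label_scores(txt_corpus, scores):
    print(f"{i}  {s:.3f}  {doc}")
```

In the recipe itself, the simi array could be passed in place of the hard-coded scores list.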

Document 0 has a similarity score of about 0.469 (roughly 47%), document 1 scores about 7%, and document 2 about 6%. We can make this more readable by sorting:

for document_number, score in sorted(enumerate(simi), key=lambda x: x[1], reverse=True):
    print(document_number, score)

Output:
0 0.4690727
1 0.072158165
2 0.062832855

Document 0 is most similar to the sample document.
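
To see where these numbers come from, the scores can be reproduced by hand without Gensim. The sketch below is a plain-Python illustration, not Gensim's implementation. It assumes the default TfidfModel weighting is raw term frequency times log2(N / document frequency), followed by L2 normalisation, and that the index returns cosine similarity. The helper names tokenize, tfidf and cosine are introduced here for illustration only.

```python
import math
from collections import Counter

txt_corpus = ["This is sample document",
              "Collection of documents make a corpus",
              "You can vectorize your corpus"]
stoplist = set("for a of the and to in on are at".split())


def tokenize(text):
    """Lowercase, split on white space, drop stopwords (as in the recipe)."""
    return [w for w in text.lower().split() if w not in stoplist]


docs = [tokenize(d) for d in txt_corpus]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))


def tfidf(tokens):
    """Assumed weighting: tf * log2(N / df), then L2-normalise the vector."""
    counts = Counter(t for t in tokens if t in df)  # ignore unknown words
    weights = {t: c * math.log2(n_docs / df[t]) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors equals their cosine."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


query = tfidf(tokenize("sample corpus"))
scores = [cosine(query, tfidf(doc)) for doc in docs]
print([round(s, 3) for s in scores])  # [0.469, 0.072, 0.063]
```

The word "sample" appears only in document 0, so it carries a high weight and dominates the score. The word "corpus" appears in two of the three documents, so it contributes much less to documents 1 and 2.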
