Explain similarity queries in Gensim

In this recipe, we will learn what similarity queries are in Gensim. We'll also learn how to determine the similarity of two vectors with the help of cosine similarity.

Recipe Objective: Explain similarity queries in Gensim

We'll search a corpus for documents similar to a given query. The first step is to build the corpus. Suppose a user searches for an "important data science role." We want to rank our four corpus documents in order of decreasing relevance to this query. To determine the similarity of two vectors, we shall use cosine similarity.
To prepare for similarity queries, we must first index all of the documents that we wish to compare against the query. They are the same four documents used to train the LSI model, represented in 2-D LSI space. The cosine measure returns similarities in the range [-1, 1]; the higher the score, the greater the similarity.
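Before using Gensim's machinery, it helps to see what cosine similarity actually computes: the dot product of two vectors divided by the product of their norms. The following is a minimal pure-Python sketch of the measure itself, separate from the recipe's Gensim pipeline:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), in the range [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors -> ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 0.0
```

Gensim's similarity index applies this same measure between the query vector and every document vector in the corpus.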


#importing required libraries
from gensim import corpora
from collections import defaultdict
from gensim import similarities
from gensim import models

#documents
docs = ["Classification of animals plays an important role in science",
        "Computer science is a great subject",
        "You can learn data science after learning computer science",
        "Machine learning is very important while learning data science concepts"]

#creating a list of stopwords
stoplist = set('for a of the and to in'.split())

#removing the stop words
txts = [[word for word in document.lower().split() if word not in stoplist] for document in docs]

#calculating frequency of each token
frequency = defaultdict(int)
for text in txts:
    for token in text:
        frequency[token] += 1

#removing words that appear only once
txts = [[token for token in text if frequency[token] > 1] for text in txts]

#creating a dictionary
gensim_dictionary = corpora.Dictionary(txts)

#vectorizing the corpus
gensim_corpus = [gensim_dictionary.doc2bow(text) for text in txts]

#creating LSI model
lsi = models.LsiModel(gensim_corpus, id2word=gensim_dictionary, num_topics=2)

#query
doc = "important data science role"

#creating bow vector
vec_bow = gensim_dictionary.doc2bow(doc.lower().split())

#converting the query to LSI space
vec_lsi = lsi[vec_bow]

print("LSI vector\n", vec_lsi)

#transforming corpus to LSI space and index it
index = similarities.MatrixSimilarity(lsi[gensim_corpus])

#performing a similarity query against the corpus
simil = index[vec_lsi]
simil = sorted(list(enumerate(simil)), key=lambda item: -item[1])

#printing (document_number, document_similarity)
print("Similarity scores for each document\n", simil)

print("Similarity scores with document")
for doc_position, doc_score in simil:
    print(doc_score, docs[doc_position])

Output:
LSI vector
 [(0, 1.2356826632186177), (1, 0.05297180608096186)]
Similarity scores for each document
 [(0, 0.9905478), (2, 0.9335148), (3, 0.91683054), (1, 0.7516155)]
Similarity scores with document
0.9905478 Classification of animals plays an important role in science
0.9335148 You can learn data science after learning computer science
0.91683054 Machine learning is very important while learning data science concepts
0.7516155 Computer science is a great subject

So when a user submits the chosen query, the documents are ranked in the order shown above. The first document has the highest similarity score, 0.9905478, making it the most relevant to the query.
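The sorting step above pairs each raw score with its document index via `enumerate` and orders by descending score. That pattern can be wrapped in a small reusable helper; the `rank_documents` name below is our own, not part of Gensim:

```python
def rank_documents(scores):
    """Pair each score with its corpus index and sort by descending score."""
    return sorted(enumerate(scores), key=lambda item: -item[1])

# Similarity scores in corpus order, taken from the output above.
scores = [0.9905478, 0.7516155, 0.9335148, 0.91683054]
print(rank_documents(scores))
# -> [(0, 0.9905478), (2, 0.9335148), (3, 0.91683054), (1, 0.7516155)]
```

This reproduces the ranked list printed by the recipe and works on any iterable of scores, such as the array returned by `index[vec_lsi]`.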
