Explain similarity queries in Gensim

In this recipe, we will learn what similarity queries are in Gensim. We'll also learn how to determine the similarity of two vectors with the help of cosine similarity.

Recipe Objective: Explain similarity queries in Gensim

We'll search a corpus for documents similar to a given query. The first step is to build the corpus. Suppose a user searches for an "important data science role." We want to rank our four corpus documents in order of decreasing relevance to this query. To determine the similarity of two vectors, we shall use cosine similarity.
To prepare for similarity queries, we must first index all of the documents that we wish to compare against the query. They are the same four documents used to train the LSI model, represented in 2-D LSI space. The cosine measure returns similarities in the range [-1, 1]; the higher the score, the greater the similarity.
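Before using Gensim's machinery, it helps to see what cosine similarity actually computes: the dot product of two vectors divided by the product of their norms. The following is a minimal pure-Python sketch of the measure itself, separate from the recipe's Gensim pipeline:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), in the range [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors -> ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 0.0
```

Gensim's similarity index applies this same measure between the query vector and every document vector in the corpus.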


#importing required libraries
from gensim import corpora
from collections import defaultdict
from gensim import similarities
from gensim import models

#documents
docs = ["Classification of animals plays an important role in science",
        "Computer science is a great subject",
        "You can learn data science after learning computer science",
        "Machine learning is very important while learning data science concepts"]

#creating a list of stopwords
stoplist = set('for a of the and to in'.split())

#removing the stop words
txts = [[word for word in document.lower().split() if word not in stoplist] for document in docs]

#calculating frequency of each token
frequency = defaultdict(int)
for text in txts:
    for token in text:
        frequency[token] += 1

#removing words that appear only once
txts = [[token for token in text if frequency[token] > 1] for text in txts]

#creating a dictionary
gensim_dictionary = corpora.Dictionary(txts)

#vectorizing the corpus
gensim_corpus = [gensim_dictionary.doc2bow(text) for text in txts]

#creating LSI model
lsi = models.LsiModel(gensim_corpus, id2word=gensim_dictionary, num_topics=2)

#query
doc = "important data science role"

#creating bow vector
vec_bow = gensim_dictionary.doc2bow(doc.lower().split())

#converting the query to LSI space
vec_lsi = lsi[vec_bow]

print("LSI vector\n", vec_lsi)

#transforming corpus to LSI space and index it
index = similarities.MatrixSimilarity(lsi[gensim_corpus])

#performing a similarity query against the corpus
simil = index[vec_lsi]
simil = sorted(list(enumerate(simil)), key=lambda item: -item[1])

#printing (document_number, document_similarity)
print("Similarity scores for each document\n", simil)

print("Similarity scores with document")
for doc_position, doc_score in simil:
    print(doc_score, docs[doc_position])

Output:
LSI vector
 [(0, 1.2356826632186177), (1, 0.05297180608096186)]
Similarity scores for each document
 [(0, 0.9905478), (2, 0.9335148), (3, 0.91683054), (1, 0.7516155)]
Similarity scores with document
0.9905478 Classification of animals plays an important role in science
0.9335148 You can learn data science after learning computer science
0.91683054 Machine learning is very important while learning data science concepts
0.7516155 Computer science is a great subject

So when a user submits the chosen query, the documents are ranked in the order shown above. The first document has the highest similarity score, 0.9905478, making it the most relevant to the query.
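The sorting step above pairs each raw score with its document index via `enumerate` and orders by descending score. That pattern can be wrapped in a small reusable helper; the `rank_documents` name below is our own, not part of Gensim:

```python
def rank_documents(scores):
    """Pair each score with its corpus index and sort by descending score."""
    return sorted(enumerate(scores), key=lambda item: -item[1])

# Similarity scores in corpus order, taken from the output above.
scores = [0.9905478, 0.7516155, 0.9335148, 0.91683054]
print(rank_documents(scores))
# -> [(0, 0.9905478), (2, 0.9335148), (3, 0.91683054), (1, 0.7516155)]
```

This reproduces the ranked list printed by the recipe and works on any iterable of scores, such as the array returned by `index[vec_lsi]`.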
