SciPy Cosine Similarity - Formula, Calculation & Implementation

This code example will help you understand SciPy cosine similarity: its formula, calculation methods, and step-by-step implementation.

Cosine similarity addresses several challenges that arise in data science projects, such as working with high-dimensional data, capturing semantic similarity, and scaling to large datasets, which makes it a must-know concept for every data scientist.

 

Understanding the similarity between objects or data points lies at the core of various analytical tasks across diverse domains, from document analysis to recommendation systems. One of the fundamental metrics for measuring similarity is cosine similarity, which is widely used in natural language processing, information retrieval, and recommendation systems to compare two vectors. With SciPy, users can efficiently compute the cosine similarity between two vectors or even between two sets of vectors. This short guide will help you understand the concept of cosine similarity, including its formula and calculation, with the help of examples. So, let's dive in!

What is SciPy Cosine Similarity? 

Cosine similarity is a metric used to measure the similarity between two vectors, irrespective of their magnitude. It calculates the cosine of the angle between the vectors, which reflects how similar they are in direction. This measure is widely used in information retrieval, text mining, and recommendation systems.
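To make the "irrespective of their magnitude" point concrete, here is a minimal sketch. The vectors a and b are illustrative values that do not come from the article, and the example assumes SciPy is installed. Scaling b by 10 changes its length but not its direction, so the similarity should stay the same up to floating-point error.

```python
from scipy.spatial import distance

# Illustrative vectors (assumed values, not from the article)
a = [1.0, 2.0, 3.0]
b = [2.0, 3.0, 4.0]
b_scaled = [10 * value for value in b]  # same direction as b, ten times longer

# SciPy returns cosine distance, so subtract it from 1 to get similarity
sim = 1 - distance.cosine(a, b)
sim_scaled = 1 - distance.cosine(a, b_scaled)

print(sim, sim_scaled)  # the two values should agree up to rounding
```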

SciPy Cosine Similarity Formula 

The formula for computing cosine similarity involves the dot product of the two vectors divided by the product of their Euclidean norms. Mathematically:

 

$$\text{Similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$$

 

Where A and B are vectors, A · B denotes their dot product, and ||A|| and ||B|| represent the Euclidean norms of A and B, respectively.
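As a rough illustration of the formula, the sketch below evaluates each term by hand using only the Python standard library. The values of A and B are assumptions chosen to keep the arithmetic easy to follow.

```python
import math

# Illustrative vectors (assumed values)
A = [1, 2, 3]
B = [4, 5, 6]

# Numerator: dot product A · B = 1*4 + 2*5 + 3*6 = 32
dot = sum(a * b for a, b in zip(A, B))

# Denominator: product of the Euclidean norms, sqrt(14) * sqrt(77)
norm_a = math.sqrt(sum(a * a for a in A))
norm_b = math.sqrt(sum(b * b for b in B))

similarity = dot / (norm_a * norm_b)
print(similarity)  # roughly 0.9746
```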

How to Calculate Cosine Similarity? - Step-by-Step Guide 

This step-by-step guide provides a comprehensive walkthrough on calculating cosine similarity, a widely used measure in text mining and information retrieval that facilitates easy comparison between documents or vectors. 

Step 1 - Import the library

 

from scipy import spatial

Let's pause and look at this import. We have imported the spatial submodule from the SciPy package; its distance module provides the cosine function used in the steps that follow. SciPy contains many scientific routines, such as solvers for differential equations.

Step 2 - Set Up the Data

 

x = [1, 2, 3]

y = [-1, -2, -3]

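Here we create two vectors as Python lists. Note that y is x multiplied by -1, so the two vectors point in exactly opposite directions.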

Step 3 - Calculating Cosine Similarity

 

z = 1 - spatial.distance.cosine(x, y)

spatial.distance.cosine returns the cosine distance, so we subtract it from 1 to get the cosine similarity.

Step 4 - Printing Results

 

print(z)

Simply use the print function to display the cosine similarity stored in z.

Step 5 - Let's look at the output now

Once we run the above code snippet, we will see:

-1.0
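The same calculation extends from a single pair to sets of vectors. The sketch below is one way to do it, assuming scipy.spatial.distance.cdist with metric="cosine" and two small arrays whose values are made up for illustration. Each row is one vector, and the result holds one similarity per pair.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small sets of row vectors (assumed values, for illustration only)
set_a = np.array([[1, 2, 3],
                  [1, 0, 0]])
set_b = np.array([[-1, -2, -3],
                  [2, 4, 6],
                  [0, 1, 0]])

# cdist returns pairwise cosine distances; subtract from 1 for similarities
similarities = 1 - cdist(set_a, set_b, metric="cosine")

print(similarities.shape)  # one row per vector in set_a, one column per vector in set_b
print(similarities)
```

Entry [i, j] of the result compares row i of set_a with row j of set_b, so the first row starts with a value near -1 (opposite vectors) followed by a value near 1 (same direction).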

How to Implement Cosine Similarity Between Two Vectors in Python? 

The cosine similarity between two vectors in Python can be implemented efficiently using NumPy. First, ensure both vectors are represented as NumPy arrays. Then, calculate the dot product of the two vectors using numpy.dot(). Next, compute the magnitude of each vector using numpy.linalg.norm(). Finally, divide the dot product by the product of the magnitudes to obtain the cosine similarity.
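The steps above can be sketched as follows. The helper name cosine_similarity and the zero-vector check are assumptions added for illustration; they are not part of NumPy or of the original article.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity of two 1-D vectors (hypothetical helper)."""
    # Step 1: represent both vectors as arrays
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Step 3: product of the two magnitudes
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0:
        # Assumed behavior: the angle is undefined for a zero vector
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Steps 2 and 4: dot product divided by the product of the magnitudes
    return float(np.dot(u, v) / norm_product)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # close to 1: same direction
```

Raising an error for a zero vector is a design choice in this sketch; returning NaN would be another reasonable option.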

Cosine Similarity Example

 

Check out the Python example below demonstrating how to use SciPy to calculate the cosine similarity between two vectors:

[Image: Python Cosine Similarity]

 

A result close to 1 indicates a high similarity between the two vectors. Remember that cosine similarity values range from -1 to 1, where 1 indicates vectors pointing in the same direction, 0 indicates orthogonal (unrelated) vectors, and -1 indicates exactly opposite vectors.
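The three reference points of that range can be checked with a short sketch. The 2-D vectors below are assumed values picked so that each case is easy to see by eye.

```python
from scipy.spatial import distance

reference = [1, 0]

# Illustrative comparison vectors (assumed values)
cases = {
    "same direction": [3, 0],
    "orthogonal": [0, 5],
    "opposite": [-2, 0],
}

# Cosine similarity = 1 - cosine distance
results = {name: 1 - distance.cosine(reference, vec) for name, vec in cases.items()}

for name, value in results.items():
    print(f"{name}: {value:.1f}")
```

The loop should print values of 1.0, 0.0, and -1.0 for the three cases, matching the ends and midpoint of the range.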

Cosine Distance vs. Cosine Similarity: The Difference 

Cosine distance and cosine similarity are fundamental concepts in natural language processing (NLP) and are crucial for tasks like semantic similarity measurement and clustering. Cosine similarity quantifies the similarity between two vectors by calculating the cosine of the angle between them, ranging from -1 to 1. A value of -1 indicates vectors pointing in opposite directions, 0 indicates orthogonal vectors, and 1 indicates vectors pointing in the same direction. Cosine distance, on the other hand, is derived from cosine similarity as 1 minus the similarity, so it measures the dissimilarity between vectors and ranges from 0 to 2. It complements cosine similarity by emphasizing differences rather than similarities. The choice between cosine similarity and cosine distance in practical applications depends on the task. Normalization techniques can also affect cosine similarity calculations; for instance, Z-score normalization shifts the data by its mean and rescales it by its standard deviation, and the shift changes the angles between vectors, so the results differ.
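A small sketch can show both points: distance as 1 minus similarity, and the effect of Z-score normalization. The helpers cosine_similarity and zscore are hypothetical, the vectors are assumed values, and standardizing each vector by its own mean and standard deviation is just one way normalization might be applied.

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def zscore(v):
    # Assumption: standardize a vector with its own mean and standard deviation
    return (v - v.mean()) / v.std()

# Illustrative vectors (assumed values)
u = np.array([1.0, 2.0, 3.0])
v = np.array([3.0, 2.0, 1.0])

raw_sim = cosine_similarity(u, v)                  # 10 / 14, roughly 0.714
std_sim = cosine_similarity(zscore(u), zscore(v))  # roughly -1 after centering

print("raw similarity:", raw_sim, "distance:", 1 - raw_sim)
print("z-scored similarity:", std_sim, "distance:", 1 - std_sim)
```

Before normalization the vectors look fairly similar; after subtracting each mean they point in opposite directions, which pushes the distance toward its maximum of 2.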

 

Advance your Python Skills with ProjectPro! 

We've seen how cosine similarity is a robust measure for quantifying the similarity between vectors, making it invaluable for tasks like document comparison, content recommendation, and clustering of similar items. Calculating the cosine similarity between vectors representing documents can help you efficiently identify similarities in their content, aiding in tasks such as plagiarism detection or document clustering. Furthermore, real-world examples, such as comparing textual data and user-item interactions in recommendation systems, showcase the practical utility of cosine similarity in various domains. Hands-on practice with real-world Python projects is crucial for mastering its implementation and gaining valuable insights into data analysis and machine learning. ProjectPro offers guided projects that cover topics such as SciPy's cosine similarity function, so you can gain hands-on experience with practical applications. So, check out the ProjectPro Repository to solidify your understanding, enhance your skills, and confidently apply cosine similarity in your data science projects.

