What is Jaccard similarity and how to calculate it?

This recipe explains what Jaccard similarity is and how to calculate it.


Recipe Objective

Jaccard similarity is defined as the size of the intersection of two sets divided by the size of their union. It therefore always lies between 0 and 1. In layman's terms, it is the area of overlap divided by the area of union.
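The definition can be written directly with Python's built-in set operators. The following is a minimal sketch, assuming Python 3; the helper name `jaccard_from_sets` and the sample sets are illustrative and not part of the recipe. It shows the two ends of the range: identical sets give 1 and disjoint sets give 0.

```python
def jaccard_from_sets(a, b):
    """Size of the intersection divided by the size of the union (hypothetical helper)."""
    # `&` is set intersection, `|` is set union
    return len(a & b) / len(a | b)

identical = jaccard_from_sets({1, 2, 3}, {1, 2, 3})   # every element shared
disjoint = jaccard_from_sets({1, 2, 3}, {4, 5, 6})    # no element shared
partial = jaccard_from_sets({1, 2, 3}, {2, 3, 4})     # 2 shared out of 4 distinct

print(identical, disjoint, partial)
```

Any pair of non-empty sets falls somewhere between these two extremes.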

This recipe is a short example of what Jaccard similarity is and how to calculate it. Let's get started.

Step 1 - Set up the Data

x = ['Ram', 'Shyam', 'Rohan']
y = ['Ram', 'Rohan', 'Ganesh']

Here we create two lists that have two elements in common.

Step 2 - Defining Jaccard function

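def jaccard(x, y):
    # Convert to sets so repeated list entries are not counted twice
    a, b = set(x), set(y)
    z = a.intersection(b)
    # |A union B| = |A| + |B| - |A intersect B|
    return float(len(z)) / (len(a) + len(b) - len(z))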

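The function relies on the identity that the size of the union equals the sum of the sizes of the two collections minus the size of their intersection. When two lists are passed as arguments, it returns the intersection size divided by that union size.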
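One case the formula leaves open is two empty inputs, where the union is empty and the division fails. A possible variant is sketched below, assuming Python 3 and assuming that two empty collections should count as identical (a convention chosen here, not something the recipe specifies). It also converts the inputs to sets first so that repeated list entries are not counted twice. The name `jaccard_safe` is hypothetical.

```python
def jaccard_safe(x, y):
    """Jaccard similarity of two iterables, tolerant of duplicates and empty input."""
    a, b = set(x), set(y)            # drop repeated entries
    inter = len(a & b)
    # Inclusion-exclusion: |A union B| = |A| + |B| - |A intersect B|
    union = len(a) + len(b) - inter
    if union == 0:
        # Assumption: treat two empty collections as identical
        return 1.0
    return inter / union

with_duplicates = jaccard_safe(['Ram', 'Ram', 'Shyam'], ['Ram', 'Shyam'])
both_empty = jaccard_safe([], [])
print(with_duplicates, both_empty)
```

Returning 0.0 for two empty inputs would be an equally defensible choice; what matters is picking one convention and applying it consistently.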

Step 3 - Calling function and printing results

z = jaccard(x, y)
print(z)

First call the jaccard function and store the return value in a variable (here, z). Then use the print function to display the result.

Step 4 - Let's look at the result

Once we run the above code snippet, we will see:

0.5

In this example, the intersection contains 2 elements (Ram and Rohan) and the union contains 4 elements (Ram, Shyam, Rohan and Ganesh). So the Jaccard similarity is 2/4, i.e. 0.5.
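To check the count by hand, the intersection and union can be printed explicitly. This is a small sketch, assuming Python 3, that reuses the lists from Step 1; the variable names `common` and `combined` are illustrative.

```python
x = ['Ram', 'Shyam', 'Rohan']
y = ['Ram', 'Rohan', 'Ganesh']

common = set(x) & set(y)      # elements present in both lists
combined = set(x) | set(y)    # elements present in either list

# Sets are unordered, so sort them for a stable display
print(sorted(common))
print(sorted(combined))
print(len(common) / len(combined))
```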
