How to get similar word vectors of a word using fasttext

This recipe helps you get similar word vectors of a word using fasttext
Last Updated: 11 Nov 2021

Recipe Objective: How to get similar word vectors of a word using fasttext?

This recipe explains how to get similar word vectors of a word using fasttext.

Step 1: Importing library

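Let us first import the necessary libraries and download the punkt, stopwords, and wordnet resources using nltk.download.

import re
import nltk
import fasttext
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Resources needed later for tokenization, stop-word removal, and lemmatization
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')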

Step 2: Data Set

We have used the BBC news data set with five unique tags: business, tech, politics, sport, and entertainment. We plot the number of samples in each category using seaborn.countplot.

Step 3: Data Cleaning

Data cleaning is an essential part of NLP, so we will clean the data using the following steps.

Tokenization is essentially splitting an input value into smaller units, and each of these smaller units is called a token. It is the first step of data cleaning. This is important because the meaning of the text could easily be understood by analyzing the words present in the text.

data['text_clean'] = data['Text'].apply(nltk.word_tokenize)
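To make the idea of a token concrete, here is a minimal sketch that uses only the standard library. It is not how nltk.word_tokenize works internally; `simple_tokenize` is a hypothetical helper that approximates the behavior with a regular expression, and the sample sentence is made up.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+ grabs runs of letters/digits/underscores; [^\w\s] grabs one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Labour plans a 5% tax rise, says the BBC.")
print(tokens)
# ['Labour', 'plans', 'a', '5', '%', 'tax', 'rise', ',', 'says', 'the', 'BBC', '.']
```

Numbers and punctuation survive tokenization as separate tokens, which is why the later cleaning steps filter them out.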

Stop words are a list of prevalent but uninformative words that you want to ignore. For tasks like text classification, where the text is to be classified into different categories, stopwords are removed or excluded from the given text. More focus can be given to those words that define the meaning of the text.

stop_words = set(nltk.corpus.stopwords.words("english"))
data['text_clean'] = data['text_clean'].apply(lambda x: [item for item in x if item not in stop_words])
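One detail worth keeping in mind is that NLTK's English stop words are all lowercase, so a capitalized token such as "The" passes a plain membership test. The sketch below illustrates a case-insensitive filter. It uses a tiny hand-written stop list rather than NLTK's, and `remove_stop_words` is a hypothetical helper, not part of the recipe.

```python
# Tiny hand-written stop list, for illustration only (NLTK's English list is much longer)
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "price", "of", "oil", "is", "rising"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['price', 'oil', 'rising']
```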

Numbers, punctuation, and special characters add noise to the text and are of no use; also, they take unnecessary space in the memory, so we have to remove them.

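regex = r'[a-z]+'
# fullmatch keeps only tokens made up entirely of lowercase letters;
# tokens are not lowercased first, so capitalized tokens are dropped as well
data['text_clean'] = data['text_clean'].apply(lambda x: [item for item in x if re.fullmatch(regex, item)])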
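When filtering tokens with a pattern like `[a-z]+`, the choice of matching function matters: `re.match` only anchors at the start of a token, whereas `re.fullmatch` requires the whole token to match. The following sketch compares the two on a few made-up tokens.

```python
import re

PATTERN = r"[a-z]+"
tokens = ["labour", "covid19", "3rd", "%", "tax", "Budget"]

# re.match succeeds as soon as the token starts with a lowercase letter
starts_with_letters = [t for t in tokens if re.match(PATTERN, t)]
# re.fullmatch succeeds only if every character is a lowercase letter
only_letters = [t for t in tokens if re.fullmatch(PATTERN, t)]

print(starts_with_letters)  # ['labour', 'covid19', 'tax']
print(only_letters)         # ['labour', 'tax']
```

Both variants drop "Budget" because the pattern has no uppercase range; lowercasing tokens before filtering would keep such words.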

Lemmatization maps the different inflected forms of a word to one common base form, called the lemma. It reduces inflected words properly, ensuring that the resulting root word belongs to the language. Here each token is lemmatized as a verb (pos='v').

lem = nltk.stem.wordnet.WordNetLemmatizer()
data['text_clean'] = data['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])

Step 4: Similar Word Vectors

In the cbow (“continuous-bag-of-words”) model, a central word is surrounded by context words. Given the context words, the model learns to identify the central word. It maximizes the probability of that word based on the words that co-occur with it within a distance of n.
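The training examples that a CBOW-style model sees can be sketched as (context, target) pairs. This is only an illustration of the windowing idea, not fastText's actual training loop; `cbow_pairs` and the sample sentence are hypothetical.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]     # up to `window` words before
        right = tokens[i + 1:i + 1 + window]    # up to `window` words after
        pairs.append((left + right, target))
    return pairs

sentence = ["labour", "party", "wins", "local", "election"]
pairs = cbow_pairs(sentence, window=2)
for context, target in pairs:
    print(context, "->", target)
```

For the middle word "wins", the context is the two words on either side; words near the edges of the sentence get a shorter context.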

model = fasttext.train_unsupervised('Solution.csv', model='cbow')
model.save_model("model.bin")
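The training call above reads `Solution.csv`, but the recipe does not show how that file is produced. fastText's unsupervised training reads a plain text file, so one plausible approach, shown here purely as an assumption, is to write each cleaned document as a line of space-separated tokens. The file name `corpus.txt`, the helper `write_corpus`, and the sample documents are hypothetical.

```python
import os
import tempfile

# Hypothetical cleaned documents, standing in for data['text_clean']
cleaned_docs = [
    ["labour", "announce", "tax", "plan"],
    ["market", "rise", "oil", "price"],
]

def write_corpus(docs, path):
    """Write one space-separated document per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as fh:
        for tokens in docs:
            fh.write(" ".join(tokens) + "\n")

corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
write_corpus(cleaned_docs, corpus_path)

# Read the file back to confirm its layout
with open(corpus_path, encoding="utf-8") as fh:
    lines = fh.read().splitlines()
print(lines)
```

A file written this way could then be passed to fasttext.train_unsupervised as its input path.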

get_subwords: Given a word, get its subwords and their indices
get_subword_id: Given a subword, return the index (within the input matrix) it hashes to
get_word_vector: Get the vector representation of a word
get_word_id: Given a word, get the word id within the dictionary
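To see what a "subword" is, recall that fastText wraps each word in `<` and `>` and breaks it into character n-grams, by default of length 3 to 6 for unsupervised models. The sketch below reproduces that idea in plain Python. It is a simplification: `char_ngrams` is a hypothetical helper, and the ordering and hashing of the real library are not modeled.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return character n-grams of the word wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

grams = char_ngrams("labour")
print(grams[:6])   # the 3-grams: ['<la', 'lab', 'abo', 'bou', 'our', 'ur>']
print(len(grams))  # 18
```

Because vectors are built from these pieces, a word that was never seen during training can still receive a vector from the n-grams it shares with known words.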

model.get_word_vector("labour")
model.get_subwords("labour")
model.get_subword_id("labour")
model.get_word_id("labour")
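The calls above return the vector and subword information for a single word. To find similar words, the fastText Python module also offers model.get_nearest_neighbors(word, k), which returns (score, word) pairs. The underlying idea is ranking by cosine similarity, sketched here in plain Python. The three-dimensional vectors are made up for illustration and `nearest_neighbors` is a hypothetical helper, not the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, vectors, k=2):
    """Rank all other words by cosine similarity to the query word."""
    scores = [(cosine(vectors[query], vec), word)
              for word, vec in vectors.items() if word != query]
    return sorted(scores, reverse=True)[:k]

# Made-up 3-dimensional vectors, for illustration only
toy_vectors = {
    "labour": [0.9, 0.1, 0.0],
    "tory":   [0.8, 0.2, 0.1],
    "goal":   [0.0, 0.1, 0.9],
    "match":  [0.1, 0.0, 0.8],
}

neighbors = nearest_neighbors("labour", toy_vectors, k=2)
print([word for _, word in neighbors])  # ['tory', 'match']
```

With the trained model from this recipe, the equivalent call would be model.get_nearest_neighbors("labour", k=5).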
