How to train a cbow model on a corpus of data using fasttext

This recipe helps you train a cbow model on a corpus of data using fasttext

Recipe Objective: How to train a cbow model on a corpus of data using fasttext?

This recipe explains how to train a cbow model on a corpus of data using fasttext.

Step 1: Importing library

Let us first import the necessary libraries and download punkt, stopwords, wordnet using nltk.download

import re
import nltk
import fasttext
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Data Set

We have used the BBC news data set with six unique tags: business, tech, politics, sport, and entertainment. We are plotting several samples in each category by using seaborn.countplot

Step 3: Data Cleaning

Data Cleaning is an essential part of NLP, so we’ll be cleaning the data by following steps.

Tokenization is essentially splitting an input value into smaller units, and each of these smaller units is called a token. It is the first step of data cleaning. This is important because the meaning of the text could easily be understood by analyzing the words present in the text.

data['text_clean'] = data['Text'].apply(nltk.word_tokenize)

Stop words are a list of prevalent but uninformative words that you want to ignore. For tasks like text classification, where the text is to be classified into different categories, stopwords are removed or excluded from the given text. More focus can be given to those words that define the meaning of the text.

stop_words=set(nltk.corpus.stopwords.words("english"))
data['text_clean'] = data['text_clean'].apply(lambda x: [item for item in x if item not in stop_words])

Numbers, punctuation, and special characters add noise to the text and are of no use; also, they take unnecessary space in the memory, so we have to remove them.

regex = '[a-z]+'
data['text_clean'] = data['text_clean'].apply(lambda x: [item for item in x if re.match(regex, item)])

Lemmatization groups different inflected forms of the word called lemma and maps these words into one common root. It reduces the inflected words properly, ensuring that the root word belongs to the language.

lem = nltk.stem.wordnet.WordNetLemmatizer()
data['text_clean'] = data['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])

Step 3: cbow model

In cbow (“continuous-bag-of-words”) model, a central word is surrounded by a context word. Given the context, word identify the central word. It maximizes the probability of the word based on the word co-occurrences within a distance of n.

model = fasttext.train_unsupervised('Solution.csv', model='cbow')
model.save_model("model.bin")

What Users are saying..

profile image

Jingwei Li

Graduate Research assistance at Stony Brook University
linkedin profile url

ProjectPro is an awesome platform that helps me learn much hands-on industrial experience with a step-by-step walkthrough of projects. There are two primary paths to learn: Data Science and Big Data.... Read More

Relevant Projects

Learn to Build Generative Models Using PyTorch Autoencoders
In this deep learning project, you will learn how to build a Generative Model using Autoencoders in PyTorch

Model Deployment on GCP using Streamlit for Resume Parsing
Perform model deployment on GCP for resume parsing model using Streamlit App.

Build a Autoregressive and Moving Average Time Series Model
In this time series project, you will learn to build Autoregressive and Moving Average Time Series Models to forecast future readings, optimize performance, and harness the power of predictive analytics for sensor data.

Build a Music Recommendation Algorithm using KKBox's Dataset
Music Recommendation Project using Machine Learning - Use the KKBox dataset to predict the chances of a user listening to a song again after their very first noticeable listening event.

End-to-End Speech Emotion Recognition Project using ANN
Speech Emotion Recognition using RAVDESS Audio Dataset - Build an Artificial Neural Network Model to Classify Audio Data into various Emotions like Sad, Happy, Angry, and Neutral

PyCaret Project to Build and Deploy an ML App using Streamlit
In this PyCaret Project, you will build a customer segmentation model with PyCaret and deploy the machine learning application using Streamlit.

Build a Hybrid Recommender System in Python using LightFM
In this Recommender System project, you will build a hybrid recommender system in Python using LightFM .

AWS MLOps Project to Deploy a Classification Model [Banking]
In this AWS MLOps project, you will learn how to deploy a classification model using Flask on AWS.

Deploy Transformer BART Model for Text summarization on GCP
Learn to Deploy a Machine Learning Model for the Abstractive Text Summarization on Google Cloud Platform (GCP)

Time Series Classification Project for Elevator Failure Prediction
In this Time Series Project, you will predict the failure of elevators using IoT sensor data as a time series classification machine learning problem.