How to train a CBOW model on a corpus of data using fastText

Recipe Objective: How to train a CBOW model on a corpus of data using fastText?

This recipe explains how to train a CBOW model on a corpus of data using fastText.

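Step 1: Importing libraries

Let us first import the necessary libraries and download the punkt, stopwords, and wordnet resources using nltk.download.

import re
import nltk
import fasttext
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Resources used by the tokenizer, the stop word list, and the lemmatizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')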

Step 2: Data Set

We use the BBC news data set, which has five unique tags: business, tech, politics, sport, and entertainment. We plot the number of samples in each category using seaborn.countplot.

Step 3: Data Cleaning

Data cleaning is an essential part of NLP, so we’ll clean the data with the following steps.

Tokenization splits the input text into smaller units, and each of these units is called a token. It is the first step of data cleaning. This is important because the meaning of a text is easier to analyze once it has been broken down into the words it contains.

data['text_clean'] = data['Text'].apply(nltk.word_tokenize)

Stop words are prevalent but uninformative words that you want to ignore. For tasks like text classification, where the text is assigned to one of several categories, stop words are removed from the text so that more weight falls on the words that define its meaning.

stop_words = set(nltk.corpus.stopwords.words("english"))
data['text_clean'] = data['text_clean'].apply(lambda x: [item for item in x if item not in stop_words])
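One detail of the filter above is worth checking on your own data: the membership test is case-sensitive, and the NLTK English stop word list is lowercase. The following is a minimal sketch of that behavior. The stop word set and the tokens here are small hypothetical stand-ins, not the NLTK list or the BBC data.

```python
# Small hypothetical stand-in for the NLTK stop word list (which is lowercase)
stop_words = {'the', 'is', 'a', 'of'}

# Hypothetical tokens, chosen only to illustrate the filter
tokens = ['The', 'share', 'price', 'of', 'the', 'firm', 'is', 'up']

# Same test as in the recipe: 'The' survives because of its capital letter
as_is = [t for t in tokens if t not in stop_words]

# Comparing the lowercased token also catches capitalized stop words
lowered = [t for t in tokens if t.lower() not in stop_words]

print(as_is)
print(lowered)
```

If the text column is already lowercase, both variants give the same result and the recipe's line can be used unchanged.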

Numbers, punctuation, and special characters add noise to the text and are of no use; also, they take unnecessary space in the memory, so we have to remove them.

regex = '[a-z]+'
data['text_clean'] = data['text_clean'].apply(lambda x: [item for item in x if re.match(regex, item)])
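Note that re.match only checks the beginning of each token, so a token passes the filter above as soon as it starts with a lowercase letter. The sketch below compares this with re.fullmatch, which requires the whole token to consist of lowercase letters. The tokens are hypothetical examples; which variant is preferable depends on whether tokens such as hyphenated words should be kept.

```python
import re

regex = '[a-z]+'  # same pattern as in the recipe

# Hypothetical tokens, chosen only to illustrate the filter
tokens = ['profit', '2004', ',', 'u.s.', 'mid-term', 'Profit', "'s"]

# re.match anchors only at the start of the token
kept_match = [t for t in tokens if re.match(regex, t)]

# re.fullmatch requires every character to be a lowercase letter
kept_full = [t for t in tokens if re.fullmatch(regex, t)]

print(kept_match)
print(kept_full)
```

With re.match, 'u.s.' and 'mid-term' are kept while 'Profit' is dropped because it starts with a capital letter. With re.fullmatch, only 'profit' remains.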

Lemmatization groups the different inflected forms of a word and maps them to one common base form, called the lemma. Unlike cruder suffix stripping, it ensures that the resulting base form is a valid word of the language.

lem = nltk.stem.wordnet.WordNetLemmatizer()
data['text_clean'] = data['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])

Step 4: CBOW model

In the CBOW (“continuous bag of words”) model, a central word is surrounded by context words. Given the context words, the model predicts the central word. Training maximizes the probability of the central word based on the words that co-occur with it within a distance of n.
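To make the idea of context and central word concrete, here is a minimal sketch that lists the (context, target) pairs a CBOW-style model would learn from for one sentence. The function name cbow_pairs and the sample sentence are hypothetical, and the sketch only illustrates the windowing. fastText's own training adds details, such as subword n-grams, that are left out here. In fastText, the window size corresponds to the ws argument of train_unsupervised.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a list of tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

# Hypothetical cleaned sentence
sentence = ['bank', 'raise', 'interest', 'rate', 'again']

for context, target in cbow_pairs(sentence, window=2):
    print(context, '->', target)
```

For the word 'interest', the context with a window of 2 is 'bank', 'raise', 'rate', and 'again'. Words at the edges of the sentence simply have a shorter context.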

model = fasttext.train_unsupervised('Solution.csv', model='cbow')
model.save_model("model.bin")
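The training call reads its corpus from a text file on disk, and the recipe does not show how that file was produced. One plausible way, sketched below under that assumption, is to write each cleaned document as one line of space-separated tokens. The function name write_corpus, the file name corpus.txt, and the sample documents are hypothetical.

```python
import os
import tempfile

def write_corpus(token_lists, path):
    """Write one document per line, with tokens separated by single spaces."""
    with open(path, 'w', encoding='utf-8') as f:
        for tokens in token_lists:
            f.write(' '.join(tokens) + '\n')

# Hypothetical stand-in for data['text_clean']
cleaned = [['bank', 'raise', 'rate'], ['team', 'win', 'final']]

# Hypothetical output location
corpus_path = os.path.join(tempfile.gettempdir(), 'corpus.txt')
write_corpus(cleaned, corpus_path)

# Read the file back to check its layout
with open(corpus_path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

With the real data, data['text_clean'] would take the place of cleaned, and the resulting path could be passed to fasttext.train_unsupervised in place of 'Solution.csv'.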
