How to fine tune a fasttext model

This recipe helps you fine tune a fasttext model

Recipe Objective: How to fine-tune a fasttext model?

This recipe explains how to fine-tune a fasttext model.

Step 1: Importing library

Let us first import the necessary libraries and download punkt, stopwords, wordnet using nltk.download

import re
import nltk
import fasttext
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Data Set

We have used the BBC news data set with six unique tags: business, tech, politics, sport, and entertainment. We are plotting several samples in each category by using seaborn.countplot

Step 3: Data Cleaning

Data Cleaning is an essential part of NLP, so we’ll be cleaning the data by following steps.

Tokenization is essentially splitting an input value into smaller units, and each of these smaller units is called a token. It is the first step of data cleaning. This is important because the meaning of the text could easily be understood by analyzing the words present in the text.

data['text_clean'] = data['Text'].apply(nltk.word_tokenize)

Stop words are a list of prevalent but uninformative words that you want to ignore. For tasks like text classification, where the text is to be classified into different categories, stopwords are removed or excluded from the given text. More focus can be given to those words that define the meaning of the text.

stop_words=set(nltk.corpus.stopwords.words("english"))
data['text_clean'] = data['text_clean'].apply(lambda x: [item for item in x if item not in stop_words])

Numbers, punctuation, and special characters add noise to the text and are of no use; also, they take unnecessary space in the memory, so we have to remove them.

regex = '[a-z]+'
data['text_clean'] = data['text_clean'].apply(lambda x: [item for item in x if re.match(regex, item)])

Lemmatization groups different inflected forms of the word called lemma and maps these words into one common root. It reduces the inflected words properly, ensuring that the root word belongs to the language.

lem = nltk.stem.wordnet.WordNetLemmatizer()
data['text_clean'] = data['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])

Step 4: Autotune a custom metric

Fasttext autotune feature allows you to find the best hyperparameter for your dataset automatically. Hyperparameters are always fine-tuned.

model = fasttext.train_supervised(input='Solution.csv', autotuneValidationFile='BBC News Test.csv', autotuneDuration=600)

What Users are saying..

profile image

Gautam Vermani

Data Consultant at Confidential
linkedin profile url

Having worked in the field of Data Science, I wanted to explore how I can implement projects in other domains, So I thought of connecting with ProjectPro. A project that helped me absorb this topic... Read More

Relevant Projects

Build a Collaborative Filtering Recommender System in Python
Use the Amazon Reviews/Ratings dataset of 2 Million records to build a recommender system using memory-based collaborative filtering in Python.

Build a Multi Touch Attribution Machine Learning Model in Python
Identifying the ROI on marketing campaigns is an essential KPI for any business. In this ML project, you will learn to build a Multi Touch Attribution Model in Python to identify the ROI of various marketing efforts and their impact on conversions or sales..

Deep Learning Project for Text Detection in Images using Python
CV2 Text Detection Code for Images using Python -Build a CRNN deep learning model to predict the single-line text in a given image.

Build a Autoregressive and Moving Average Time Series Model
In this time series project, you will learn to build Autoregressive and Moving Average Time Series Models to forecast future readings, optimize performance, and harness the power of predictive analytics for sensor data.

Learn to Build a Siamese Neural Network for Image Similarity
In this Deep Learning Project, you will learn how to build a siamese neural network with Keras and Tensorflow for Image Similarity.

Loan Default Prediction Project using Explainable AI ML Models
Loan Default Prediction Project that employs sophisticated machine learning models, such as XGBoost and Random Forest and delves deep into the realm of Explainable AI, ensuring every prediction is transparent and understandable.

Word2Vec and FastText Word Embedding with Gensim in Python
In this NLP Project, you will learn how to use the popular topic modelling library Gensim for implementing two state-of-the-art word embedding methods Word2Vec and FastText models.

End-to-End ML Model Monitoring using Airflow and Docker
In this MLOps Project, you will learn to build an end to end pipeline to monitor any changes in the predictive power of model or degradation of data.

Expedia Hotel Recommendations Data Science Project
In this data science project, you will contextualize customer data and predict the likelihood a customer will stay at 100 different hotel groups.

Build a Churn Prediction Model using Ensemble Learning
Learn how to build ensemble machine learning models like Random Forest, Adaboost, and Gradient Boosting for Customer Churn Prediction using Python