How to autotune a custom metric in fastText

Recipe Objective: How to autotune a custom metric in fastText?

This recipe explains how to autotune a custom metric in fastText, that is, how to make fastText's hyperparameter search optimize a metric you choose, such as the F1 score of a single label.

Step 1: Importing libraries

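Let us first import the necessary libraries and download the punkt, stopwords, and wordnet resources using nltk.download.

import re
import nltk
import fasttext
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Download the NLTK resources used in the cleaning steps
nltk.download('punkt')      # tokenizer models for nltk.word_tokenize
nltk.download('stopwords')  # English stop word list
nltk.download('wordnet')    # lexical database for WordNetLemmatizer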

Step 2: Data Set

We use the BBC news data set, which has five unique tags: business, tech, politics, sport, and entertainment. We plot the number of samples in each category using seaborn.countplot.
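The recipe does not show how the data is loaded, so the following is only a minimal sketch of counting the samples per category. The column name "Category" and the file name in the comment are assumptions (only the "Text" column appears in the later steps), and a tiny in-memory frame stands in for the real file so that the snippet runs on its own.

```python
import pandas as pd

def category_counts(data, label_col="Category"):
    """Return the number of rows per category, largest first."""
    return data[label_col].value_counts()

# Tiny in-memory stand-in for the real data. With the actual file you would
# load it instead, e.g. data = pd.read_csv("BBC News Train.csv") (assumed name).
data = pd.DataFrame({
    "Text": [
        "shares rise on strong profits",
        "new phone unveiled",
        "striker scores twice",
        "film wins award",
        "tv drama renewed",
    ],
    "Category": ["business", "tech", "sport", "entertainment", "entertainment"],
})

counts = category_counts(data)
print(counts)

# The bar chart described above could then be drawn with:
#   sns.countplot(x="Category", data=data)
```

Checking the counts first is useful because a strongly unbalanced category is one reason to tune for a single label rather than for the overall score.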

Step 3: Data Cleaning

Data cleaning is an essential part of NLP, so we clean the data with the following steps.

Tokenization splits an input text into smaller units, each of which is called a token. It is the first step of data cleaning, because the meaning of a text is analyzed through the words it contains and the later steps operate on those individual words.

data['text_clean'] = data['Text'].apply(nltk.word_tokenize)

Stop words are very common but uninformative words that you want to ignore. For tasks like text classification, where the text is assigned to one of several categories, stop words are removed so that more weight falls on the words that define the meaning of the text.

stop_words=set(nltk.corpus.stopwords.words("english"))
data['text_clean'] = data['text_clean'].apply(lambda x: [item for item in x if item not in stop_words])

Numbers, punctuation, and special characters add noise to the text and take up memory unnecessarily, so we remove them. The filter below keeps only the tokens that begin with a lowercase letter.

regex = '[a-z]+'
data['text_clean'] = data['text_clean'].apply(lambda x: [item for item in x if re.match(regex, item)])
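To see what this filter actually keeps, here is a small illustration on a hypothetical token list. Note that re.match only anchors at the start of the token, so a token such as "q4" passes while "3bn" and capitalized tokens are dropped. The re.fullmatch variant is shown as a stricter alternative, not as what the recipe uses.

```python
import re

regex = '[a-z]+'

# Hypothetical tokens, chosen to show the edge cases of the filter
tokens = ["profits", "rose", "12", "%", ",", "3bn", "q4", "Apple"]

# Same test as in the recipe: the pattern must match at the START of the token
kept = [t for t in tokens if re.match(regex, t)]
print(kept)    # ['profits', 'rose', 'q4']

# Stricter alternative: the WHOLE token must consist of lowercase letters
strict = [t for t in tokens if re.fullmatch(regex, t)]
print(strict)  # ['profits', 'rose']
```

If the text has not been lowercased beforehand, the capitalized tokens are lost with this pattern, so lowercasing first may be worth considering.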

Lemmatization maps the different inflected forms of a word to one common base form, called the lemma. It reduces the inflected words properly, ensuring that the resulting root is a valid word in the language. Here pos='v' tells the lemmatizer to treat each token as a verb.

lem = nltk.stem.wordnet.WordNetLemmatizer()
data['text_clean'] = data['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])
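fastText's supervised mode reads plain-text files with one example per line, where each label carries the __label__ prefix. The cleaned token lists therefore have to be written out in that form before training. The recipe does not show this step, so the following is a minimal sketch: the helper names, the sample rows, and the "Category" column mentioned in the comment are all assumptions.

```python
import os
import tempfile

def to_fasttext_line(label, tokens):
    """Format one example as '__label__<label> tok1 tok2 ...'."""
    return "__label__" + str(label) + " " + " ".join(tokens)

def write_fasttext_file(rows, path):
    """Write (label, tokens) pairs to path, one example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for label, tokens in rows:
            f.write(to_fasttext_line(label, tokens) + "\n")

# Hypothetical rows. With the DataFrame from the steps above this could be
# zip(data["Category"], data["text_clean"]), assuming a "Category" column.
rows = [
    ("entertainment", ["film", "win", "award"]),
    ("sport", ["striker", "score", "twice"]),
]

out_path = os.path.join(tempfile.mkdtemp(), "train.txt")
write_fasttext_file(rows, out_path)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines[0])  # __label__entertainment film win award
```

The training file and the validation file both need this layout, and the label text written here is what a per-label metric has to refer to.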

Step 4: Autotune a custom metric

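fastText's autotune feature automatically searches for the best hyperparameters for your dataset. By default it optimizes the overall F1 score on the validation file. To optimize the score of a specific category instead, say entertainment, pass autotuneMetric with the full label name, including the __label__ prefix. Both the training file and the validation file must be in fastText's format, with one example per line and each label prefixed by __label__.

model = fasttext.train_supervised(input='Solution.csv', autotuneValidationFile='BBC News Test.csv', autotuneMetric="f1:__label__entertainment")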
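To make clear what a per-label F1 metric measures, the following sketch computes the F1 score of one label from gold and predicted labels. It illustrates the standard definition under the assumption of one predicted label per example; it is not fastText's internal implementation, and the function name and sample data are hypothetical.

```python
def f1_for_label(gold, predicted, label):
    """F1 score of a single label, given one gold and one predicted label per example."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)  # correctly predicted
    fp = sum(1 for g, p in pairs if g != label and p == label)  # predicted but wrong
    fn = sum(1 for g, p in pairs if g == label and p != label)  # missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold labels and predictions for five validation examples
gold = ["entertainment", "entertainment", "sport", "tech", "entertainment"]
pred = ["entertainment", "sport", "sport", "entertainment", "entertainment"]

score = f1_for_label(gold, pred, "entertainment")
print(round(score, 3))  # 0.667
```

Tuning for a single label means that mistakes on the other categories affect the objective only when they involve that label, so it is worth checking the overall scores of the tuned model as well.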
