How to autotune a custom metric in fasttext

This recipe helps you autotune a custom metric in fasttext
Last Updated: 05 Sep 2022

Get access to Data Science projects View all Data Science projects

MACHINE LEARNING RECIPES DATA CLEANING PYTHON DATA MUNGING PANDAS CHEATSHEET ALL TAGS

Recipe Objective: How to autotune a custom metric in fasttext?

This recipe explains how to autotune a custom metric in fasttext.

Recipe Objective: How to autotune a custom metric in fasttext?

Step 1: Importing library

Let us first import the necessary libraries and download punkt, stopwords, wordnet using nltk.download

import re import nltk import fasttext import numpy as np import pandas as pd import seaborn as sns import matplotlib.pyplot as plt

Step 2: Data Set

We have used the BBC news data set with six unique tags: business, tech, politics, sport, and entertainment. We are plotting several samples in each category by using seaborn.countplot

Step 3: Data Cleaning

Data Cleaning is an essential part of NLP, so we’ll be cleaning the data by following steps.

Tokenization is essentially splitting an input value into smaller units, and each of these smaller units is called a token. It is the first step of data cleaning. This is important because the meaning of the text could easily be understood by analyzing the words present in the text.

data['text_clean'] = data['Text'].apply(nltk.word_tokenize)

Stop words are a list of prevalent but uninformative words that you want to ignore. For tasks like text classification, where the text is to be classified into different categories, stopwords are removed or excluded from the given text. More focus can be given to those words that define the meaning of the text.

stop_words=set(nltk.corpus.stopwords.words("english")) data['text_clean'] = data['text_clean'].apply(lambda x: [item for item in x if item not in stop_words])

Numbers, punctuation, and special characters add noise to the text and are of no use; also, they take unnecessary space in the memory, so we have to remove them.

regex = '[a-z]+' data['text_clean'] = data['text_clean'].apply(lambda x: [item for item in x if re.match(regex, item)])

Lemmatization groups different inflected forms of the word called lemma and maps these words into one common root. It reduces the inflected words properly, ensuring that the root word belongs to the language.

lem = nltk.stem.wordnet.WordNetLemmatizer() data['text_clean'] = data['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])

Step 3: Autotune a custom metric

Fasttext autotune feature allows you to find the best hyperparameter for your dataset automatically. If you want to optimize the score of a specific category, let’s say entertainment, we’ll execute the following command.

model = fasttext.train_supervised(input='Solution.csv', autotuneValidationFile='BBC News Test.csv', autotuneMetric="f1:entertainment")

What Users are saying..

Ed Godalle

Director Data Analytics at EY / EY Tech

I am the Director of Data Analytics with over 10+ years of IT experience. I have a background in SQL, Python, and Big Data working with Accenture, IBM, and Infosys. I am looking to enhance my skills... Read More