How to do text classification in NLP

This recipe helps you do text classification in NLP.

Recipe Objective

How to do text classification?

Text classification is the process of assigning a piece of text to a tag or category based on its content. It is used for real-world problems such as sentiment analysis, spam detection, and analyzing customer reviews.

Text classifiers can organize, structure, and categorize almost any kind of text, including documents, medical studies, files, and content from across the web. In this recipe we use a Naive Bayes classifier, which is generally considered to work well for text classification.

Step 1 - Import the necessary libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

Step 2 - Read the sample data and print it

df = pd.read_csv('/content/Customer_reviews_data.csv', encoding='cp1252')
df.head()

Here we load a sample dataset of customer reviews. The dataset contains only 10 records.

Step 3 - Replace the text with numbers

df['new_Labels'] = df['Labels'].apply(lambda v: 1 if v=='Positive' else 0)

Here we have created a new column, "new_Labels", which holds an integer version of the "Labels" column: "Positive" becomes 1 and every other value (here, "Negative") becomes 0.

df.head()
df.tail()
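As an alternative sketch, the same encoding can be written with `Series.map` and an explicit dictionary. The three reviews below are made up for illustration and are not the recipe's dataset; only the column names `Customer_Reviews`, `Labels`, and `new_Labels` come from the steps above. One difference from the lambda: a label that is neither "Positive" nor "Negative" becomes NaN instead of silently becoming 0, which makes typos in the label column easier to spot.

```python
import pandas as pd

# Hypothetical stand-in for the CSV; the real file is not reproduced here.
sample = pd.DataFrame({
    "Customer_Reviews": ["Good product", "Bad experience", "Best quality"],
    "Labels": ["Positive", "Negative", "Positive"],
})

label_map = {"Positive": 1, "Negative": 0}
sample["new_Labels"] = sample["Labels"].map(label_map)

# Rows whose label was not in label_map would show up as NaN here.
unmapped = int(sample["new_Labels"].isna().sum())
print(sample)
print("Unmapped labels:", unmapped)
```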

Step 4 - Split the data into train and test

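# Hold out part of the data for testing (default test_size is 0.25)
X_train, X_test, y_train, y_test = train_test_split(
    df['Customer_Reviews'], df['new_Labels'], random_state=1
)

# Lowercase the text, drop English stop words, keep tokens containing a letter
vectorizer = CountVectorizer(
    strip_accents='ascii',
    token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b',
    lowercase=True,
    stop_words='english',
)

# Learn the vocabulary on the training set only, then reuse it for the test set
X_train_cv = vectorizer.fit_transform(X_train)
X_test_cv = vectorizer.transform(X_test)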
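To see what the split does on a dataset this small, here is a minimal sketch that uses ten placeholder integers in place of the ten reviews (the integers and the alternating labels are assumptions made only for illustration). With the default `test_size` of 0.25, scikit-learn rounds the test set up to 3 records and trains on the remaining 7, which is why the word-count matrix in Step 5 has 7 rows.

```python
from sklearn.model_selection import train_test_split

# Ten placeholder items standing in for the ten reviews and their labels.
reviews = list(range(10))
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

train_x, test_x, train_y, test_y = train_test_split(
    reviews, labels, random_state=1
)

print(len(train_x), len(test_x))
```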

Step 5 - Convert the Customer_Reviews into word count vectors

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
Word_frequency = pd.DataFrame(X_train_cv.toarray(), columns=vectorizer.get_feature_names_out())
top_words = pd.DataFrame(Word_frequency.sum()).sort_values(0, ascending=False)
print(Word_frequency, '\n')
print(top_words)

   bad  best  buy  dont  experience  good  money  product  quality  value
0    0     1    0     0           1     0      0        0        0      0
1    0     0    0     0           0     1      0        1        0      0
2    0     0    0     0           0     1      0        0        0      0
3    0     1    0     0           0     0      0        0        1      0
4    1     0    1     1           0     0      0        1        0      0
5    1     0    0     0           0     0      0        0        0      0
6    0     1    0     0           1     0      1        0        0      1 

            0
best        3
bad         2
experience  2
good        2
product     2
buy         1
dont        1
money       1
quality     1
value       1

Above, we have converted the reviews into vectors, because the Naive Bayes classifier needs to know how many times each word appears in each document and how many times it appears in each category. We used CountVectorizer for the conversion. The first table shows the word frequency per training review, and the second shows the most frequent words overall.
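The counting described above can be sketched by hand. The snippet below is a simplified illustration in plain Python, not scikit-learn's implementation: the four tokenized reviews are invented, and add-one (Laplace) smoothing is assumed, which corresponds to the default `alpha=1.0` of `MultinomialNB`.

```python
import math
from collections import Counter

# Hypothetical tokenized reviews with labels (1 = positive, 0 = negative).
docs = [
    (["good", "product"], 1),
    (["best", "quality"], 1),
    (["bad", "product"], 0),
    (["dont", "buy", "bad"], 0),
]

vocab = sorted({w for words, _ in docs for w in words})
word_counts = {0: Counter(), 1: Counter()}  # word counts per category
doc_counts = Counter()                      # number of documents per category
for words, label in docs:
    word_counts[label].update(words)
    doc_counts[label] += 1

def log_score(words, label):
    """Log prior plus smoothed log likelihood of the words for one category."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(docs))
    for w in words:
        if w in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return score

def classify(words):
    """Return the category with the higher score."""
    return max((0, 1), key=lambda label: log_score(words, label))

print(classify(["good", "quality"]))
print(classify(["bad", "buy"]))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a word that never appeared in one category from forcing that category's probability to zero.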

Step 6 - Fit the model and make the predictions

naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_cv, y_train)
predictions = naive_bayes.predict(X_test_cv)
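Once fitted, the same vectorizer and model can score text they have not seen before. The following is a self-contained sketch on six invented reviews (not the recipe's CSV), using a default `CountVectorizer` for brevity; the variable names mirror the ones used above, while `new_reviews` and `new_predictions` are names introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data (1 = positive, 0 = negative).
train_reviews = [
    "good product", "best quality", "good value",
    "bad product", "bad experience", "dont buy",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X_train_cv = vectorizer.fit_transform(train_reviews)

naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_cv, train_labels)

# New text must go through transform (not fit_transform) so it is
# mapped onto the vocabulary learned from the training data.
new_reviews = ["good quality", "bad value dont buy"]
new_predictions = naive_bayes.predict(vectorizer.transform(new_reviews))
print(new_predictions)
```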

Step 7 - Print the results

print('Accuracy score for Customer Reviews model is: ', accuracy_score(y_test, predictions), '\n')
print('Precision score for Customer Reviews model is: ', precision_score(y_test, predictions), '\n')

Accuracy score for Customer Reviews model is:  0.6666666666666666 
Precision score for Customer Reviews model is:  0.5 

These results come from a sample dataset with only 10 records, so they should not be read as a reliable measure of the model; with more data it can generally be expected to perform better. Here is what the accuracy and precision scores tell us:

Accuracy Score tells us what fraction of all the predictions we made are correct.

Precision Score tells us what fraction of the reviews we predicted as positive are actually positive.
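As a worked example, consider three test reviews. The labels and predictions below are hypothetical: they are one combination that is consistent with the scores printed above, not the actual test rows. Accuracy counts correct predictions of either class, while precision only looks at the reviews predicted as positive.

```python
# Hypothetical true labels and predictions for 3 test reviews.
y_true = [1, 0, 0]
y_pred = [1, 1, 0]

# Accuracy: correct predictions divided by all predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Precision: true positives divided by everything predicted positive.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
predicted_positives = sum(p == 1 for p in y_pred)
precision = true_positives / predicted_positives

print(accuracy)   # 2 of 3 predictions are correct
print(precision)  # 1 of 2 predicted positives is truly positive
```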
