How to do text classification?

This recipe helps you do text classification


Recipe Objective

Text classification is the process of assigning a piece of text to a particular tag or category depending on its content. It is used in real-world problems such as sentiment analysis, spam detection, analyzing customer reviews and many more.

Text classifiers can be used to organize, structure and categorize almost any type of text: documents, medical studies, files, and content from all over the web. For this recipe we are going to use a Naive Bayes classifier, which is considered a good choice for text classification.

Step 1 - Import the necessary libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

Step 2 - Read the sample data and print it

df = pd.read_csv('/content/Customer_reviews_data.csv', encoding='cp1252')
df.head()

Here we use a sample dataset of customer reviews that we created for this recipe; it contains only 10 records.
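If you do not have the CSV file at hand, a small in-memory stand-in lets you follow the remaining steps. The sketch below is only an assumption about the file's shape: it uses the two column names that appear in this recipe (Customer_Reviews and Labels), but the review texts are invented for illustration and are not the original data, and the name sample_df is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for Customer_reviews_data.csv.
# The review texts are invented; only the column names come from the recipe.
sample_df = pd.DataFrame({
    "Customer_Reviews": [
        "Good product",
        "Best quality I have seen",
        "Bad experience, dont buy this product",
        "Very good",
        "Best value for money",
        "Bad quality",
        "Best experience so far",
        "Not worth the money",
        "Good value",
        "Bad product",
    ],
    "Labels": [
        "Positive", "Positive", "Negative", "Positive", "Positive",
        "Negative", "Positive", "Negative", "Positive", "Negative",
    ],
})

print(sample_df.shape)                     # number of rows and columns
print(sample_df["Labels"].value_counts())  # how many reviews per label
```

Assigning sample_df to df would let the later steps run unchanged, although the printed outputs would differ from the ones shown in this recipe.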

Step 3 - Replace the text with numbers

df['new_Labels'] = df['Labels'].apply(lambda v: 1 if v=='Positive' else 0)

Here we have created a new column, "new_Labels", which holds integer values for the "Labels" column: "Positive" is replaced with 1 and "Negative" is replaced with 0.

df.head()
df.tail()
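One thing to keep in mind about the lambda in this step is that it turns every value other than 'Positive' into 0, including typos or unexpected labels. A possible alternative, sketched here on a small hypothetical Series rather than on the recipe's data, is to use an explicit mapping so that unexpected values become NaN and can be spotted:

```python
import pandas as pd

# Hypothetical label column containing one unexpected value ("neutral").
labels = pd.Series(["Positive", "Negative", "Positive", "neutral"])

# Explicit mapping: anything not listed in the dictionary becomes NaN.
encoded = labels.map({"Positive": 1, "Negative": 0})

print(encoded)
print("Unmapped labels:", encoded.isna().sum())
```

Whether silent zeros or visible NaN values are preferable depends on how clean the label column is; for a hand-made 10-record dataset the lambda is sufficient.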

Step 4 - Split the data into train and test

# Hold out part of the data for testing (train_test_split keeps 25% by default)
X_train, X_test, y_train, y_test = train_test_split(df['Customer_Reviews'], df['new_Labels'], random_state=1)

# Learn the vocabulary from the training text only, then reuse it for the test text
vectorizer = CountVectorizer(strip_accents='ascii', token_pattern=r'(?ui)\b\w*[a-z]+\w*\b', lowercase=True, stop_words='english')
X_train_cv = vectorizer.fit_transform(X_train)
X_test_cv = vectorizer.transform(X_test)

Step 5 - Convert the Customer_Reviews into word count vectors

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
Word_frequency = pd.DataFrame(X_train_cv.toarray(), columns=vectorizer.get_feature_names_out())
top_words = pd.DataFrame(Word_frequency.sum()).sort_values(0, ascending=False)
print(Word_frequency, '\n')
print(top_words)
   bad  best  buy  dont  experience  good  money  product  quality  value
0    0     1    0     0           1     0      0        0        0      0
1    0     0    0     0           0     1      0        1        0      0
2    0     0    0     0           0     1      0        0        0      0
3    0     1    0     0           0     0      0        0        1      0
4    1     0    1     1           0     0      0        1        0      0
5    1     0    0     0           0     0      0        0        0      0
6    0     1    0     0           1     0      1        0        0      1 

best        3
bad         2
experience  2
good        2
product     2
buy         1
dont        1
money       1
quality     1
value       1

Above, we converted the reviews into word count vectors, because the Naive Bayes classifier needs to know how many times each word appears in each document and how many times it appears in each category. We used CountVectorizer for the conversion; the output shows the word frequency table for the training reviews, followed by the top words.
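To make the role of these counts more concrete, here is a minimal sketch in plain Python of how per-category word counts can be turned into smoothed word probabilities. The three tokenized reviews and the helper word_likelihood are hypothetical; the smoothing uses alpha=1.0, which is also the default of MultinomialNB, but this is an illustration of the idea and not scikit-learn's actual implementation.

```python
from collections import Counter

# Hypothetical tokenized reviews with labels (1 = positive, 0 = negative).
docs = [
    (["good", "product"], 1),
    (["best", "quality"], 1),
    (["bad", "product"], 0),
]

# Count how often each word occurs within each category.
counts = {0: Counter(), 1: Counter()}
for tokens, label in docs:
    counts[label].update(tokens)

# Vocabulary: every distinct word seen in any review.
vocab = sorted({word for tokens, _ in docs for word in tokens})

def word_likelihood(word, label, alpha=1.0):
    """Laplace-smoothed estimate of P(word | label)."""
    total = sum(counts[label].values())
    return (counts[label][word] + alpha) / (total + alpha * len(vocab))

print(word_likelihood("good", 1))  # "good" was seen in a positive review
print(word_likelihood("good", 0))  # never seen in a negative review
```

The smoothing term keeps a word that was never seen in a category from getting a probability of exactly zero, which would otherwise wipe out the score of any review containing it.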

Step 6 - Fit the model and make the predictions

naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_cv, y_train)
predictions = naive_bayes.predict(X_test_cv)
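Vectorizing and fitting can also be bundled into a single object. The following is a minimal sketch, assuming scikit-learn's make_pipeline and a tiny invented training set; it uses CountVectorizer with default settings rather than the options chosen in Step 4, and the names train_texts, train_labels and model are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training reviews (1 = positive, 0 = negative), not the recipe's data.
train_texts = [
    "good product", "best quality", "great value",
    "bad product", "poor quality", "dont buy",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes the raw text and then fits the classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# New raw text can be passed directly; the pipeline reuses the fitted vocabulary.
predictions = model.predict(["good value", "bad buy"])
print(predictions)
```

A pipeline like this makes it harder to accidentally fit the vectorizer on the test text, because the same fitted steps are applied to any new input.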

Step 7 - Print the results

print('Accuracy score for Customer Reviews model is: ', accuracy_score(y_test, predictions), '\n')
print('Precision score for Customer Reviews model is: ', precision_score(y_test, predictions), '\n')
Accuracy score for Customer Reviews model is:  0.6666666666666666 
Precision score for Customer Reviews model is:  0.5 

These results are based on a sample dataset that has only 10 records, so they are not reliable estimates; with more data the model can generally be expected to give better results. Now we will understand what the accuracy and precision scores tell us:

The accuracy score tells us what fraction of all the predictions we made are correct.

The precision score tells us what fraction of the reviews we predicted as positive are actually positive.
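Both scores can be worked out by hand from the counts of correct and incorrect predictions. The sketch below uses hypothetical labels for a three-record test set, chosen to be consistent with the scores printed above; they are not taken from the actual test split.

```python
# Hypothetical true labels and predictions for three test reviews
# (1 = positive, 0 = negative).
y_true = [1, 0, 0]
y_pred = [1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(y_true)  # correct predictions / all predictions
precision = tp / (tp + fp)          # correct positives / predicted positives
recall = tp / (tp + fn)             # correct positives / actual positives

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

Recall is included because recall_score is imported in Step 1; it answers the complementary question of how many of the truly positive reviews were found.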
