How to do text classification?

This recipe helps you do text classification

Recipe Objective

How to do text classification?

Text classification is the process of assigning a piece of text to a tag or category based on its content. It is used in real-world problems such as sentiment analysis, spam detection, and analyzing customer reviews.

Text classifiers can organize, structure, and categorize almost any kind of text: documents, medical studies, files, and content from across the web. In this recipe we use a Naive Bayes classifier, which is generally considered a good fit for text classification.

Step 1 - Import the necessary libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

Step 2 - Read the sample data and print it

df = pd.read_csv('/content/Customer_reviews_data.csv', encoding='cp1252')
df.head()

Here we use a small sample dataset of customer reviews; it contains only 10 records.

Step 3 - Replace the text with numbers

df['new_Labels'] = df['Labels'].apply(lambda v: 1 if v=='Positive' else 0)

Here we create a new column, "new_Labels", that holds integer values for the "Labels" column: "Positive" becomes 1 and everything else, here "Negative", becomes 0.

df.head()
df.tail()
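Because the lambda above turns every value other than "Positive" into 0, a misspelled or missing label would silently be counted as negative. A minimal sketch of a stricter alternative, using an explicit dictionary with `Series.map`, is shown here. The small DataFrame is a hypothetical stand-in for the recipe's CSV, and `label_map` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for the recipe's CSV; the real file is not reproduced here.
df = pd.DataFrame({
    "Customer_Reviews": ["Good product", "Bad quality", "Best value"],
    "Labels": ["Positive", "Negative", "Positive"],
})

# Explicit mapping: any label not listed here becomes NaN instead of 0.
label_map = {"Positive": 1, "Negative": 0}
df["new_Labels"] = df["Labels"].map(label_map)

# Fail early if the data contains a label the mapping does not cover.
assert df["new_Labels"].notna().all(), "unmapped label found"
df["new_Labels"] = df["new_Labels"].astype(int)

print(df)
```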

Step 4 - Split the data into train and test

X_train, X_test, y_train, y_test = train_test_split(df['Customer_Reviews'], df['new_Labels'], random_state=1)
vectorizer = CountVectorizer(strip_accents='ascii', token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b', lowercase=True, stop_words='english')
X_train_cv = vectorizer.fit_transform(X_train)
X_test_cv = vectorizer.transform(X_test)
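With only 10 records, it helps to check how many rows end up in each part. The sketch below uses hypothetical placeholder reviews and the same call shape as the recipe: no `test_size` is given, so scikit-learn's default of 0.25 applies and 10 rows are split into 7 for training and 3 for testing. The second call, with `stratify`, is an optional variation that is not part of the recipe.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 placeholder reviews with balanced labels.
reviews = [f"review {i}" for i in range(10)]
labels = [1, 0] * 5

# Same call shape as the recipe: the default test_size of 0.25 applies.
X_train, X_test, y_train, y_test = train_test_split(reviews, labels, random_state=1)
print(len(X_train), len(X_test))

# Optional variation (not in the recipe): stratify keeps the class ratio
# similar in both parts, which matters on very small datasets.
X_tr, X_te, y_tr, y_te = train_test_split(
    reviews, labels, test_size=0.3, random_state=1, stratify=labels
)
print(sum(y_tr), sum(y_te))
```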

Step 5 - Convert the Customer_Reviews into word count vectors

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
Word_frequency = pd.DataFrame(X_train_cv.toarray(), columns=vectorizer.get_feature_names_out())
top_words = pd.DataFrame(Word_frequency.sum()).sort_values(0, ascending=False)
print(Word_frequency, '\n')
print(top_words)
   bad  best  buy  dont  experience  good  money  product  quality  value
0    0     1    0     0           1     0      0        0        0      0
1    0     0    0     0           0     1      0        1        0      0
2    0     0    0     0           0     1      0        0        0      0
3    0     1    0     0           0     0      0        0        1      0
4    1     0    1     1           0     0      0        1        0      0
5    1     0    0     0           0     0      0        0        0      0
6    0     1    0     0           1     0      1        0        0      1 

            0
best        3
bad         2
experience  2
good        2
product     2
buy         1
dont        1
money       1
quality     1
value       1

Above, we converted the reviews into count vectors, because the Naive Bayes classifier needs to know how many times each word appears in each document and in each category. We used CountVectorizer for the conversion; the output shows the per-document word frequencies and the most frequent words.
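To make the per-category counting concrete, the sketch below tallies words by hand for a few hypothetical reviews and derives a smoothed estimate of P(word | class). It is not scikit-learn's internal code, only an illustration of the quantities a multinomial Naive Bayes model is built from; `word_prob` is an illustrative helper, and `alpha=1.0` mirrors the default smoothing of `MultinomialNB`.

```python
from collections import Counter, defaultdict

# Hypothetical labelled reviews (1 = positive, 0 = negative).
reviews = [
    ("good product", 1),
    ("best experience", 1),
    ("bad product", 0),
]

# Count how often each word occurs within each class.
counts_per_class = defaultdict(Counter)
for text, label in reviews:
    counts_per_class[label].update(text.lower().split())

for label, counts in sorted(counts_per_class.items()):
    print(label, dict(counts))

# All distinct words seen in any class.
vocabulary = {word for counts in counts_per_class.values() for word in counts}

def word_prob(word, label, alpha=1.0):
    """Additively smoothed estimate of P(word | class)."""
    counts = counts_per_class[label]
    return (counts[word] + alpha) / (sum(counts.values()) + alpha * len(vocabulary))

print(word_prob("good", 1), word_prob("good", 0))
```

Smoothing keeps a word that never appeared in a class, such as "good" in the negative reviews here, from getting a probability of exactly zero.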

Step 6 - Fit the model and make the predictions

naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_cv, y_train)
predictions = naive_bayes.predict(X_test_cv)
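Once the model is fitted, unseen text has to pass through the same fitted vectorizer with `transform` (not `fit_transform`) before calling `predict`. The sketch below shows this on hypothetical training and new reviews, and it uses a plain `CountVectorizer` rather than the recipe's exact settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data (1 = positive, 0 = negative).
train_texts = ["good product", "best experience", "bad product", "bad quality"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer(lowercase=True)
X_train_cv = vectorizer.fit_transform(train_texts)

naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_cv, train_labels)

# New text is transformed with the already-fitted vocabulary; words the
# vectorizer never saw during fitting are ignored.
new_reviews = ["good experience", "bad bad quality"]
X_new_cv = vectorizer.transform(new_reviews)

predictions = naive_bayes.predict(X_new_cv)
probabilities = naive_bayes.predict_proba(X_new_cv)
print(predictions)
print(probabilities.round(3))
```

`predict_proba` returns one row per review and one column per class, which is useful when a hard 0/1 label hides how uncertain the model is.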

Step 7 - Print the results

print('Accuracy score for Customer Reviews model is: ', accuracy_score(y_test, predictions), '\n')
print('Precision score for Customer Reviews model is: ', precision_score(y_test, predictions), '\n')
Accuracy score for Customer Reviews model is:  0.6666666666666666 
Precision score for Customer Reviews model is:  0.5 

These results come from a sample dataset of only 10 records, so the scores are not reliable; with more data the results should generally improve. Here is what the accuracy and precision scores tell us:

Accuracy score: out of all the predictions we made, the fraction that are correct.

Precision score: out of all the reviews we predicted as positive, the fraction that are actually positive.
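The two definitions can be checked by hand. The labels below are hypothetical, chosen only so that a three-review test set gives the same accuracy and precision as the scores printed above; they are not the recipe's actual predictions. Recall is included because `recall_score` is imported in Step 1 but never used.

```python
# Hypothetical true labels and predictions for a 3-review test set.
y_true = [1, 0, 0]
y_pred = [1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
correct = sum(1 for t, p in pairs if t == p)

accuracy = correct / len(pairs)   # share of all predictions that are right
precision = tp / (tp + fp)        # share of predicted positives that are right
recall = tp / (tp + fn)           # share of actual positives that were found

print(accuracy, precision, recall)
```

With so few test rows, a single changed prediction moves accuracy by a third, which is why scores on a 10-record dataset should be read as a demonstration rather than an evaluation.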
