How to create a word cloud in Python?

This recipe helps you understand how to create a word cloud in Python.

How to create a Wordcloud in Python

In this tutorial, let us understand how to generate word clouds in Python! Yes, you read that right. Word clouds!

A word cloud is a visualization technique for representing the frequency of words in a text, where the size of each word is proportional to how often it appears.

In order to work with word clouds in Python, we will first have to install a few libraries using pip. They are numpy (for array manipulation), pandas (for reading the data), pillow (for image handling), matplotlib (for generating plots) and finally wordcloud (for generating the word clouds).


    pip install numpy
    pip install pandas
    pip install pillow
    pip install matplotlib
    pip install wordcloud

In this tutorial, we will be using the Restaurant Reviews dataset from Kaggle. The dataset can be found here.

To begin with, we have to import the necessary libraries.


    # Importing Libraries
    import pandas as pd
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud, STOPWORDS

Now, let us read the data into a dataframe using pandas.


    # Importing Dataset
    df = pd.read_csv("Restaurant_Reviews.tsv", sep="\t")

The data contains two columns namely 'Review' and 'Liked'. The 'Review' column is the actual review written by the customer and the 'Liked' column is a binary variable which states whether or not the customer liked the food.
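
Before building the word cloud, it can help to take a quick look at the data. The snippet below is just a quick sanity check (assuming the dataframe df was loaded as shown above):

    # Quick sanity check on the data (assumes df was loaded as shown above)
    print(df.shape)                     # number of reviews and columns
    print(df.head())                    # first few reviews
    print(df["Liked"].value_counts())   # count of positive vs negative reviews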

The next step is to generate a text variable which contains all the reviews combined into a single string. This can be done using the join function available in Python.


    # Creating the text variable: combine all reviews into one string
    text = " ".join(review for review in df.Review)

Now, we have reached the important step of creating the wordcloud! We can generate the wordcloud in the following manner.


    # Generate word cloud
    word_cloud = WordCloud(
        width=3000,
        height=2000,
        random_state=1,
        background_color="salmon",
        colormap="Pastel1",
        collocations=False,
        stopwords=STOPWORDS,
    ).generate(text)

The WordCloud class provides a lot of parameters that we can tweak according to our needs. Let us understand a few of them; a short sketch showing how to tweak some of these follows the list.

  • width/height : To adjust the height and width of the wordcloud
  • random_state : To recreate the same plot every time we run the function. The random_state parameter has to be an integer value.
  • background_color : To set a background colour. The default value for this parameter is 'black'. Any standard colour name (such as 'salmon' or 'white') can be used.
  • colormap : To set the colour theme for the words. Any matplotlib colormap (such as 'Pastel1' or 'Set2') can be used. The default value is 'viridis'.
  • collocations : To include bigrams (pairs of words that frequently occur together) when set to True. The default value is True.
  • stopwords : To set the list of words that need to be eliminated. This list can include trivial words like this, that, is, was, the, etc. If this parameter is set to None, the function will fall back to its built-in STOPWORDS list.
  • max_font_size : To set the maximum font size of the largest word.
  • normalize_plurals : To keep or remove the trailing 's' from the words
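
As a rough sketch of how some of these parameters can be combined (the extra stop words and font-size cap below are illustrative assumptions, not values from the original recipe), assuming text is the combined review string created earlier:

    # Illustrative sketch: tweaking a few WordCloud parameters
    # The extra stop words and the max_font_size value are assumptions for demonstration
    custom_stopwords = set(STOPWORDS)
    custom_stopwords.update(["food", "place"])   # hypothetical domain-specific words to drop

    word_cloud_tuned = WordCloud(
        width=3000,
        height=2000,
        background_color="white",
        colormap="viridis",
        collocations=False,
        stopwords=custom_stopwords,
        max_font_size=200,        # cap the size of the largest word
        normalize_plurals=True,   # merge words that differ only by a trailing 's'
    ).generate(text)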

Now comes the last step, where we plot the generated word cloud using the imshow() function of matplotlib.


    # Display the generated Word Cloud
    plt.imshow(word_cloud)
    plt.axis("off")
    plt.show()
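
If you would also like to save the image to disk (a step not covered above), the WordCloud object provides a to_file method; the filename below is just an example:

    # Optional: save the generated word cloud as a PNG file
    # The filename "restaurant_wordcloud.png" is an arbitrary example
    word_cloud.to_file("restaurant_wordcloud.png")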

Complete Code
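
For reference, here are the snippets above assembled into a single script:

    # Complete code: word cloud from restaurant reviews
    import pandas as pd
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud, STOPWORDS

    # Read the reviews dataset
    df = pd.read_csv("Restaurant_Reviews.tsv", sep="\t")

    # Combine all reviews into a single string
    text = " ".join(review for review in df.Review)

    # Generate the word cloud
    word_cloud = WordCloud(
        width=3000,
        height=2000,
        random_state=1,
        background_color="salmon",
        colormap="Pastel1",
        collocations=False,
        stopwords=STOPWORDS,
    ).generate(text)

    # Display the generated word cloud
    plt.imshow(word_cloud)
    plt.axis("off")
    plt.show()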

Output: the word cloud image generated by the code above (not reproduced here).

Let us make this a little more interesting! So far, we saw how to generate the word cloud on a plain canvas. What if I told you that you can create these word clouds in different shapes? Sounds interesting, isn't it? Well, that is not difficult at all. All we have to do is pick an image of the shape that we would like the word cloud to take. For example, I have used an image saved as comment.png.

Let us make use of the pillow and numpy libraries to read this image and store it in a variable called mask.


    # Importing Libraries
    import pandas as pd
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud, STOPWORDS
    from PIL import Image
    import numpy as np

    # Import image to np.array
    mask = np.array(Image.open('comment.png'))

We can pass this mask as a parameter to the wordcloud function like so.


    # Generate word cloud
    word_cloud2 = WordCloud(
        width=3000,
        height=2000,
        random_state=123,
        background_color="purple",
        colormap="Set2",
        collocations=False,
        stopwords=STOPWORDS,
        mask=mask
    ).generate(text)

Complete Code
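
Putting the masked version together (the display step mirrors the one used for the plain word cloud earlier):

    # Complete code: word cloud in the shape of a mask image
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from PIL import Image
    from wordcloud import WordCloud, STOPWORDS

    # Read the reviews and combine them into a single string
    df = pd.read_csv("Restaurant_Reviews.tsv", sep="\t")
    text = " ".join(review for review in df.Review)

    # Read the mask image into a numpy array
    mask = np.array(Image.open("comment.png"))

    # Generate the word cloud constrained to the mask shape
    word_cloud2 = WordCloud(
        width=3000,
        height=2000,
        random_state=123,
        background_color="purple",
        colormap="Set2",
        collocations=False,
        stopwords=STOPWORDS,
        mask=mask
    ).generate(text)

    # Display the masked word cloud
    plt.imshow(word_cloud2)
    plt.axis("off")
    plt.show()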

Output: the word cloud rendered in the shape of the mask image (not reproduced here).

Ta-da! There you go! I hope you enjoyed reading this tutorial as much as I enjoyed writing it :)
