HANDS-ON-LAB

Generate Word Embeddings for Bitcoin Project

Problem Statement

Build Bitcoin domain-specific word embeddings using Word2Vec and FastText in Python.

Dataset

The dataset is a corpus of news articles about Bitcoin, web scraped from various sources on the Internet using the Newscatcher API.

Use the Summary column to train your word embeddings.

Kindly download the data from here.

Tasks

  1. Load the dataset and select only the Summary column. Then create a new column named “word_count” containing the total number of words in each summary.

  2. Text preprocessing

    • Remove null values from the Summary column

    • Remove Stopwords from the Summary column (Hint: NLTK)

    • Remove punctuation, URLs, and special characters from the Summary column (Hint: regex)

    • Create a new column named “word_count_clean” based on the cleaned Summary column

  3. Perform a spell check on the data to correct misspelled words, then apply lemmatization to reduce each word to its root form. (Hint: for spell check, use the autocorrect or pyspellchecker library.)

  4. Build a word cloud from the cleaned Summary data.

  5. Create word embeddings using the Word2Vec and FastText models. Then, using each trained model, print the top 5 words most similar to “Bitcoin”.
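Tasks 1 and 2 can be sketched as below. The toy DataFrame stands in for the real dataset so the snippet runs on its own (in the lab, read the downloaded CSV instead), and the tiny inline stopword set stands in for NLTK's `stopwords.words("english")`.

```python
import re
import pandas as pd

# Toy stand-in for the real dataset; in the lab you would instead run
# something like df = pd.read_csv(...)[["Summary"]] on the downloaded file.
df = pd.DataFrame({"Summary": [
    "Bitcoin hits a new high! Read more at https://example.com",
    None,
    "Analysts say the Bitcoin rally may continue",
]})

# Task 1: raw word count per summary.
df["word_count"] = df["Summary"].str.split().str.len()

# Task 2: drop null summaries, then clean the text.
df = df.dropna(subset=["Summary"]).copy()

# Tiny inline stopword set for this sketch; in the lab use NLTK:
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words("english"))
stop_words = {"a", "the", "at", "say", "may"}

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)               # strip punctuation/specials
    return " ".join(w for w in text.split() if w not in stop_words)

df["clean_summary"] = df["Summary"].apply(clean_text)
df["word_count_clean"] = df["clean_summary"].str.split().str.len()

# Task 4 later builds a word cloud from these cleaned summaries, e.g.
#   from wordcloud import WordCloud
#   WordCloud().generate(" ".join(df["clean_summary"]))
print(df[["word_count", "clean_summary", "word_count_clean"]])
```

Comparing "word_count" with "word_count_clean" shows how much of each summary the cleaning removed.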

 

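For task 3, the lab points to the autocorrect or pyspellchecker libraries plus NLTK's WordNetLemmatizer. The stdlib-only sketch below mimics both steps with `difflib` and crude suffix rules, so the idea is visible without extra downloads; the vocabulary list is made up for illustration.

```python
import difflib

# Hypothetical in-vocabulary words; pyspellchecker and autocorrect derive
# candidates from their own frequency dictionaries instead.
vocab = ["bitcoin", "mining", "wallet", "price", "market", "rally"]

def correct(word: str) -> str:
    """Snap a misspelled word to its closest vocabulary entry."""
    match = difflib.get_close_matches(word, vocab, n=1, cutoff=0.8)
    return match[0] if match else word

def lemmatize(word: str) -> str:
    """Crude suffix rules standing in for nltk.stem.WordNetLemmatizer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

tokens = ["bitcoiin", "markets", "rallies", "mining"]
print([lemmatize(correct(t)) for t in tokens])  # spell-check, then lemmatize
```

In the lab, replace `correct` with `SpellChecker().correction` (pyspellchecker) or `Speller()` (autocorrect), and `lemmatize` with `WordNetLemmatizer().lemmatize`.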
Explore the top 5 similar words to "Bitcoin" using the trained Word2Vec and FastText models.

FAQs

Q1. What is the source of the Bitcoin news articles dataset?

The dataset consists of news articles web scraped from various sources using the Newscatcher API.

 

Q2. How can I preprocess the text data in the Summary column?

Apply text preprocessing steps such as removing null values, stopwords, punctuation, URLs, and special characters, using the NLTK and regex (re) libraries.

 

Q3. What techniques are used to create word embeddings?

Word2Vec and FastText models are employed to generate word embeddings for the Bitcoin news dataset.