Construct dictionary without loading all texts in memory in gensim

This recipe explains how to construct a dictionary without loading all the texts into memory, using gensim in Python.

Recipe Objective: How to construct the dictionary without loading all texts into memory?

The script below builds the dictionary by reading the source file one line at a time, so the full text is never held in memory:

#importing required library
from gensim import corpora

#creating a set of stopwords
stoplist = set('for a of the and to in'.split())

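#collecting statistics about all tokens from a sample file, one line at a time
#(the generator yields one tokenized line per step; the with block closes the file afterwards)
with open(r'C:\document.txt', encoding='utf-8') as f:
    gensim_dictionary = corpora.Dictionary(line.lower().split() for line in f)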

#ids of stop words
stop_ids = [gensim_dictionary.token2id[stopword] for stopword in stoplist if stopword in gensim_dictionary.token2id]

#ids of words that appear only once
single_ids = [tokenid for tokenid, docfreq in gensim_dictionary.dfs.items() if docfreq == 1]

#removing stop words and words that appear only once
gensim_dictionary.filter_tokens(stop_ids + single_ids)

#removing gaps in id sequence after words that were removed
gensim_dictionary.compactify()

#displaying the dictionary
print(gensim_dictionary)

Output:
Dictionary(2 unique tokens: ['document', 'corpus'])
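To make the streaming idea concrete, here is a minimal standard-library sketch of the same steps: one pass over the lines, a document-frequency count per token, then removal of stop words and tokens that occur in only one line. It is an illustration under stated assumptions, not gensim's actual implementation. The function name build_vocabulary, the in-memory sample text, and the alphabetical id assignment are all hypothetical choices made for this example.

```python
import io
from collections import Counter


def build_vocabulary(lines, stoplist=frozenset()):
    """Hypothetical helper: build a token -> id mapping in one streaming pass.

    `lines` can be any iterable of strings (for example an open file), so
    only one line is held in memory at a time.
    """
    doc_freq = Counter()
    for line in lines:
        # set() counts each token at most once per line (one line = one document)
        doc_freq.update(set(line.lower().split()))

    # drop stop words and tokens that appear in only one document
    kept = sorted(
        token for token, freq in doc_freq.items()
        if freq > 1 and token not in stoplist
    )
    # assign gap-free ids (alphabetical order is an assumption of this sketch)
    token2id = {token: idx for idx, token in enumerate(kept)}
    return token2id, doc_freq


# io.StringIO stands in for a file on disk; it is iterated line by line
sample = io.StringIO(
    "The corpus of a document\n"
    "A document in the corpus\n"
    "Only once\n"
)
stoplist = set('for a of the and to in'.split())

token2id, doc_freq = build_vocabulary(sample, stoplist)
print(token2id)  # {'corpus': 0, 'document': 1}
```

Because the function only iterates over its input, passing an open file handle instead of the StringIO object would work the same way, and memory use would depend on the vocabulary size rather than on the size of the file.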
