What are embeddings in NLP and how to use them

This recipe explains what embeddings are in NLP and how to use them.
Last Updated: 18 Jul 2022

Recipe Objective

What are embeddings and how do we use them? Embeddings translate large sparse vectors into a lower-dimensional space that preserves semantic relationships. Word embedding is a technique in which individual words of a language are represented as real-valued vectors in that lower-dimensional space; in other words, they are distributed representations of text in an n-dimensional space. Technically, it is a mapping of words to vectors of real numbers, learned with a neural network, a probabilistic model, or dimensionality reduction on a word co-occurrence matrix. It is a language modeling and feature learning technique. This recipe uses the neural-network approach, through gensim's Word2Vec.
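
To make the contrast between sparse and dense representations concrete, the sketch below compares one-hot vectors with small dense vectors. The two-dimensional dense values are invented for illustration and do not come from a trained model; the helper name cosine is likewise just a choice for this example.

```python
import math

vocab = ["football", "soccer", "play", "also", "to", "wants"]

# Sparse one-hot vectors: one dimension per vocabulary word
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Dense vectors: hand-picked illustrative values, NOT learned from data
dense = {
    "football": [0.9, 0.1],
    "soccer": [0.8, 0.2],
    "also": [0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Any two different one-hot vectors are orthogonal, so similarity is 0.0
print(cosine(one_hot["football"], one_hot["soccer"]))

# Dense vectors can place related words closer than unrelated ones
print(cosine(dense["football"], dense["soccer"]))
print(cosine(dense["football"], dense["also"]))
```

One-hot vectors grow with the vocabulary and treat every pair of words as equally unrelated, while the dense vectors stay short and can encode that "football" is nearer to "soccer" than to "also".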

Step 1 - Import the necessary libraries

import pandas as pd
from gensim.models import word2vec

Step 2 - Take a Sample Text

text1 = ["jack wants to play football","Heena also loves to play football"]

Step 3 - Split the text and create a model for it

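# Split each sentence on whitespace to get a list of tokens
tokenized_sentences = [sentence.split() for sentence in text1]
# min_count=1 keeps every word, even those that appear only once
model1 = word2vec.Word2Vec(tokenized_sentences, min_count=1)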
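
Splitting on whitespace keeps capitalization and punctuation, so "Football" and "football," would become separate vocabulary entries. A possible normalization step is sketched below; the tokenize helper and its regular expression are assumptions made for illustration and are not part of the recipe itself.

```python
import re

def tokenize(sentence):
    """Lowercase a sentence and return its word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

text1 = ["jack wants to play football", "Heena also loves to play football"]
tokenized_sentences = [tokenize(sentence) for sentence in text1]
print(tokenized_sentences)
```

Whether to lowercase depends on the task: it shrinks the vocabulary, but it also merges names such as "Heena" with any lowercase spelling of the same word.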

Step 4 - Summarize vocabulary

# model1.wv.vocab works in gensim 3.x; gensim 4.x replaced it with model1.wv.index_to_key
words = list(model1.wv.vocab)
print(words)

['jack', 'wants', 'to', 'play', 'football', 'Heena', 'also', 'loves']

Here we can see that only the unique words of the sample text are printed; repeated words such as "to", "play", and "football" appear once.
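
The same effect can be reproduced without gensim. The sketch below collects the unique tokens of the sample text in order of first appearance, which matches the list printed above; it only illustrates the idea and is not how gensim builds its vocabulary internally. It also counts occurrences, which is the quantity that min_count is compared against.

```python
from collections import Counter

text1 = ["jack wants to play football", "Heena also loves to play football"]
tokens = [token for sentence in text1 for token in sentence.split()]

# dict.fromkeys keeps the first occurrence of each token, in order
unique_words = list(dict.fromkeys(tokens))
print(unique_words)

# Word frequencies: a word is kept when its count is at least min_count
counts = Counter(tokens)
print(counts["football"], counts["jack"])
```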

Step 5 - Access vector for one word

# Access vectors through model1.wv; indexing the model directly is deprecated
print(model1.wv['football'])

[-1.40790280e-03  4.58865520e-03 -4.95769829e-03 -1.27252412e-03
  4.81374608e-03  2.77659670e-03 -3.98405176e-03  1.86388765e-03
 -3.97940027e-03  4.20716731e-03  4.15110635e-03 -5.57424966e-04
 -2.3193317e-03 -2.26494414e-03 -4.22752928e-03  3.89819825e-03
 -5.17438224e-04  2.30374443e-03  4.20636032e-03  4.20677802e-03
 -1.40399823e-03  2.67376262e-03  4.15059133e-03 -8.53536942e-04
  4.09730617e-03 -4.61114757e-03  2.81381537e-03  4.06840025e-03
 -2.21697940e-03  2.47436436e-03 -3.31063266e-03 -2.14591250e-03
 -2.03807699e-03 -4.26412933e-03 -1.11343696e-04  5.39611443e-04
  4.11271071e-03 -3.50002461e-04  4.34909156e-03 -3.14325118e-03
 -2.66004843e-03 -4.72667301e-03 -6.80707395e-04 -6.37957186e-04
  9.92335379e-04  5.06919576e-04 -2.30332976e-03  4.67868708e-03
  2.58262083e-03 -4.42665629e-03 -4.33384068e-03  2.00493122e-03
  3.40585801e-04  4.51424671e-03 -2.24930048e-03 -4.74246824e-03
 -4.26648092e-03 -2.76884600e-03 -3.83922178e-03 -3.57130519e-03
  3.80852376e-04  2.10830034e-03  3.99174780e-04 -2.54857983e-03
 -1.73696945e-03 -2.79853819e-03 -3.59335751e-03  1.93190842e-03
  4.62259306e-03  1.84291916e-03  3.57032637e-03  2.30754865e-03
 -4.00394667e-03  1.34957826e-03 -4.16501053e-03 -4.11755871e-03
 -3.26831010e-03  1.22129067e-03 -6.88223168e-04  2.95645348e-03
 -1.37853972e-03 -2.04168772e-03 -2.96842307e-03  8.23099457e-04
  2.57009082e-03  1.67869462e-03  8.10760757e-05 -4.97947959e-03
  1.55272824e-03 -3.07091884e-03 -2.56623537e-03 -1.66870246e-03
 -1.00509136e-03  5.10989048e-05 -1.95662351e-03  1.54431339e-03
 -1.09352660e-03  7.61516392e-04 -8.73727666e-04  6.75187970e-04]
Note: indexing the model directly, as in model1['football'], raises a DeprecationWarning in gensim 3.x and was removed in gensim 4.0.0; access vectors through model1.wv['football'] instead.

Step 6 - Save the model that we have created

model1.save('model1.bin')

Step 7 - Load the model

new_model1 = word2vec.Word2Vec.load('model1.bin')
print(new_model1)

Word2Vec(vocab=8, size=100, alpha=0.025)