Sentiment Analysis using Word2Vec and GloVe Embeddings
Word2Vec and GloVe Embeddings for Natural Language Processing
Word embedding is a language modelling technique that represents words or phrases as vectors. Words with similar meanings are mapped to similar representations, so the embedding captures the relationships between words. This can be achieved by various methods such as co-occurrence matrices, probabilistic modelling and neural networks.
Word2Vec and GloVe are popular word embeddings, and BERT is one of the more recent embedding models.
Word embeddings are broadly categorized into two types:
Frequency based embeddings — Count vector, Co-occurrence vector, HashingVectorizer, TF-IDF
Pre-trained word embeddings — Word2Vec, GloVe, BERT, fastText
In this article we discuss Word2Vec and GloVe embeddings and their use in natural language processing problems.
Pre-Trained Word Embeddings
Pre-trained word embeddings are trained with neural networks on large corpora to produce word vectors. The word vectors capture context, co-occurrence of words, and semantic and syntactic similarity.
Word2Vec, GloVe, ELMo, fastText and BERT belong to this type of embedding.
Word2Vec
Word2Vec uses shallow neural networks to learn the embeddings and is one of the most popular word embedding techniques. It was created by a team led by Tomas Mikolov and has two variants, CBOW and Skip-gram.
CBOW — Continuous bag of words
The CBOW model predicts a target word from its surrounding context words, and the word vectors are learnt as a by-product of this prediction task. Consider the sentence “The Boy rides Bike”. In CBOW the word “Bike” is predicted from the context “The Boy rides”. The vectors for the words “The”, “Boy” and “rides” are passed to a neural network, which produces a prediction for the target word. This prediction is compared with the actual target word and the error is propagated back, updating the vectors. In this way the vector for the word “Bike” is learnt.
This picture shows the CBOW model. It is a shallow neural network with a single hidden layer. The input layer takes the context words x1k to xCk, each represented as a one-hot vector of dimension V, the vocabulary size. The hidden layer has N nodes. The output yj is a vector of dimension V. The weight matrices W (of size V × N, between the input and hidden layers) and W' (of size N × V, between the hidden and output layers) are learnt during training, and the word vectors are obtained from these weights.
In the picture above C context words are used.
Skip-Gram
The Skip-gram model does the reverse of CBOW: it takes the target word as input and tries to predict the C context words around it.
Skip-gram works well for small datasets and rare words, while CBOW trains faster and is a good choice when the dataset is large.
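As a quick aside (not part of the pipeline used later in this article), both variants can be trained with gensim's Word2Vec class, where the sg parameter selects CBOW (sg=0) or Skip-gram (sg=1). The toy corpus and hyperparameters below are purely illustrative and assume gensim 4.x.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "boy", "rides", "a", "bike"],
    ["the", "girl", "rides", "a", "cycle"],
]

# CBOW: predict the target word from its context (sg=0)
cbow_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

# Skip-gram: predict the context words from the target word (sg=1)
skipgram_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

# Each trained model maps a word to a 100-dimensional vector
print(cbow_model.wv["bike"].shape)  # (100,)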
GloVe (Global Vectors for Word Representation)
The main idea of GloVe is to capture the meaning of a word from the overall structure of the entire corpus and to derive the relationships between words from global statistics. Many unsupervised algorithms are trained using word frequencies and co-occurrence counts of words. GloVe likewise generates word embeddings by training a model on the global co-occurrence counts of words, using a weighted least-squares loss. The word embeddings generated by such a model preserve word relationships and similarities. A co-occurrence matrix for a given sentence tells us how often a given pair of words appears together; each element in the matrix is the count of the corresponding pair of words occurring together.
Consider the sentence “The small red book and the big red book are on the table” as an example. Its co-occurrence matrix looks like the one below.
Let us now try to derive the relationships between the words from the table above.
For illustration, let us consider the word “red”. The conditional probabilities of “red” with respect to the words it is paired with in the sentence are given below.
P(red | small) = 1
P(red | book) = 0.5
If the ratio of these probabilities is computed as shown below, it can be seen that it is greater than 1.
P(red | small) / P(red | book) = 1 / 0.5 = 2
As the ratio of probabilities is greater than 1, it can be inferred that the word “red” is more relevant to the word “small” than to “book”. Conversely, if the ratio were close to 1, it would mean that the words “small” and “book” are about equally relevant to the word “red”.
Consider another example sentence, “I eat bread, I love bread and I love biscuits”. The co-occurrence matrix for the sentence is given below.
For illustration, let us consider the word “bread”. The conditional probabilities of “bread” with respect to the words it is paired with in the sentence are given below.
P(bread | eat) = 1
P(bread | love) = 0.5
If the ratio of these probabilities is computed as shown below, it can be seen that it is greater than 1.
P(bread | eat) / P(bread | love) = 1 / 0.5 = 2
As the ratio of probabilities is greater than 1, it can be inferred that the word “bread” is more relevant to the word “eat” than to “love”. Conversely, if the ratio were close to 1, it would mean that the words “eat” and “love” are about equally relevant to the word “bread”.
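As a rough sketch of this counting (not taken from the original article), the snippet below builds a window-1 co-occurrence count for the sentence and computes the probability ratio. The exact probabilities depend on the window size and on how the sentence is tokenised, so they may differ from the values quoted above, but the ratio still comes out greater than 1.

from collections import Counter, defaultdict

tokens = "i eat bread i love bread and i love biscuits".split()

window = 1  # assumed context window for this sketch
cooc = defaultdict(Counter)
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if i != j:
            cooc[w][tokens[j]] += 1

def p(word, context):
    # P(word | context) = X_context,word / X_context
    total = sum(cooc[context].values())
    return cooc[context][word] / total if total else 0.0

# A ratio greater than 1 indicates "bread" is more strongly associated with "eat" than with "love"
print(p("bread", "eat"), p("bread", "love"), p("bread", "eat") / p("bread", "love"))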
Thus global statistics allow us to build relationships between words, and this is the main idea behind the GloVe embedding. Let us now look at the equations that support it.
Let the co-occurrence matrix be denoted by X, where each element Xij represents the number of times word j occurs in the context of word i. With this, the conditional probability below can be derived.
Pij = P(j | i) = Xij / Xi, where Xi = Σk Xik is the total number of words that appear in the context of word i.
Pij is the probability that the word with index j occurs in the context of the word with index i.
As seen from the examples above, ratios of co-occurrence probabilities capture how relevant words are to each other. So consider a function F as below.
F (wi, wj, w’k) = Pik / Pjk
Here F depends on the two word vectors wi and wj and on the context word vector w'k.
Starting from this, the loss function for the GloVe model can be derived, as shown in the paper “GloVe: Global Vectors for Word Representation” by Jeffrey Pennington et al. The loss function is given below.
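In the notation used above, the objective from the paper can be written as:

J = Σij f(Xij) (wi · w'j + bi + b'j - log Xij)^2

The sum runs over all word pairs (i, j) in the vocabulary, wi is the word vector and w'j is the context word vector.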
Here f is a weighting function on the co-occurrence counts, and bi and b'j are bias terms for the word and the context word. For a detailed derivation of the loss function, please refer to the paper above.
Sentiment Analysis using Word2Vec Embeddings
We now apply Word2Vec embeddings to sentiment analysis of the Amazon Musical Instruments Reviews dataset. The dataset is taken from Kaggle and has a CC0 license.
It contains reviews of musical instruments sold on Amazon, with the review text, a summary and an overall rating given by various customers.
In this problem we try to judge the sentiment of the customers by looking at the review text.
We use Word2Vec embeddings to vectorize the words in the text and treat this as a text classification problem on the review text.
First we import the required dependencies.
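A minimal set of imports for the steps that follow, assuming TensorFlow 2.x (tf.keras), gensim and scikit-learn are installed; the original notebook may have used slightly different modules.

import numpy as np
import pandas as pd

from gensim.models import KeyedVectors

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

from sklearn.model_selection import train_test_split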
Read the data into a dataframe
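A sketch of this step; the file name below is a placeholder for the CSV downloaded from Kaggle.

# Read the reviews into a pandas dataframe
df = pd.read_csv("Musical_instruments_reviews.csv")
print(df.shape)
print(df.columns)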
After this we do some basic pre-processing, such as checking whether any columns contain empty values and replacing NA values with an empty string.
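A sketch of that step, assuming the text columns are named reviewText and summary as in the Kaggle dataset.

# Check for missing values and replace them with empty strings
print(df.isnull().sum())
df["reviewText"] = df["reviewText"].fillna("")
df["summary"] = df["summary"].fillna("")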
Convert the overall ratings into a sentiment label, good or bad
Take the reviews into an array
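A sketch covering this step and the previous one, assuming the rating column is named overall and treating ratings of 4 and above as good; the threshold is a modelling choice, not something fixed by the article.

# Ratings >= 4 are labelled good (1), the rest bad (0); the threshold is an assumption
df["sentiment"] = (df["overall"] >= 4).astype(int)

# Collect the review texts and labels into arrays
reviews = df["reviewText"].values
labels = df["sentiment"].values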
Tokenize the text using the Keras tokenizer
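Using the Keras Tokenizer to turn each review into a padded sequence of word indices; the vocabulary size and sequence length below are illustrative choices.

vocab_size = 20000   # keep the 20,000 most frequent words (assumed)
max_len = 100        # pad or truncate every review to 100 tokens (assumed)

tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(reviews)

sequences = tokenizer.texts_to_sequences(reviews)
padded = pad_sequences(sequences, maxlen=max_len, padding="post")

word_index = tokenizer.word_index  # maps each word to an integer index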
Split into training and validation datasets
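An 80/20 split with scikit-learn (the split ratio is an assumption):

X_train, X_val, y_train, y_val = train_test_split(
    padded, labels, test_size=0.2, random_state=42)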
Now we need the pre-trained vectors for the Word2Vec embeddings. There are different ways to obtain them.
One way is to download the pre-trained vectors from the web using wget.
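For example, the 300-dimensional Google News vectors are often fetched with a command like the one below, run from a shell or a notebook cell prefixed with “!”. The mirror URL is a commonly referenced one and should be verified before use.

wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz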
Now we use load_word2vec_format from gensim's KeyedVectors to load the pre-trained vectors
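A sketch of loading the downloaded binary file with gensim:

from gensim.models import KeyedVectors

# Load the 300-dimensional Google News vectors in binary word2vec format
word_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

print(word_vectors["music"].shape)  # (300,)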
Another way is to load the vectors using datapath from gensim
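datapath resolves a file name inside gensim's bundled test-data directory, which contains a small word2vec sample that is convenient for experimentation; the file name below follows the example in gensim's documentation and is not the full Google News model.

from gensim.test.utils import datapath
from gensim.models import KeyedVectors

# Small word2vec sample shipped with gensim, resolved via datapath
sample_vectors = KeyedVectors.load_word2vec_format(
    datapath("euclidean_vectors.bin"), binary=True)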
We can also get the pre-trained vectors using the gensim downloader API
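The gensim downloader fetches and caches the same pre-trained model by name (a large download):

import gensim.downloader as api

# Downloads and caches the 300-dimensional Google News vectors
word_vectors = api.load("word2vec-google-news-300")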
After this we create an embedding matrix for the tokenized review vocabulary
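A sketch of building the matrix: for every word in the tokenizer's vocabulary we copy its pre-trained vector if one exists, and otherwise leave the row as zeros.

embedding_dim = 300  # dimension of the Google News vectors
num_words = min(vocab_size, len(word_index) + 1)

# Row i of the matrix holds the pre-trained vector of the word with index i
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i < num_words and word in word_vectors:
        embedding_matrix[i] = word_vectors[word]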
Next we create a model using the Word2Vec embeddings
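The article does not pin down a specific architecture, so the following is one reasonable sketch: an Embedding layer initialised with the Word2Vec matrix and kept frozen, followed by an LSTM and a sigmoid output for the binary sentiment label.

model = Sequential([
    Embedding(input_dim=num_words,
              output_dim=embedding_dim,
              weights=[embedding_matrix],
              input_length=max_len,
              trainable=False),   # keep the pre-trained vectors fixed
    LSTM(64),
    Dense(32, activation="relu"),
    Dense(1, activation="sigmoid"),  # binary sentiment output
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()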
We use the fit function to train this model.
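Training for a few epochs; the epoch count and batch size are illustrative choices.

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=5, batch_size=64)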
Sentiment Analysis using GloVe Embeddings
In a similar way, we can use pre-trained GloVe embeddings for the sentiment analysis
The initial steps of preprocessing remain the same
Create an embedding matrix with the pre-trained vectors from the GloVe embeddings
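A sketch of this step, assuming the 100-dimensional glove.6B.100d.txt file from the Stanford GloVe release has been downloaded separately. GloVe files are plain text, with a word followed by its vector components on each line.

glove_dim = 100
glove_index = {}
# glove.6B.100d.txt comes from the Stanford GloVe release (downloaded separately)
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove_index[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Same construction as before, now filled with GloVe vectors
glove_matrix = np.zeros((num_words, glove_dim))
for word, i in word_index.items():
    if i < num_words and word in glove_index:
        glove_matrix[i] = glove_index[word]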
Create the model with the GloVe embeddings (a combined sketch of this and the training step follows below)
We use the Keras fit function to train the model
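A combined sketch of the last two steps: the same classifier as before, with the Embedding layer now initialised from the GloVe matrix.

glove_model = Sequential([
    Embedding(input_dim=num_words,
              output_dim=glove_dim,
              weights=[glove_matrix],
              input_length=max_len,
              trainable=False),
    LSTM(64),
    Dense(32, activation="relu"),
    Dense(1, activation="sigmoid"),
])

glove_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = glove_model.fit(X_train, y_train,
                          validation_data=(X_val, y_val),
                          epochs=5, batch_size=64)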
Conclusion
Word2Vec embeddings are learnt from the context and co-occurrence of words, and the semantic and syntactic relationships are preserved in the vectors. For example, related word pairs such as man and woman, king and queen, or sun and day get similar vectors. GloVe embeddings, on the other hand, are based on the overall co-occurrence of words in the corpus.
Word2Vec captures co-occurrence one local window at a time, whereas GloVe is built on co-occurrence statistics aggregated over the whole corpus.
Authors:
Srinivas Chakravarthy — srinivas.yeeda@gmail.com
Chandrashekar B N — chandru4ni@gmail.com