Sentiment Analysis using Word2Vec and GloVe Embeddings
Word2Vec and GloVe Embeddings for Natural Language Processing
Word embedding is a language modelling technique that represents words or phrases as vectors. Words with similar meanings are mapped to similar representations, so the embedding captures the relationships between words. This can be achieved by various methods such as co-occurrence matrices, probabilistic modelling and neural networks.
Word2Vec and GloVe are popular word embeddings, and BERT is one of the more recent embedding models.
Word embeddings are broadly categorized into two types:
Frequency based embeddings — Count vector, Co-occurrence vector, HashingVectorizer, TF-IDF
Pre-trained word embeddings — Word2Vec, GloVe, BERT, fastText
In this article we discuss Word2Vec and GloVe embeddings and their use in natural language processing problems.
Pre-Trained Word Embeddings
Pre-trained word embeddings are trained with neural networks on large corpora to produce word vectors. The word vectors capture context, co-occurrence of words, and semantic and syntactic similarity.
Word2Vec, GloVe, ELMo, fastText and BERT belong to this type of embedding.
Word2Vec
Word2Vec uses shallow neural networks to learn the embeddings and is one of the most popular word embedding techniques. It was created by a team led by Tomas Mikolov and has two variants, CBOW and Skip-gram.
CBOW — Continuous bag of words
The CBOW model predicts a target word from its surrounding context words, and the word vectors are learnt as a by-product of this prediction task. Consider the sentence “The Boy rides Bike”. In CBOW the word “Bike” is predicted from the context “The Boy rides”. The vectors for the words “The”, “Boy” and “rides” are passed to a neural network, which produces a prediction for the target word. This prediction is compared with the actual target word and the error is propagated back, updating the vectors. In this way the vector for the word “Bike” is learnt.
This picture shows the CBOW model. It is a shallow neural network with a single hidden layer. The input layer takes the context words x1k to xCk, each represented as a one-hot vector of dimension V, the vocabulary size. The hidden layer has N nodes. The output yj is a vector of dimension V. The weight matrices W (of size V × N, between the input and hidden layers) and W' (of size N × V, between the hidden and output layers) are learnt during training, and the word vectors are obtained from these weights.
In the picture above C context words are used.
Skip-Gram
The Skip-gram model does the reverse of CBOW: it takes the target word as input and tries to predict the C context words around it.
Skip-gram works well for small datasets and rare words, while CBOW trains faster and is a good choice when the dataset is large.
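As a quick aside (not part of the pipeline used later in this article), both variants can be trained with gensim's Word2Vec class, where the sg parameter selects CBOW (sg=0) or Skip-gram (sg=1). The toy corpus and hyperparameters below are purely illustrative and assume gensim 4.x.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "boy", "rides", "a", "bike"],
    ["the", "girl", "rides", "a", "cycle"],
]

# CBOW: predict the target word from its context (sg=0)
cbow_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

# Skip-gram: predict the context words from the target word (sg=1)
skipgram_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

# Each trained model maps a word to a 100-dimensional vector
print(cbow_model.wv["bike"].shape)  # (100,)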
GloVe (Global Vectors for Word Representation)
The main idea of GloVe is to capture the meaning of a word from the overall structure of the entire corpus and to derive the relationships between words from global statistics. Many unsupervised algorithms are trained using word frequencies and co-occurrence counts of words. GloVe likewise generates word embeddings by training a model on the global co-occurrence counts of words, using a weighted least-squares loss. The word embeddings generated by such a model preserve word relationships and similarities. A co-occurrence matrix for a given sentence tells us how often a given pair of words appears together; each element in the matrix is the count of the corresponding pair of words occurring together.
Consider the sentence “The small red book and the big red book are on the table” as an example. Its co-occurrence matrix looks like the one below.
Let us now try to derive the relationships between the words from the table above.
For illustration, let us consider the word “red”. The conditional probabilities of “red” with respect to the words it is paired with in the sentence are given below.
P(red | small) = 1
P(red | book) = 0.5
If the ratio of these probabilities is computed as shown below, it can be seen that it is greater than 1.
P(red | small) / P(red | book) = 1 / 0.5 = 2
As the ratio of probabilities is greater than 1, it can be inferred that the word “red” is more relevant to the word “small” than to “book”. Conversely, if the ratio were close to 1, it would mean that the words “small” and “book” are about equally relevant to the word “red”.
Consider another example sentence, “I eat bread, I love bread and I love biscuits”. The co-occurrence matrix for the sentence is given below.
For illustration, let us consider the word “bread”. The conditional probabilities of “bread” with respect to the words it is paired with in the sentence are given below.
P(bread | eat) = 1
P(bread | love) = 0.5
If the ratio of these probabilities is computed as shown below, it can be seen that it is greater than 1.
P(bread | eat) / P(bread | love) = 1 / 0.5 = 2
As the ratio of probabilities is greater than 1, it can be inferred that the word “bread” is more relevant to the word “eat” than to “love”. Conversely, if the ratio were close to 1, it would mean that the words “eat” and “love” are about equally relevant to the word “bread”.
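As a rough sketch of this counting (not taken from the original article), the snippet below builds a window-1 co-occurrence count for the sentence and computes the probability ratio. The exact probabilities depend on the window size and on how the sentence is tokenised, so they may differ from the values quoted above, but the ratio still comes out greater than 1.

from collections import Counter, defaultdict

tokens = "i eat bread i love bread and i love biscuits".split()

window = 1  # assumed context window for this sketch
cooc = defaultdict(Counter)
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if i != j:
            cooc[w][tokens[j]] += 1

def p(word, context):
    # P(word | context) = X_context,word / X_context
    total = sum(cooc[context].values())
    return cooc[context][word] / total if total else 0.0

# A ratio greater than 1 indicates "bread" is more strongly associated with "eat" than with "love"
print(p("bread", "eat"), p("bread", "love"), p("bread", "eat") / p("bread", "love"))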
Thus global statistics allow us to build relationships between words, and this is the main idea behind the GloVe embedding. Let us now look at the equations that support it.
Let the co-occurrence matrix be denoted by X, where each element Xij represents the number of times word j occurs in the context of word i. With this, the conditional probability below can be derived.
Pij = P(j | i) = Xij / Xi, where Xi = Σk Xik is the total number of words that appear in the context of word i.
Pij is the probability that the word with index j occurs in the context of the word with index i.
As seen from the examples above, ratios of co-occurrence probabilities capture how relevant words are to each other. So consider a function F as below.
F (wi, wj, w’k) = Pik / Pjk
Here F depends on the two word vectors wi and wj and on the context word vector w'k.
Starting from this, the loss function for the GloVe model can be derived, as shown in the paper “GloVe: Global Vectors for Word Representation” by Jeffrey Pennington et al. The loss function is given below.
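In the notation used above, the objective from the paper can be written as:

J = Σij f(Xij) (wi · w'j + bi + b'j - log Xij)^2

The sum runs over all word pairs (i, j) in the vocabulary, wi is the word vector and w'j is the context word vector.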
Here f is a weighting function on the co-occurrence counts, and bi and b'j are bias terms for the word and the context word. For a detailed derivation of the loss function, please refer to the paper above.
Sentiment Analysis using Word2Vec Embeddings
We now apply Word2Vec embeddings to sentiment analysis of the Amazon Musical Instruments Reviews dataset. The dataset is taken from Kaggle and has a CC0 license.
It contains reviews of musical instruments sold on Amazon, with the review text, a summary and an overall rating given by various customers.
In this problem we try to judge the sentiment of the customers by looking at the review text.
We use Word2Vec embeddings to vectorize the words in the text and treat this as a text classification problem on the review text.
First we import the required dependencies.
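A minimal set of imports for the steps that follow, assuming TensorFlow 2.x (tf.keras), gensim and scikit-learn are installed; the original notebook may have used slightly different modules.

import numpy as np
import pandas as pd

from gensim.models import KeyedVectors

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

from sklearn.model_selection import train_test_split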
Read the data into a dataframe
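A sketch of this step; the file name below is a placeholder for the CSV downloaded from Kaggle.

# Read the reviews into a pandas dataframe
df = pd.read_csv("Musical_instruments_reviews.csv")
print(df.shape)
print(df.columns)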
After this we do some basic pre-processing, such as checking whether any columns contain empty values and replacing NA values with an empty string.
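A sketch of that step, assuming the text columns are named reviewText and summary as in the Kaggle dataset.

# Check for missing values and replace them with empty strings
print(df.isnull().sum())
df["reviewText"] = df["reviewText"].fillna("")
df["summary"] = df["summary"].fillna("")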
Convert the overall ratings into a sentiment label, good or bad
Take the reviews into an array
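A sketch covering this step and the previous one, assuming the rating column is named overall and treating ratings of 4 and above as good; the threshold is a modelling choice, not something fixed by the article.

# Ratings >= 4 are labelled good (1), the rest bad (0); the threshold is an assumption
df["sentiment"] = (df["overall"] >= 4).astype(int)

# Collect the review texts and labels into arrays
reviews = df["reviewText"].values
labels = df["sentiment"].values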
Tokenize the text using the Keras tokenizer
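Using the Keras Tokenizer to turn each review into a padded sequence of word indices; the vocabulary size and sequence length below are illustrative choices.

vocab_size = 20000   # keep the 20,000 most frequent words (assumed)
max_len = 100        # pad or truncate every review to 100 tokens (assumed)

tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(reviews)

sequences = tokenizer.texts_to_sequences(reviews)
padded = pad_sequences(sequences, maxlen=max_len, padding="post")

word_index = tokenizer.word_index  # maps each word to an integer index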
Split into training and validation datasets
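An 80/20 split with scikit-learn (the split ratio is an assumption):

X_train, X_val, y_train, y_val = train_test_split(
    padded, labels, test_size=0.2, random_state=42)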
Now we need the pre-trained vectors for the Word2Vec embeddings. There are different ways to obtain them.
One way is to download the pre-trained vectors from the web using wget.
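For example, the 300-dimensional Google News vectors are often fetched with a command like the one below, run from a shell or a notebook cell prefixed with “!”. The mirror URL is a commonly referenced one and should be verified before use.

wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz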
Now we use load_word2vec_format from gensim's KeyedVectors to load the pre-trained vectors
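A sketch of loading the downloaded binary file with gensim:

from gensim.models import KeyedVectors

# Load the 300-dimensional Google News vectors in binary word2vec format
word_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

print(word_vectors["music"].shape)  # (300,)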
Another way is to load the vectors using datapath from gensim
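datapath resolves a file name inside gensim's bundled test-data directory, which contains a small word2vec sample that is convenient for experimentation; the file name below follows the example in gensim's documentation and is not the full Google News model.

from gensim.test.utils import datapath
from gensim.models import KeyedVectors

# Small word2vec sample shipped with gensim, resolved via datapath
sample_vectors = KeyedVectors.load_word2vec_format(
    datapath("euclidean_vectors.bin"), binary=True)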
We can also get the pre-trained vectors using the gensim downloader API
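The gensim downloader fetches and caches the same pre-trained model by name (a large download):

import gensim.downloader as api

# Downloads and caches the 300-dimensional Google News vectors
word_vectors = api.load("word2vec-google-news-300")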
After this we create an embedding matrix for the tokenized review vocabulary
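A sketch of building the matrix: for every word in the tokenizer's vocabulary we copy its pre-trained vector if one exists, and otherwise leave the row as zeros.

embedding_dim = 300  # dimension of the Google News vectors
num_words = min(vocab_size, len(word_index) + 1)

# Row i of the matrix holds the pre-trained vector of the word with index i
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i < num_words and word in word_vectors:
        embedding_matrix[i] = word_vectors[word]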
Next we create a model using the Word2Vec embeddings
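The article does not pin down a specific architecture, so the following is one reasonable sketch: an Embedding layer initialised with the Word2Vec matrix and kept frozen, followed by an LSTM and a sigmoid output for the binary sentiment label.

model = Sequential([
    Embedding(input_dim=num_words,
              output_dim=embedding_dim,
              weights=[embedding_matrix],
              input_length=max_len,
              trainable=False),   # keep the pre-trained vectors fixed
    LSTM(64),
    Dense(32, activation="relu"),
    Dense(1, activation="sigmoid"),  # binary sentiment output
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()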
We use the fit function to train this model.
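Training for a few epochs; the epoch count and batch size are illustrative choices.

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=5, batch_size=64)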
Sentiment Analysis using GloVe Embeddings
In a similar way, we can use pre-trained GloVe embeddings for the sentiment analysis
The initial steps of preprocessing remain the same
Create an embedding matrix with the pre-trained vectors from the GloVe embeddings
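A sketch of this step, assuming the 100-dimensional glove.6B.100d.txt file from the Stanford GloVe release has been downloaded separately. GloVe files are plain text, with a word followed by its vector components on each line.

glove_dim = 100
glove_index = {}
# glove.6B.100d.txt comes from the Stanford GloVe release (downloaded separately)
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove_index[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Same construction as before, now filled with GloVe vectors
glove_matrix = np.zeros((num_words, glove_dim))
for word, i in word_index.items():
    if i < num_words and word in glove_index:
        glove_matrix[i] = glove_index[word]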
Create the model with the GloVe embeddings (a combined sketch of this and the training step follows below)
We use the Keras fit function to train the model
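A combined sketch of the last two steps: the same classifier as before, with the Embedding layer now initialised from the GloVe matrix.

glove_model = Sequential([
    Embedding(input_dim=num_words,
              output_dim=glove_dim,
              weights=[glove_matrix],
              input_length=max_len,
              trainable=False),
    LSTM(64),
    Dense(32, activation="relu"),
    Dense(1, activation="sigmoid"),
])

glove_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = glove_model.fit(X_train, y_train,
                          validation_data=(X_val, y_val),
                          epochs=5, batch_size=64)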
Conclusion
Word2Vec embeddings are learnt from the context and co-occurrence of words, and the semantic and syntactic relationships are preserved in the vectors. For example, related word pairs such as man and woman, king and queen, or sun and day get similar vectors. GloVe embeddings, on the other hand, are based on the overall co-occurrence of words in the corpus.
Word2Vec captures co-occurrence one local window at a time, whereas GloVe is built on co-occurrence statistics aggregated over the whole corpus.
Authors:
Srinivas Chakravarthy — srinivas.yeeda@gmail.com
Chandrashekar B N — chandru4ni@gmail.com