Getting Started with Word2Vec and GloVe in Python

We have talked about “Getting Started with Word2Vec and GloVe”, and how to use them in a pure Python environment. Here we will show how to use word2vec and GloVe in Python.

Word2Vec in Python

The great topic modeling tool gensim implements word2vec in Python. Install gensim first (e.g. pip install gensim), then use word2vec like this:

In [1]: from gensim.models import word2vec
 
In [2]: import logging
 
In [3]: logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
 
In [4]: sentences = word2vec.Text8Corpus('text8')
 
In [5]: model = word2vec.Word2Vec(sentences, size=200)
2015-02-24 11:14:15,428 : INFO : collecting all words and their counts
2015-02-24 11:14:15,429 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-02-24 11:14:22,863 : INFO : PROGRESS: at sentence #10000, processed 10000000 words and 189074 word types
2015-02-24 11:14:28,218 : INFO : collected 253854 word types from a corpus of 17005207 words and 17006 sentences
2015-02-24 11:14:28,362 : INFO : total 71290 word types after removing those with count<5
2015-02-24 11:14:28,362 : INFO : constructing a huffman tree from 71290 words
2015-02-24 11:14:32,431 : INFO : built huffman tree with maximum node depth 22
2015-02-24 11:14:32,509 : INFO : resetting layer weights
2015-02-24 11:14:34,279 : INFO : training model with 1 workers on 71290 vocabulary and 200 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-02-24 11:14:35,550 : INFO : PROGRESS: at 0.59% words, alpha 0.02500, 77772 words/s
2015-02-24 11:14:36,581 : INFO : PROGRESS: at 1.18% words, alpha 0.02485, 85486 words/s
2015-02-24 11:14:37,661 : INFO : PROGRESS: at 1.77% words, alpha 0.02471, 87258 words/s
...
2015-02-24 11:17:56,434 : INFO : PROGRESS: at 99.38% words, alpha 0.00030, 82190 words/s
2015-02-24 11:17:57,903 : INFO : PROGRESS: at 99.97% words, alpha 0.00016, 82081 words/s
2015-02-24 11:17:57,975 : INFO : training on 16718844 words took 203.7s, 82078 words/s
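
For reference, here is a sketch of the same training call with a few more of the constructor's hyperparameters spelled out. The values are illustrative, and the parameter names match the gensim 0.x API used in this session (newer gensim releases renamed size to vector_size, for example):

model = word2vec.Word2Vec(
    sentences,
    size=200,     # dimensionality of the word vectors
    window=5,     # maximum distance between the current and predicted word
    min_count=5,  # ignore words that appear fewer than 5 times
    workers=4,    # number of training threads
    sg=1,         # 1 = skip-gram (used above, per the log), 0 = CBOW
    hs=1,         # 1 = hierarchical softmax (used above), 0 = negative sampling
)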
 
In [6]: model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
2015-02-24 11:18:38,021 : INFO : precomputing L2-norms of word weight vectors
Out[6]: [(u'wenceslaus', 0.5203313827514648)]
 
In [7]: model.most_similar(positive=['woman', 'king'], negative=['man'], topn=2) 
Out[7]: [(u'wenceslaus', 0.5203313827514648), (u'queen', 0.508660614490509)]
 
In [8]: model.most_similar(['man'])
Out[8]: 
[(u'woman', 0.5686948895454407),
 (u'girl', 0.4957364797592163),
 (u'young', 0.4457539916038513),
 (u'luckiest', 0.4420626759529114),
 (u'serpent', 0.42716869711875916),
 (u'girls', 0.42680859565734863),
 (u'smokes', 0.4265017509460449),
 (u'creature', 0.4227582812309265),
 (u'robot', 0.417464017868042),
 (u'mortal', 0.41728296875953674)]
 
In [9]: model.save('text8.model')
2015-02-24 11:19:26,059 : INFO : saving Word2Vec object under text8.model, separately None
2015-02-24 11:19:26,060 : INFO : not storing attribute syn0norm
2015-02-24 11:19:26,060 : INFO : storing numpy array 'syn0' to text8.model.syn0.npy
2015-02-24 11:19:26,742 : INFO : storing numpy array 'syn1' to text8.model.syn1.npy
 
In [10]: model.save_word2vec_format('text.model.bin', binary=True)
2015-02-24 11:19:52,341 : INFO : storing 71290x200 projection weights into text.model.bin
 
In [12]: model1 = word2vec.Word2Vec.load_word2vec_format('text.model.bin', binary=True)
2015-02-24 11:22:08,185 : INFO : loading projection weights from text.model.bin
2015-02-24 11:22:10,322 : INFO : loaded (71290, 200) matrix from text.model.bin
2015-02-24 11:22:10,322 : INFO : precomputing L2-norms of word weight vectors
 
In [13]: model1.most_similar(['girl', 'father'], ['boy'], topn=3)
Out[13]: 
[(u'mother', 0.6219865083694458),
 (u'grandmother', 0.556104838848114),
 (u'wife', 0.5440170764923096)]
 
In [14]: more_examples = ["he is she", "big bigger bad", "going went being"] 
 
In [15]: for example in more_examples:
   ....:     a, b, x = example.split()
   ....:     predicted = model.most_similar([x, b], [a])[0][0]
   ....:     print "'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted)
   ....:     
'he' is to 'is' as 'she' is to 'was'
'big' is to 'bigger' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'were'
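
The session above was run under Python 2 (note the print statement and the u'...' strings in the output). Under Python 3, the analogy loop would read:

for example in more_examples:
    a, b, x = example.split()
    # most_similar returns (word, similarity) pairs; take the top word
    predicted = model.most_similar(positive=[x, b], negative=[a])[0][0]
    print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))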

With the gensim word2vec module, you can also load models trained by the original C/C++ word2vec package. Note that a model loaded this way can be queried but not trained further, since the format stores only the word vectors:

In [16]: model_org = word2vec.Word2Vec.load_word2vec_format('vectors.bin', binary=True)
2015-02-24 11:35:01,814 : INFO : loading projection weights from vectors.bin
2015-02-24 11:35:03,756 : INFO : loaded (71291, 200) matrix from vectors.bin
2015-02-24 11:35:03,757 : INFO : precomputing L2-norms of word weight vectors
 
In [17]: model_org.most_similar('frog')
Out[17]: 
[(u'lizard', 0.5382058024406433),
 (u'kermit', 0.522418737411499),
 (u'squirrel', 0.502967357635498),
 (u'toad', 0.5023295283317566),
 (u'poodle', 0.49445223808288574),
 (u'gigas', 0.4928397536277771),
 (u'moth', 0.49125388264656067),
 (u'frogs', 0.4899308979511261),
 (u'shrew', 0.48939722776412964),
 (u'cute', 0.4872947931289673)]
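
A note for readers on newer gensim versions: since gensim 1.0, the word2vec-format load/save helpers live on KeyedVectors rather than on the model class. A minimal sketch, assuming a current gensim release:

from gensim.models import KeyedVectors

# Load vectors in the original word2vec binary format; the result can be
# queried (most_similar, similarity, ...) but not trained further.
wv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)
print(wv.most_similar('frog'))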

Gensim also provides further material on using word2vec in Python; see the gensim documentation and tutorials for more.

GloVe in Python
glove-python is a Python implementation of GloVe:

Installation

Clone this repository.
Make sure you have a compiler that supports OpenMP and C++11. On OSX, you’ll need to install gcc from brew or ports. The setup script uses gcc-4.9, but you can probably change that.
Make sure you have Cython installed.
Run python setup.py develop to install in development mode; python setup.py install to install normally.
from glove import Glove, Corpus should get you started.

Usage

Producing the embeddings is a two-step process: creating a co-occurrence matrix from the corpus, and then using it to produce the embeddings. The Corpus class helps in constructing a corpus from an iterable of tokens; the Glove class trains the embeddings (with a sklearn-esque API).

There is also support for rudimentary paragraph vectors. A paragraph vector (in this case) is an embedding of a paragraph (a multi-word piece of text) in the word vector space in such a way that the paragraph representation is close to the words it contains, adjusted for the frequency of words in the corpus (in a manner similar to tf-idf weighting). These can be obtained after having trained word embeddings by calling the transform_paragraph method on the trained model; a short sketch follows the session below.

After installing glove-python, you can use it like this:

In [1]: import itertools
 
In [2]: from gensim.models.word2vec import Text8Corpus
 
In [3]: from glove import Corpus, Glove
 
In [4]: sentences = list(itertools.islice(Text8Corpus('text8'),None))
 
In [5]: corpus = Corpus()
 
In [6]: corpus.fit(sentences, window=10)
 
In [7]: glove = Glove(no_components=100, learning_rate=0.05)
 
In [8]: glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
Performing 30 training epochs with 4 threads
Epoch 0
Epoch 1
Epoch 2
...
Epoch 27
Epoch 28
Epoch 29
 
In [9]: glove.add_dictionary(corpus.dictionary)
 
In [10]: glove.most_similar('man')
Out[10]: 
[(u'terc', 0.82866443231836828),
 (u'woman', 0.81587362007162523),
 (u'girl', 0.79950702967210407),
 (u'young', 0.78944050406331179)]
 
In [12]: glove.most_similar('man', number=10)
Out[12]: 
[(u'terc', 0.82866443231836828),
 (u'woman', 0.81587362007162523),
 (u'girl', 0.79950702967210407),
 (u'young', 0.78944050406331179),
 (u'spider', 0.78827287082192377),
 (u'wise', 0.7662819233076561),
 (u'men', 0.70576506880860157),
 (u'beautiful', 0.69492684203254429),
 (u'evil', 0.6887102864856347)]
 
In [13]: glove.most_similar('frog', number=10)
Out[13]: 
[(u'shark', 0.75775974484778419),
 (u'giant', 0.71914687122031595),
 (u'dodo', 0.70756087345768237),
 (u'dome', 0.70536309001812902),
 (u'serpent', 0.69089042980042681),
 (u'vicious', 0.68885819147237815),
 (u'blonde', 0.68574786672123234),
 (u'panda', 0.6832336174432142),
 (u'penny', 0.68202780165909405)]
 
In [14]: glove.most_similar('girl', number=10)
Out[14]: 
[(u'man', 0.79950702967210407),
 (u'woman', 0.79380171669979771),
 (u'baby', 0.77935645649673957),
 (u'beautiful', 0.77447992804057431),
 (u'young', 0.77355323458632896),
 (u'wise', 0.76219894067614957),
 (u'handsome', 0.74155095749823707),
 (u'girls', 0.72011371864695584),
 (u'atelocynus', 0.71560826080222384)]
 
In [15]: glove.most_similar('car', number=10)
Out[15]: 
[(u'driver', 0.88683873415652947),
 (u'race', 0.84554581794165884),
 (u'crash', 0.76818020141393994),
 (u'cars', 0.76308628267402701),
 (u'taxi', 0.76197230282808859),
 (u'racing', 0.7384645880932772),
 (u'touring', 0.73836030272284159),
 (u'accident', 0.69000847113708996),
 (u'manufacturer', 0.67263805153963518)]
 
In [16]: glove.most_similar('queen', number=10)
Out[16]: 
[(u'elizabeth', 0.91700558183820069),
 (u'victoria', 0.87533970402870487),
 (u'mary', 0.85515424257738148),
 (u'anne', 0.78273531080737502),
 (u'prince', 0.76833451608330772),
 (u'lady', 0.75227426771795192),
 (u'princess', 0.73927079922218319),
 (u'catherine', 0.73538567181156611),
 (u'tudor', 0.73028985404704971)]
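
Finally, here is a short sketch of the paragraph-vector support mentioned above, using the glove model trained in this session. The transform_paragraph signature (epochs, ignore_missing) follows the glove-python README; treat it as an assumption if your version differs:

# Embed a tokenized paragraph into the trained word-vector space.
# ignore_missing=True skips tokens absent from the training vocabulary.
paragraph = ['the', 'king', 'ruled', 'his', 'kingdom', 'wisely']
paragraph_vector = glove.transform_paragraph(paragraph,
                                             epochs=50,
                                             ignore_missing=True)
# The result is a single vector with no_components (here 100) dimensions.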
