Getting Started with Word2Vec and GloVe in Python
In "Getting Started with Word2Vec and GloVe" we introduced word2vec and GloVe. Here we will show how to use both of them in a pure Python environment.
Word2Vec in Python
The great topic modeling tool gensim has implemented word2vec in Python. Install gensim first, then use word2vec like this:
In [1]: from gensim.models import word2vec

In [2]: import logging

In [3]: logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [4]: sentences = word2vec.Text8Corpus('text8')

In [5]: model = word2vec.Word2Vec(sentences, size=200)
2015-02-24 11:14:15,428 : INFO : collecting all words and their counts
2015-02-24 11:14:15,429 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-02-24 11:14:22,863 : INFO : PROGRESS: at sentence #10000, processed 10000000 words and 189074 word types
2015-02-24 11:14:28,218 : INFO : collected 253854 word types from a corpus of 17005207 words and 17006 sentences
2015-02-24 11:14:28,362 : INFO : total 71290 word types after removing those with count<5
2015-02-24 11:14:28,362 : INFO : constructing a huffman tree from 71290 words
2015-02-24 11:14:32,431 : INFO : built huffman tree with maximum node depth 22
2015-02-24 11:14:32,509 : INFO : resetting layer weights
2015-02-24 11:14:34,279 : INFO : training model with 1 workers on 71290 vocabulary and 200 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-02-24 11:14:35,550 : INFO : PROGRESS: at 0.59% words, alpha 0.02500, 77772 words/s
2015-02-24 11:14:36,581 : INFO : PROGRESS: at 1.18% words, alpha 0.02485, 85486 words/s
2015-02-24 11:14:37,661 : INFO : PROGRESS: at 1.77% words, alpha 0.02471, 87258 words/s
...
2015-02-24 11:17:56,434 : INFO : PROGRESS: at 99.38% words, alpha 0.00030, 82190 words/s
2015-02-24 11:17:57,903 : INFO : PROGRESS: at 99.97% words, alpha 0.00016, 82081 words/s
2015-02-24 11:17:57,975 : INFO : training on 16718844 words took 203.7s, 82078 words/s

In [6]: model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
2015-02-24 11:18:38,021 : INFO : precomputing L2-norms of word weight vectors
Out[6]: [(u'wenceslaus', 0.5203313827514648)]

In [7]: model.most_similar(positive=['woman', 'king'], negative=['man'], topn=2)
Out[7]: [(u'wenceslaus', 0.5203313827514648), (u'queen', 0.508660614490509)]

In [8]: model.most_similar(['man'])
Out[8]: [(u'woman', 0.5686948895454407), (u'girl', 0.4957364797592163), (u'young', 0.4457539916038513), (u'luckiest', 0.4420626759529114), (u'serpent', 0.42716869711875916), (u'girls', 0.42680859565734863), (u'smokes', 0.4265017509460449), (u'creature', 0.4227582812309265), (u'robot', 0.417464017868042), (u'mortal', 0.41728296875953674)]

In [9]: model.save('text8.model')
2015-02-24 11:19:26,059 : INFO : saving Word2Vec object under text8.model, separately None
2015-02-24 11:19:26,060 : INFO : not storing attribute syn0norm
2015-02-24 11:19:26,060 : INFO : storing numpy array 'syn0' to text8.model.syn0.npy
2015-02-24 11:19:26,742 : INFO : storing numpy array 'syn1' to text8.model.syn1.npy

In [10]: model.save_word2vec_format('text.model.bin', binary=True)
2015-02-24 11:19:52,341 : INFO : storing 71290x200 projection weights into text.model.bin

In [12]: model1 = word2vec.Word2Vec.load_word2vec_format('text.model.bin', binary=True)
2015-02-24 11:22:08,185 : INFO : loading projection weights from text.model.bin
2015-02-24 11:22:10,322 : INFO : loaded (71290, 200) matrix from text.model.bin
2015-02-24 11:22:10,322 : INFO : precomputing L2-norms of word weight vectors

In [13]: model1.most_similar(['girl', 'father'], ['boy'], topn=3)
Out[13]: [(u'mother', 0.6219865083694458), (u'grandmother', 0.556104838848114), (u'wife', 0.5440170764923096)]

In [14]: more_examples = ["he is she", "big bigger bad", "going went being"]

In [15]: for example in more_examples:
   ....:     a, b, x = example.split()
   ....:     predicted = model.most_similar([x, b], [a])[0][0]
   ....:     print "'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted)
   ....:
'he' is to 'is' as 'she' is to 'was'
'big' is to 'bigger' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'were'
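Under the hood, most_similar(positive=..., negative=...) implements the vector-offset analogy: it sums the positive vectors, subtracts the negative ones, and ranks the vocabulary by cosine similarity to the result, excluding the query words themselves. A minimal numpy sketch of that ranking; the toy vocabulary and vector values below are invented purely for illustration:

```python
import numpy as np

# Toy embedding matrix; the vectors are made up for illustration only.
vocab = ["king", "queen", "man", "woman", "apple"]
vecs = np.array([
    [0.9, 0.8, 0.1],   # king
    [0.9, 0.1, 0.8],   # queen
    [0.1, 0.9, 0.1],   # man
    [0.1, 0.2, 0.9],   # woman
    [0.5, 0.5, 0.5],   # apple
])
# Normalize rows so plain dot products become cosine similarities.
norms = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
index = {w: i for i, w in enumerate(vocab)}

def most_similar(positive, negative, topn=1):
    # "king - man + woman" style query: sum positives, subtract negatives.
    query = sum(norms[index[w]] for w in positive) - sum(norms[index[w]] for w in negative)
    query /= np.linalg.norm(query)
    sims = norms @ query
    # Exclude the query words themselves, as gensim does.
    for w in positive + negative:
        sims[index[w]] = -np.inf
    best = np.argsort(-sims)[:topn]
    return [(vocab[i], float(sims[i])) for i in best]

print(most_similar(["woman", "king"], ["man"], topn=1))  # 'queen' ranks first
```

With real trained vectors the same ranking produces the analogies shown in the session above.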
With gensim's word2vec module, you can also load a model trained by the original C word2vec package:
In [16]: model_org = word2vec.Word2Vec.load_word2vec_format('vectors.bin', binary=True)
2015-02-24 11:35:01,814 : INFO : loading projection weights from vectors.bin
2015-02-24 11:35:03,756 : INFO : loaded (71291, 200) matrix from vectors.bin
2015-02-24 11:35:03,757 : INFO : precomputing L2-norms of word weight vectors

In [17]: model_org.most_similar('frog')
Out[17]: [(u'lizard', 0.5382058024406433), (u'kermit', 0.522418737411499), (u'squirrel', 0.502967357635498), (u'toad', 0.5023295283317566), (u'poodle', 0.49445223808288574), (u'gigas', 0.4928397536277771), (u'moth', 0.49125388264656067), (u'frogs', 0.4899308979511261), (u'shrew', 0.48939722776412964), (u'cute', 0.4872947931289673)]
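The binary format shared by the C tool and gensim is simple: an ASCII header line with the vocabulary size and vector dimensionality, then, for each word, the word text followed by a space, its vector as raw float32 values, and a newline. A self-contained sketch that writes and re-reads a toy file in this layout; the file name and vectors are made up for illustration:

```python
import struct

def write_word2vec_bin(path, vectors):
    # vectors: dict mapping word -> list of floats, all the same length.
    dim = len(next(iter(vectors.values())))
    with open(path, "wb") as f:
        f.write(("%d %d\n" % (len(vectors), dim)).encode("utf8"))
        for word, vec in vectors.items():
            f.write(word.encode("utf8") + b" ")
            f.write(struct.pack("%df" % dim, *vec))  # raw little-endian float32s
            f.write(b"\n")

def read_word2vec_bin(path):
    vectors = {}
    with open(path, "rb") as f:
        n_words, dim = map(int, f.readline().split())
        for _ in range(n_words):
            # The word is everything up to the first space.
            word = b""
            while True:
                ch = f.read(1)
                if ch == b" ":
                    break
                word += ch
            vec = struct.unpack("%df" % dim, f.read(4 * dim))
            f.read(1)  # consume the trailing newline
            vectors[word.decode("utf8")] = list(vec)
    return vectors

write_word2vec_bin("toy.bin", {"frog": [0.1, 0.2], "toad": [0.3, 0.4]})
print(read_word2vec_bin("toy.bin"))
```

Note that round-tripping through float32 loses a little precision, which is why gensim loads these files into a float32 numpy matrix rather than Python floats.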
Gensim also provides some good material about word2vec in Python; see the following articles:
- models.word2vec – Deep learning with word2vec
- Deep learning with word2vec and gensim
- Word2vec Tutorial
- Making sense of word2vec
GloVe in Python
glove-python is a Python implementation of GloVe:
Installation
- Clone this repository.
- Make sure you have a compiler that supports OpenMP and C++11. On OS X, you'll need to install gcc from Homebrew or MacPorts. The setup script uses gcc-4.9, but you can probably change that.
- Make sure you have Cython installed.
- Run python setup.py develop to install in development mode; python setup.py install to install normally.
- from glove import Glove, Corpus should get you started.
Usage
Producing the embeddings is a two-step process: creating a co-occurrence matrix from the corpus, and then using it to produce the embeddings. The Corpus class helps in constructing a corpus from an iterable of tokens; the Glove class trains the embeddings (with a sklearn-esque API).
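The first step can be sketched in plain Python: slide a window over each sentence and accumulate weighted co-occurrence counts, where GloVe weights a pair by the inverse of the distance between the two tokens. A simplified sketch using a dict instead of the sparse matrix glove-python actually builds:

```python
from collections import defaultdict

def cooccurrence(sentences, window=10):
    # Map (word, context_word) -> weighted count. Each co-occurrence
    # contributes 1/distance, so closer words count for more.
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                counts[(word, tokens[j])] += 1.0 / (i - j)
                counts[(tokens[j], word)] += 1.0 / (i - j)
    return dict(counts)

m = cooccurrence([["the", "quick", "brown", "fox"]], window=2)
print(m[("quick", "brown")])  # adjacent pair -> weight 1.0
print(m[("the", "brown")])    # distance 2 -> weight 0.5
```

corpus.fit(sentences, window=10) in the session below does the same accumulation, only over the whole corpus and into a sparse matrix.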
There is also support for rudimentary paragraph vectors. A paragraph vector (in this case) is an embedding of a paragraph (a multi-word piece of text) in the word vector space, constructed so that the paragraph representation is close to the words it contains, adjusted for the frequency of words in the corpus (in a manner similar to tf-idf weighting). These can be obtained after having trained word embeddings by calling the transform_paragraph method on the trained model.
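As a rough illustration of the idea (not the library's actual transform_paragraph, which fits a new vector against co-occurrence statistics), a paragraph can be placed near its words by averaging their vectors with idf-style weights that down-weight frequent words. All vectors and counts below are invented for illustration:

```python
import math
import numpy as np

# Toy word vectors and corpus frequencies, invented for illustration.
word_vectors = {
    "cat": np.array([0.9, 0.1]),
    "dog": np.array([0.8, 0.2]),
    "the": np.array([0.1, 0.1]),
}
corpus_freq = {"cat": 10, "dog": 12, "the": 1000}
total = sum(corpus_freq.values())

def paragraph_vector(tokens):
    # Weighted average: rare words get large log(total/freq) weights and
    # dominate, in the spirit of tf-idf weighting.
    weights = np.array([math.log(total / corpus_freq[t]) for t in tokens])
    vecs = np.array([word_vectors[t] for t in tokens])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

v = paragraph_vector(["the", "cat", "the", "dog"])
print(v)  # lands near "cat"/"dog", barely moved by the frequent "the"
```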
After installing glove-python, you can use it like this:
In [1]: import itertools

In [2]: from gensim.models.word2vec import Text8Corpus

In [3]: from glove import Corpus, Glove

In [4]: sentences = list(itertools.islice(Text8Corpus('text8'), None))

In [5]: corpus = Corpus()

In [6]: corpus.fit(sentences, window=10)

In [7]: glove = Glove(no_components=100, learning_rate=0.05)

In [8]: glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
Performing 30 training epochs with 4 threads
Epoch 0
Epoch 1
Epoch 2
...
Epoch 27
Epoch 28
Epoch 29

In [9]: glove.add_dictionary(corpus.dictionary)

In [10]: glove.most_similar('man')
Out[10]: [(u'terc', 0.82866443231836828), (u'woman', 0.81587362007162523), (u'girl', 0.79950702967210407), (u'young', 0.78944050406331179)]

In [12]: glove.most_similar('man', number=10)
Out[12]: [(u'terc', 0.82866443231836828), (u'woman', 0.81587362007162523), (u'girl', 0.79950702967210407), (u'young', 0.78944050406331179), (u'spider', 0.78827287082192377), (u'wise', 0.7662819233076561), (u'men', 0.70576506880860157), (u'beautiful', 0.69492684203254429), (u'evil', 0.6887102864856347)]

In [13]: glove.most_similar('frog', number=10)
Out[13]: [(u'shark', 0.75775974484778419), (u'giant', 0.71914687122031595), (u'dodo', 0.70756087345768237), (u'dome', 0.70536309001812902), (u'serpent', 0.69089042980042681), (u'vicious', 0.68885819147237815), (u'blonde', 0.68574786672123234), (u'panda', 0.6832336174432142), (u'penny', 0.68202780165909405)]

In [14]: glove.most_similar('girl', number=10)
Out[14]: [(u'man', 0.79950702967210407), (u'woman', 0.79380171669979771), (u'baby', 0.77935645649673957), (u'beautiful', 0.77447992804057431), (u'young', 0.77355323458632896), (u'wise', 0.76219894067614957), (u'handsome', 0.74155095749823707), (u'girls', 0.72011371864695584), (u'atelocynus', 0.71560826080222384)]

In [15]: glove.most_similar('car', number=10)
Out[15]: [(u'driver', 0.88683873415652947), (u'race', 0.84554581794165884), (u'crash', 0.76818020141393994), (u'cars', 0.76308628267402701), (u'taxi', 0.76197230282808859), (u'racing', 0.7384645880932772), (u'touring', 0.73836030272284159), (u'accident', 0.69000847113708996), (u'manufacturer', 0.67263805153963518)]

In [16]: glove.most_similar('queen', number=10)
Out[16]: [(u'elizabeth', 0.91700558183820069), (u'victoria', 0.87533970402870487), (u'mary', 0.85515424257738148), (u'anne', 0.78273531080737502), (u'prince', 0.76833451608330772), (u'lady', 0.75227426771795192), (u'princess', 0.73927079922218319), (u'catherine', 0.73538567181156611), (u'tudor', 0.73028985404704971)]
Posted by TextMiner