Getting Started with Word2Vec and GloVe
Word2Vec and GloVe are two popular word embedding algorithms that construct vector representations of words. These vectors can be used to compute the semantic similarity between words mathematically. The authors of both methods have open-sourced their C/C++ tools, and both have also been implemented in other languages such as Python and Java. Here we give a simple tutorial on how to use word2vec and GloVe on Mac OS and Ubuntu Linux.
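To make "semantic similarity from vectors" concrete, here is a minimal sketch of cosine similarity, the measure the word2vec distance tool reports later in this post. The three-dimensional vectors are made up for illustration; real embeddings come from the trained models and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, NOT taken from any trained model.
toy_vectors = {
    "frog": [0.9, 0.1, 0.3],
    "toad": [0.8, 0.2, 0.4],
    "car": [0.1, 0.9, 0.0],
}

frog_toad = cosine_similarity(toy_vectors["frog"], toy_vectors["toad"])
frog_car = cosine_similarity(toy_vectors["frog"], toy_vectors["car"])

# Vectors pointing in similar directions score closer to 1.
print("frog/toad:", round(frog_toad, 3))
print("frog/car:", round(frog_car, 3))
```

A higher value means the two words point in a more similar direction in the vector space.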
1. Word2Vec: Tool for computing continuous distributed representations of words.
Project website:
https://code.google.com/p/word2vec/
Introduction:
This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.
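The two architectures differ mainly in what is predicted from what. The sketch below only shows how training examples could be drawn from a sentence under each scheme. It is a simplification with a fixed window and leaves out everything else the real tool does (subsampling, negative sampling, the network itself); the function names are hypothetical.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs: skip-gram predicts each context word from the center word."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

def cbow_examples(tokens, window):
    """Yield (context_words, center): CBOW predicts the center word from its surrounding words."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        yield context, center

tokens = "the quick brown fox".split()
sg = list(skipgram_pairs(tokens, window=1))
cb = list(cbow_examples(tokens, window=1))

for center, context in sg:
    print("skip-gram:", center, "->", context)
for context, center in cb:
    print("cbow:", context, "->", center)
```

In the demo command shown further down, -cbow 1 selects the continuous bag-of-words variant and -window 8 sets the window size.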
Related Papers:
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
Quick start
On my Ubuntu 14.04 Linux VPS:
svn checkout http://word2vec.googlecode.com/svn/trunk/
cd trunk/
make
Run the demo script: ./demo-word.sh
make: Nothing to be done for `all'.
--2015-02-06 16:10:37-- http://mattmahoney.net/dc/text8.zip
Resolving mattmahoney.net (mattmahoney.net)... 98.139.135.198
Connecting to mattmahoney.net (mattmahoney.net)|98.139.135.198|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: ‘text8.gz’
100%[======================================>] 31,344,016 1.14MB/s in 28s
2015-02-06 16:11:06 (1.06 MB/s) - ‘text8.gz’ saved [31344016/31344016]
+ ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000005 Progress: 100.10% Words/thread/sec: 36.51k
real 31m21.907s
user 114m19.023s
sys 0m21.822s
+ ./distance vectors.bin
Enter word or sentence (EXIT to break): frog
Word: frog Position in vocabulary: 10075
Word Cosine distance
------------------------------------------------------------------------
lizard 0.538206
kermit 0.522419
squirrel 0.502967
toad 0.502330
poodle 0.494452
gigas 0.492840
moth 0.491254
frogs 0.489931
shrew 0.489397
cute 0.487295
crocodile 0.485067
crustacean 0.480669
claw 0.480084
reptile 0.474769
macaw 0.473430
midget 0.472230
jellyfish 0.471446
melon 0.471243
cockroach 0.470895
turkeys 0.470282
sponges 0.469138
crab 0.466412
turtle 0.465428
possum 0.464029
marsupial 0.463663
sapo 0.463064
bunny 0.462950
butterflies 0.460651
chipmunks 0.458148
platypus 0.456636
ostrich 0.455247
herbivore 0.453438
snail 0.451927
bufo 0.450847
chondrichthyes 0.448135
proboscis 0.445187
bald 0.445096
wombats 0.443952
brachydanio 0.442688
possums 0.442489
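The same kind of lookup can be sketched in Python. The reader below assumes the binary layout written by the C tool with -binary 1: a text header with the vocabulary size and vector size, then for each word its text, a space, and the raw float32 values. It also assumes little-endian floats, which holds on typical x86 machines. The toy file is built in memory with made-up values so the example runs without vectors.bin.

```python
import io
import math
import struct

def read_word2vec_binary(stream):
    """Parse the assumed layout: 'vocab dim' header line, then word + space + float32 vector."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # skip the newline that ends the previous entry
                chars.extend(ch)
        values = struct.unpack("<%df" % dim, stream.read(4 * dim))
        # Normalize to unit length so a dot product equals cosine similarity.
        norm = math.sqrt(sum(x * x for x in values)) or 1.0
        vectors[chars.decode("utf-8", errors="replace")] = [x / norm for x in values]
    return vectors

def nearest(vectors, word, topn=5):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (sum(a * b for a, b in zip(query, vec)), other)
        for other, vec in vectors.items()
        if other != word
    ]
    return [(w, s) for s, w in sorted(scores, reverse=True)[:topn]]

# Hypothetical toy file standing in for vectors.bin.
toy = io.BytesIO()
toy.write(b"3 2\n")
for w, vec in [("frog", (1.0, 0.0)), ("toad", (0.9, 0.1)), ("car", (0.0, 1.0))]:
    toy.write(w.encode("utf-8") + b" " + struct.pack("<2f", *vec) + b"\n")
toy.seek(0)

vectors = read_word2vec_binary(toy)
neighbors = nearest(vectors, "frog", topn=2)
print(neighbors)
```

For the real model, pass open("vectors.bin", "rb") instead of the in-memory buffer. Note that the column labelled "Cosine distance" in the tool's output behaves like a similarity: larger values mean closer words.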
On my Mac OS machine:
svn checkout http://word2vec.googlecode.com/svn/trunk/
cd trunk/
make
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
cc1: error: unrecognized command line option "-Wno-unused-result"
word2vec.c:1: error: bad value (native) for -march= switch
word2vec.c:1: error: bad value (native) for -mtune= switch
make: *** [word2vec] Error 1
Here we need to modify the Makefile for Mac OS:
#CFLAGS = -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
CFLAGS = -lm -pthread -O3 -Wall -funroll-loops
Run make again, and we get different errors:
gcc word2vec.c -o word2vec -lm -pthread -O3 -Wall -funroll-loops
gcc word2phrase.c -o word2phrase -lm -pthread -O3 -Wall -funroll-loops
gcc distance.c -o distance -lm -pthread -O3 -Wall -funroll-loops
distance.c:18:20: error: malloc.h: No such file or directory
distance.c: In function ‘main’:
distance.c:46: warning: implicit declaration of function ‘malloc’
distance.c:46: warning: incompatible implicit declaration of built-in function ‘malloc’
distance.c:31: warning: unused variable ‘ch’
make: *** [distance] Error 1
Here we need to change the malloc.h include in distance.c, word-analogy.c and compute-accuracy.c:
#include <stdlib.h>
Then run make again, and the build succeeds.
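If you prefer not to edit the three files by hand, the change can be scripted. The sketch below assumes the fix is to swap the malloc.h include for stdlib.h, the standard header that declares malloc; the function names are hypothetical, and you should review the patched files before rebuilding.

```python
import re
from pathlib import Path

# Matches '#include <malloc.h>' at the start of a line, allowing extra spaces.
MALLOC_INCLUDE = re.compile(r"^([ \t]*#[ \t]*include[ \t]*)<malloc\.h>", re.MULTILINE)

def patch_malloc_include(source):
    """Return the C source with the malloc.h include replaced by stdlib.h."""
    return MALLOC_INCLUDE.sub(r"\1<stdlib.h>", source)

def patch_files(paths):
    """Rewrite each file in place, but only if something actually changed."""
    for path in map(Path, paths):
        text = path.read_text()
        patched = patch_malloc_include(text)
        if patched != text:
            path.write_text(patched)

sample = "#include <stdio.h>\n#include <malloc.h>\nint main(void) { return 0; }\n"
patched_sample = patch_malloc_include(sample)
print(patched_sample)
```

From the trunk/ directory this would be called as patch_files(["distance.c", "word-analogy.c", "compute-accuracy.c"]).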
Run the demo script ./demo-word.sh, and we get the same result as on Ubuntu.
2. GloVe: Global Vectors for Word Representation
Project website:
http://nlp.stanford.edu/projects/glove/
Introduction:
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
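To illustrate what "word-word co-occurrence statistics" means, here is a minimal sketch that counts co-occurrences in a symmetric window, with each pair contributing 1/d when the words are d positions apart, as the GloVe paper describes. It also includes the paper's weighting function f(x), which caps the influence of very frequent pairs. The defaults mirror the x_max and alpha values printed in the demo log further down; the function names and the toy sentence are assumptions for illustration, not the code of the C tools.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window):
    """Symmetric co-occurrence counts; a pair d positions apart adds 1/d."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for offset in range(1, window + 1):
            j = i - offset
            if j < 0:
                break
            left = tokens[j]
            weight = 1.0 / offset
            # Symmetric context: record the pair in both directions.
            counts[(word, left)] += weight
            counts[(left, word)] += weight
    return dict(counts)

def glove_weight(x, x_max=10.0, alpha=0.75):
    """f(x) = (x / x_max) ** alpha below x_max, and 1 at or above it."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

tokens = "ice is cold and steam is hot".split()
counts = cooccurrence_counts(tokens, window=2)
weights = {pair: glove_weight(x) for pair, x in counts.items()}

for pair in [("ice", "is"), ("ice", "cold"), ("steam", "hot")]:
    print(pair, counts[pair], round(weights[pair], 4))
```

Training then fits word vectors so that their dot products approximate the logarithm of these counts, with each pair's error scaled by f(x).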
Related Paper:
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP, 2014.
Quick start
On my Ubuntu 14.04 Linux VPS:
wget http://www-nlp.stanford.edu/software/glove.tar.gz
--2015-02-06 19:00:06-- http://www-nlp.stanford.edu/software/glove.tar.gz
Resolving www-nlp.stanford.edu (www-nlp.stanford.edu)... 171.64.67.140
Connecting to www-nlp.stanford.edu (www-nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 93654 (91K) [application/x-gzip]
Saving to: ‘glove.tar.gz’
100%[======================================>] 93,654 276KB/s in 0.3s
2015-02-06 19:00:07 (276 KB/s) - ‘glove.tar.gz’ saved [93654/93654]
tar -zxvf glove.tar.gz
glove/
glove/demo.sh
glove/cooccur.c
glove/shuffle.c
glove/makefile
glove/glove.c
glove/LICENSE
glove/vocab_count.c
glove/eval/
glove/eval/._read_and_evaluate.m
glove/eval/WordLookup.m
glove/eval/question-data/
glove/eval/question-data/gram2-opposite.txt
glove/eval/question-data/capital-world.txt
glove/eval/question-data/._family.txt
glove/eval/question-data/city-in-state.txt
glove/eval/question-data/gram6-nationality-adjective.txt
glove/eval/question-data/._gram8-plural.txt
glove/eval/question-data/._gram1-adjective-to-adverb.txt
glove/eval/question-data/gram9-plural-verbs.txt
glove/eval/question-data/._capital-common-countries.txt
glove/eval/question-data/._gram7-past-tense.txt
glove/eval/question-data/gram3-comparative.txt
glove/eval/question-data/._gram5-present-participle.txt
glove/eval/question-data/._capital-world.txt
glove/eval/question-data/currency.txt
glove/eval/question-data/._currency.txt
glove/eval/question-data/gram5-present-participle.txt
glove/eval/question-data/gram1-adjective-to-adverb.txt
glove/eval/question-data/._city-in-state.txt
glove/eval/question-data/._gram6-nationality-adjective.txt
glove/eval/question-data/capital-common-countries.txt
glove/eval/question-data/._gram2-opposite.txt
glove/eval/question-data/family.txt
glove/eval/question-data/gram7-past-tense.txt
glove/eval/question-data/gram8-plural.txt
glove/eval/question-data/._gram3-comparative.txt
glove/eval/question-data/._gram4-superlative.txt
glove/eval/question-data/._gram9-plural-verbs.txt
glove/eval/question-data/gram4-superlative.txt
glove/eval/read_and_evaluate.m
glove/eval/evaluate_vectors.m
glove/eval/._evaluate_vectors.m
glove/eval/._question-data
glove/eval/._WordLookup.m
glove/README
cd glove
make
gcc glove.c -o glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc shuffle.c -o shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc cooccur.c -o cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc vocab_count.c -o vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
./demo.sh
make: Nothing to be done for `all'.
--2015-02-06 19:08:21-- http://mattmahoney.net/dc/text8.zip
Resolving mattmahoney.net (mattmahoney.net)... 98.139.135.198
Connecting to mattmahoney.net (mattmahoney.net)|98.139.135.198|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: ‘text8.zip’
100%[======================================>] 31,344,016 1.11MB/s in 31s
2015-02-06 19:09:26 (992 KB/s) - ‘text8.zip’ saved [31344016/31344016]
Archive: text8.zip
inflating: text8
BUILDING VOCABULARY
Processed 17005207 tokens.
Counted 253854 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 71290.
COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 71290 words.
Building lookup table...table contains 94990279 elements.
Processed 17005206 tokens.
Writing cooccurrences to disk.........2 files in total.
Merging cooccurrence files: processed 60666466 lines.
SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 60666466 lines.
Wrote 1 temporary file(s).
Merging temp files: processed 60666466 lines.
50 -binary 2 -vocab-file vocab.txt -verbose 2
TRAINING MODEL
Read 60666466 lines.
Initializing parameters...done.
vector size: 50
vocab size: 71290
x_max: 10.000000
alpha: 0.750000
iter: 001, cost: 0.068993
iter: 002, cost: 0.051689
iter: 003, cost: 0.046155
iter: 004, cost: 0.043026
iter: 005, cost: 0.041184
iter: 006, cost: 0.039980
iter: 007, cost: 0.039116
iter: 008, cost: 0.038461
iter: 009, cost: 0.037943
iter: 010, cost: 0.037521
iter: 011, cost: 0.037173
iter: 012, cost: 0.036875
iter: 013, cost: 0.036622
iter: 014, cost: 0.036398
iter: 015, cost: 0.036204
line 37: matlab: command not found
We have now trained the GloVe model and saved it as vectors.bin and vectors.txt. The GloVe demo needs MATLAB to evaluate the vectors, which is why the last line reports that the matlab command was not found. We will show how to use the model from Python in the next post.
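In the meantime, here is a minimal sketch of reading a file like vectors.txt. It assumes the text format is one word per line followed by its space-separated vector components; the sample lines are made up so the example runs without the real file, and the function names are hypothetical.

```python
import io
import math

def load_text_vectors(lines):
    """Parse 'word v1 v2 ... vn' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def similarity(vectors, a, b):
    """Cosine similarity between the vectors of two words."""
    u, v = vectors[a], vectors[b]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

# Hypothetical sample standing in for vectors.txt.
sample = io.StringIO("frog 0.9 0.1\ntoad 0.8 0.2\ncar 0.1 0.9\n")
glove_vectors = load_text_vectors(sample)

print(len(glove_vectors), "words loaded")
print("frog/toad:", round(similarity(glove_vectors, "frog", "toad"), 3))
print("frog/car:", round(similarity(glove_vectors, "frog", "car"), 3))
```

With the trained model, replace the sample with open("vectors.txt", encoding="utf-8").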
On my Mac OS machine:
wget http://www-nlp.stanford.edu/software/glove.tar.gz
tar -zxvf glove.tar.gz
cd glove
make
gcc glove.c -o glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
cc1: error: invalid option argument ‘-Ofast’
cc1: error: unrecognized command line option "-Wno-unused-result"
glove.c:1: error: bad value (native) for -march= switch
glove.c:1: error: bad value (native) for -mtune= switch
make: *** [glove] Error 1
As with word2vec, we need to modify the makefile:
#CFLAGS = -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
CFLAGS = -lm -pthread -funroll-loops
make
gcc glove.c -o glove -lm -pthread -funroll-loops
gcc shuffle.c -o shuffle -lm -pthread -funroll-loops
gcc cooccur.c -o cooccur -lm -pthread -funroll-loops
gcc vocab_count.c -o vocab_count -lm -pthread -funroll-loops
Run demo.sh, and we get the same result as on the Ubuntu VPS.
We will stop here and explain how to use both models in Python in the next chapter.
Posted by TextMiner
I now have the file vectors.txt.
Looking forward to chapter 2 to learn how to use it…
Will this work for Windows too?
I have not tested it under Windows, but with a proper Python environment installed on Windows it should work.
Regarding the dataset, how do you load it? If I want to use a different dataset, how do I import it and use it with this code? Can I import any kind of dataset, or does it need to meet any specifications?
The simplest way is to preprocess any text file into one article per line, and then use it to train the model.
So basically, the word2vec model is used to produce vector representations of words. Can I use cosine similarity as the metric in the DBSCAN algorithm for text clustering? By the way, I cannot access the code link. Thank you very much!
I have numerous Word documents from which I need to produce a word visualization (like a word cloud) using the GloVe method. How can I do this in a Windows MATLAB environment?