Training a Word2Vec Model on English Wikipedia with Gensim
After learning about word2vec and GloVe, a natural next step is to train a related model on a larger corpus, and English Wikipedia is an ideal choice for this task. After googling keywords like “word2vec wikipedia” and “gensim word2vec wikipedia”, I found that the discussion in the gensim Google Group post “training word2vec on full Wikipedia” gives a proper solution for this task. Note that there is another option, wiki2vec, but I think the gensim word2vec way is very simple and effective.
The English Wikipedia data I downloaded is here (the dump is dated 2015-03-01, about 11 GB):
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
First, we need to convert the XML-format Wikipedia dump to text format. Copy the code from the post into process_wiki.py:
Note: I have modified it into a new version compatible with both Python 2.x and Python 3.x (2017.05.01).
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017

from __future__ import print_function

import logging
import os.path
import six
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) != 3:
        print("Using: python process_wiki.py enwiki.xxx.xml.bz2 wiki.en.text")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(b' '.join(text).decode('utf-8') + '\n')
        # ###another method###
        #     output.write(
        #         space.join(map(lambda x: x.decode("utf-8"), text)) + '\n')
        else:
            output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
Note that there is a small difference here:
wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
The original is: wiki = WikiCorpus(inp, dictionary={})
We set lemmatize to False so that pattern is not used, because pattern slows down the processing considerably.
Execute “python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text”, and we get:
2015-03-07 15:08:39,181: INFO: running process_enwiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
2015-03-07 15:11:12,860: INFO: Saved 10000 articles
2015-03-07 15:13:25,369: INFO: Saved 20000 articles
2015-03-07 15:15:19,771: INFO: Saved 30000 articles
2015-03-07 15:16:58,424: INFO: Saved 40000 articles
2015-03-07 15:18:12,374: INFO: Saved 50000 articles
2015-03-07 15:19:03,213: INFO: Saved 60000 articles
2015-03-07 15:19:47,656: INFO: Saved 70000 articles
2015-03-07 15:20:29,135: INFO: Saved 80000 articles
2015-03-07 15:22:02,365: INFO: Saved 90000 articles
2015-03-07 15:23:40,141: INFO: Saved 100000 articles
.....
2015-03-07 19:33:16,549: INFO: Saved 3700000 articles
2015-03-07 19:33:49,493: INFO: Saved 3710000 articles
2015-03-07 19:34:23,442: INFO: Saved 3720000 articles
2015-03-07 19:34:57,984: INFO: Saved 3730000 articles
2015-03-07 19:35:31,976: INFO: Saved 3740000 articles
2015-03-07 19:36:05,790: INFO: Saved 3750000 articles
2015-03-07 19:36:32,392: INFO: finished iterating over Wikipedia corpus of 3758076 documents with 2018886604 positions (total 15271374 articles, 2075130438 positions before pruning articles shorter than 50 words)
2015-03-07 19:36:32,394: INFO: Finished Saved 3758076 articles
After about 5 hours of processing on my Mac Pro (4-core CPU, 16 GB RAM), we get a 12 GB wiki.en.text with one article per line; the punctuation has been discarded. It looks like this:
anarchism is collection of movements and ideologies that hold the state to be undesirable unnecessary or harmful these movements advocate some form of stateless society instead often based on self governed voluntary institutions or non hierarchical free associations although anti statism is central to anarchism as political philosophy anarchism also entails rejection of and often hierarchical organisation in general as an anti dogmatic philosophy anarchism draws on many currents of thought and strategy anarchism does not offer fixed body of doctrine from single particular world view instead fluxing and flowing as philosophy there are many types and traditions of anarchism not all of which are mutually exclusive anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications anarchism is usually considered radical left wing ideology and much of anarchist economics and anarchist legal philosophy reflect anti authoritarian interpretations of communism collectivism syndicalism mutualism or participatory economics etymology and terminology the term anarchism is compound word composed from the word anarchy and the suffix ism themselves derived respectively from the greek anarchy from anarchos meaning one without rulers from the privative prefix ἀν an without and archos leader ruler cf archon or arkhē authority sovereignty realm magistracy and the suffix or ismos isma from the verbal infinitive suffix...
The second step is training the word2vec model on this text. You could use the original word2vec binary or the GloVe binary to train a model, as with the text8 file, but that seems very slow. Like the post, we use the gensim Word2Vec implementation to train the English Wikipedia model. Copy the code from the post into train_word2vec_model.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys
import multiprocessing

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    model.init_sims(replace=True)
    model.save(outp)
Execute “python train_word2vec_model.py wiki.en.text wiki.en.word2vec.model”:
2015-03-07 20:11:47,796 : INFO : running train_word2vec_model.py wiki.en.texx wiki.en.word2vec.model
2015-03-07 20:11:47,801 : INFO : collecting all words and their counts
2015-03-07 20:11:47,823 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-07 20:12:09,816 : INFO : PROGRESS: at sentence #10000, processed 29353579 words and 430650 word types
2015-03-07 20:12:29,920 : INFO : PROGRESS: at sentence #20000, processed 54695775 words and 610833 word types
2015-03-07 20:12:45,654 : INFO : PROGRESS: at sentence #30000, processed 75344844 words and 742274 word types
2015-03-07 20:13:02,623 : INFO : PROGRESS: at sentence #40000, processed 93430415 words and 859131 word types
2015-03-07 20:13:13,613 : INFO : PROGRESS: at sentence #50000, processed 106057188 words and 935606 word types
2015-03-07 20:13:20,383 : INFO : PROGRESS: at sentence #60000, processed 114319016 words and 952771 word types
2015-03-07 20:13:25,511 : INFO : PROGRESS: at sentence #70000, processed 121263134 words and 969526 word types
2015-03-07 20:13:30,756 : INFO : PROGRESS: at sentence #80000, processed 127773799 words and 984130 word types
2015-03-07 20:13:42,144 : INFO : PROGRESS: at sentence #90000, processed 142688762 words and 1062932 word types
2015-03-07 20:13:54,513 : INFO : PROGRESS: at sentence #100000, processed 159550824 words and 1157644 word types
......
2015-03-07 20:36:02,246 : INFO : PROGRESS: at sentence #3700000, processed 1999452503 words and 7990138 word types
2015-03-07 20:36:04,786 : INFO : PROGRESS: at sentence #3710000, processed 2002777270 words and 8002903 word types
2015-03-07 20:36:07,423 : INFO : PROGRESS: at sentence #3720000, processed 2006213923 words and 8019620 word types
2015-03-07 20:36:10,115 : INFO : PROGRESS: at sentence #3730000, processed 2009762733 words and 8035408 word types
2015-03-07 20:36:12,595 : INFO : PROGRESS: at sentence #3740000, processed 2013066196 words and 8045218 word types
2015-03-07 20:36:15,120 : INFO : PROGRESS: at sentence #3750000, processed 2016363087 words and 8057784 word types
2015-03-07 20:36:17,057 : INFO : collected 8069236 word types from a corpus of 2018886604 words and 3758076 sentences
2015-03-07 20:36:22,710 : INFO : total 1969354 word types after removing those with count<5
2015-03-07 20:36:22,710 : INFO : constructing a huffman tree from 1969354 words
2015-03-07 20:38:20,767 : INFO : built huffman tree with maximum node depth 29
2015-03-07 20:38:23,219 : INFO : resetting layer weights
2015-03-07 20:39:18,277 : INFO : training model with 4 workers on 1969354 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-07 20:39:33,141 : INFO : PROGRESS: at 0.01% words, alpha 0.02500, 18766 words/s
2015-03-07 20:39:34,874 : INFO : PROGRESS: at 0.05% words, alpha 0.02500, 56782 words/s
2015-03-07 20:39:35,886 : INFO : PROGRESS: at 0.07% words, alpha 0.02500, 76206 words/s
2015-03-07 20:39:41,163 : INFO : PROGRESS: at 0.08% words, alpha 0.02499, 66533 words/s
2015-03-07 20:39:43,442 : INFO : PROGRESS: at 0.09% words, alpha 0.02500, 70345 words/s
2015-03-07 20:39:47,604 : INFO : PROGRESS: at 0.11% words, alpha 0.02498, 77893 words/s
......
2015-03-08 02:33:26,624 : INFO : PROGRESS: at 99.19% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:27,976 : INFO : PROGRESS: at 99.20% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:29,097 : INFO : PROGRESS: at 99.21% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:30,465 : INFO : PROGRESS: at 99.21% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:31,768 : INFO : PROGRESS: at 99.22% words, alpha 0.00020, 93813 words/s
2015-03-08 02:33:32,839 : INFO : PROGRESS: at 99.22% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:32,839 : INFO : PROGRESS: at 99.22% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:33,535 : INFO : reached the end of input; waiting to finish 8 outstanding jobs
2015-03-08 02:33:33,939 : INFO : PROGRESS: at 99.23% words, alpha 0.00019, 93815 words/s
2015-03-08 02:33:34,998 : INFO : PROGRESS: at 99.23% words, alpha 0.00019, 93815 words/s
2015-03-08 02:33:36,127 : INFO : PROGRESS: at 99.24% words, alpha 0.00019, 93815 words/s
2015-03-08 02:33:36,961 : INFO : training on 1994415728 words took 21258.7s, 93817 words/s
2015-03-08 02:33:36,996 : INFO : precomputing L2-norms of word weight vectors
2015-03-08 02:33:58,490 : INFO : saving Word2Vec object under wiki.en.word2vec.model, separately None
2015-03-08 02:33:58,666 : INFO : not storing attribute syn0norm
2015-03-08 02:33:58,666 : INFO : storing numpy array 'syn0' to wiki.en.word2vec.model.syn0.npy
After about 7 hours, we get the English Wikipedia model “wiki.en.word2vec.model”, but we find something strange with it:
In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load("wiki.en.word2vec.model")

In [3]: model.most_similar("queen")
...python2.7/site-packages/gensim/models/word2vec.py:827: RuntimeWarning: invalid value encountered in divide
  self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)

Out[3]:
[(u'ahsns', nan),
 (u'ny\xedl', nan),
 (u'indradeo', nan),
 (u'jaimovich', nan),
 (u'addlepate', nan),
 (u'jagello', nan),
 (u'festenburg', nan),
 (u'picatic', nan),
 (u'tolosanum', nan),
 (u'mithoo', nan)]
I thought the problem might be what the gensim author Radim Řehůřek described:
Thanks h3im.
Both numbers are identical, so there’s no problem with the dictionary/input.
I had another idea — inside the cython code, the maximum sentence length is clipped to 1,000 words. Any sentence longer than that will only consider the first 1,000 words.
In your case, you’re storing entire documents as a single sentence (1 wiki doc = 1 sentence). So this restriction may be kicking in.
Can you try increasing `DEF MAX_SENTENCE_LEN = 1000` to 10k for example, in word2vec_inner.pyx?
Or, alternatively, split documents into sentences, so each sentence is < 1,000 words long. Let me know, Radim
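Radim's second suggestion, splitting each article line into pieces of at most 1,000 words before training, is easy to script. Below is a minimal sketch, assuming the one-article-per-line wiki.en.text produced above; the output file name is just a placeholder:

MAX_WORDS = 1000

with open("wiki.en.text") as fin, open("wiki.en.split.text", "w") as fout:
    for line in fin:
        words = line.split()
        # emit the article as several "sentences" of at most MAX_WORDS tokens
        for start in range(0, len(words), MAX_WORDS):
            fout.write(" ".join(words[start:start + MAX_WORDS]) + "\n")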
But I found that my gensim (version ‘0.10.3’) already sets it to 10k. I tried a small text, “wiki.en.10w”, made from the top 100,000 lines of wiki.en.text, trained a word2vec model “wiki.en.10w.model” with the script train_word2vec_model.py, and found that everything is ok:
In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load("wiki.en.10w.model")

In [3]: model.most_similar("queen")
Out[3]:
[(u'princess', 0.5976558327674866),
 (u'elizabeth', 0.591829776763916),
 (u'consort', 0.5514105558395386),
 (u'drottningens', 0.5454206466674805),
 (u'regnant', 0.5419434309005737),
 (u'f\xf6delsedag', 0.5259706974029541),
 (u'saovabha', 0.5250850915908813),
 (u'margrethe', 0.5195728540420532),
 (u'mary', 0.5035395622253418),
 (u'armgard', 0.5028442144393921)]

In [4]: model.most_similar("man")
Out[4]:
[(u'woman', 0.6305292844772339),
 (u'boy', 0.5495858788490295),
 (u'girl', 0.5382533073425293),
 (u'bespectacled', 0.44303444027900696),
 (u'eutychus', 0.43531811237335205),
 (u'coochie', 0.42641448974609375),
 (u'soldier', 0.4228038191795349),
 (u'hater', 0.4212420582771301),
 (u'mannish', 0.4139400124549866),
 (u'bellybutton', 0.4139178991317749)]

In [5]: model.similarity("man", "woman")
Out[5]: 0.63052930788363182

In [6]: model.similarity("girl", "woman")
Out[6]: 0.59083314898425321
I decided to also save the original word2vec text format in order to debug it, so I modified train_word2vec_model.py like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys
import multiprocessing

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    model.save(outp1)
    model.save_word2vec_format(outp2, binary=False)
Then execute “python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector”:
2015-03-09 22:48:29,588: INFO: running train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector
2015-03-09 22:48:29,593: INFO: collecting all words and their counts
2015-03-09 22:48:29,607: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-09 22:48:50,686: INFO: PROGRESS: at sentence #10000, processed 29353579 words and 430650 word types
2015-03-09 22:49:08,476: INFO: PROGRESS: at sentence #20000, processed 54695775 words and 610833 word types
2015-03-09 22:49:22,985: INFO: PROGRESS: at sentence #30000, processed 75344844 words and 742274 word types
2015-03-09 22:49:35,607: INFO: PROGRESS: at sentence #40000, processed 93430415 words and 859131 word types
2015-03-09 22:49:44,125: INFO: PROGRESS: at sentence #50000, processed 106057188 words and 935606 word types
2015-03-09 22:49:49,185: INFO: PROGRESS: at sentence #60000, processed 114319016 words and 952771 word types
2015-03-09 22:49:53,316: INFO: PROGRESS: at sentence #70000, processed 121263134 words and 969526 word types
2015-03-09 22:49:57,268: INFO: PROGRESS: at sentence #80000, processed 127773799 words and 984130 word types
2015-03-09 22:50:07,593: INFO: PROGRESS: at sentence #90000, processed 142688762 words and 1062932 word types
2015-03-09 22:50:19,162: INFO: PROGRESS: at sentence #100000, processed 159550824 words and 1157644 word types
......
2015-03-09 23:11:52,977: INFO: PROGRESS: at sentence #3700000, processed 1999452503 words and 7990138 word types
2015-03-09 23:11:55,367: INFO: PROGRESS: at sentence #3710000, processed 2002777270 words and 8002903 word types
2015-03-09 23:11:57,842: INFO: PROGRESS: at sentence #3720000, processed 2006213923 words and 8019620 word types
2015-03-09 23:12:00,439: INFO: PROGRESS: at sentence #3730000, processed 2009762733 words and 8035408 word types
2015-03-09 23:12:02,793: INFO: PROGRESS: at sentence #3740000, processed 2013066196 words and 8045218 word types
2015-03-09 23:12:05,178: INFO: PROGRESS: at sentence #3750000, processed 2016363087 words and 8057784 word types
2015-03-09 23:12:07,013: INFO: collected 8069236 word types from a corpus of 2018886604 words and 3758076 sentences
2015-03-09 23:12:12,230: INFO: total 1969354 word types after removing those with count<5
2015-03-09 23:12:12,230: INFO: constructing a huffman tree from 1969354 words
2015-03-09 23:14:07,415: INFO: built huffman tree with maximum node depth 29
2015-03-09 23:14:09,790: INFO: resetting layer weights
2015-03-09 23:15:04,506: INFO: training model with 4 workers on 1969354 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-09 23:15:19,112: INFO: PROGRESS: at 0.01% words, alpha 0.02500, 19098 words/s
2015-03-09 23:15:20,224: INFO: PROGRESS: at 0.03% words, alpha 0.02500, 37671 words/s
2015-03-09 23:15:22,305: INFO: PROGRESS: at 0.07% words, alpha 0.02500, 75393 words/s
2015-03-09 23:15:27,712: INFO: PROGRESS: at 0.08% words, alpha 0.02499, 65618 words/s
2015-03-09 23:15:29,452: INFO: PROGRESS: at 0.09% words, alpha 0.02500, 70966 words/s
2015-03-09 23:15:34,032: INFO: PROGRESS: at 0.11% words, alpha 0.02498, 77369 words/s
2015-03-09 23:15:37,249: INFO: PROGRESS: at 0.12% words, alpha 0.02498, 74935 words/s
2015-03-09 23:15:40,618: INFO: PROGRESS: at 0.14% words, alpha 0.02498, 75399 words/s
2015-03-09 23:15:42,301: INFO: PROGRESS: at 0.16% words, alpha 0.02497, 86029 words/s
2015-03-09 23:15:46,283: INFO: PROGRESS: at 0.17% words, alpha 0.02497, 83033 words/s
2015-03-09 23:15:48,374: INFO: PROGRESS: at 0.18% words, alpha 0.02497, 83370 words/s
2015-03-09 23:15:51,398: INFO: PROGRESS: at 0.19% words, alpha 0.02496, 82794 words/s
2015-03-09 23:15:55,069: INFO: PROGRESS: at 0.21% words, alpha 0.02496, 83753 words/s
2015-03-09 23:15:57,718: INFO: PROGRESS: at 0.23% words, alpha 0.02496, 85031 words/s
2015-03-09 23:16:00,106: INFO: PROGRESS: at 0.24% words, alpha 0.02495, 86567 words/s
2015-03-09 23:16:05,523: INFO: PROGRESS: at 0.26% words, alpha 0.02495, 84850 words/s
2015-03-09 23:16:06,596: INFO: PROGRESS: at 0.27% words, alpha 0.02495, 87926 words/s
2015-03-09 23:16:09,500: INFO: PROGRESS: at 0.29% words, alpha 0.02494, 88618 words/s
2015-03-09 23:16:10,714: INFO: PROGRESS: at 0.30% words, alpha 0.02494, 91023 words/s
2015-03-09 23:16:18,467: INFO: PROGRESS: at 0.32% words, alpha 0.02494, 85960 words/s
2015-03-09 23:16:19,547: INFO: PROGRESS: at 0.33% words, alpha 0.02493, 89140 words/s
2015-03-09 23:16:23,500: INFO: PROGRESS: at 0.36% words, alpha 0.02493, 92026 words/s
2015-03-09 23:16:29,738: INFO: PROGRESS: at 0.37% words, alpha 0.02491, 88180 words/s
2015-03-09 23:16:32,000: INFO: PROGRESS: at 0.40% words, alpha 0.02492, 92734 words/s
2015-03-09 23:16:34,392: INFO: PROGRESS: at 0.42% words, alpha 0.02491, 93300 words/s
2015-03-09 23:16:41,018: INFO: PROGRESS: at 0.43% words, alpha 0.02490, 89727 words/s
.......
2015-03-10 05:03:31,849: INFO: PROGRESS: at 99.20% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:32,901: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:34,296: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:35,635: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95349 words/s
2015-03-10 05:03:36,730: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:37,489: INFO: reached the end of input; waiting to finish 8 outstanding jobs
2015-03-10 05:03:37,908: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:39,028: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:40,127: INFO: PROGRESS: at 99.24% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:40,910: INFO: training on 1994415728 words took 20916.4s, 95352 words/s
2015-03-10 05:03:41,058: INFO: saving Word2Vec object under wiki.en.text.model, separately None
2015-03-10 05:03:41,209: INFO: not storing attribute syn0norm
2015-03-10 05:03:41,209: INFO: storing numpy array 'syn0' to wiki.en.text.model.syn0.npy
2015-03-10 05:04:35,199: INFO: storing numpy array 'syn1' to wiki.en.text.model.syn1.npy
2015-03-10 05:11:25,400: INFO: storing 1969354x400 projection weights into wiki.en.text.vector
After about 7 hours, we get the text-format word2vec model wiki.en.text.vector:
1969354 400
the 0.129255 0.015725 0.049174 -0.016438 -0.018912 0.032752 0.079885 0.033669 -0.077722 -0.025709 0.012775 0.044153 0.134307 0.070499 -0.002243 0.105198 -0.016832 -0.028631 -0.124312 -0.123064 -0.116838 0.051181 -0.096058 -0.049734 0.017380 -0.101221 0.058945 0.013669 -0.012755 0.061053 0.061813 0.083655 -0.069382 -0.069868 0.066529 -0.037156 -0.072935 -0.009470 0.037412 -0.004406 0.047011 0.005033 -0.066270 -0.031815 0.023160 -0.080117 0.172918 0.065486 -0.072161 0.062875 0.019939 -0.048380 0.198152 -0.098525 0.023434 0.079439 0.045150 -0.079479 -0.051441 -0.021556 -0.024981 -0.045291 0.040284 -0.082500 0.014618 -0.071998 0.031887 0.043916 0.115783 -0.174898 0.086603 -0.023124 0.007293 -0.066576 -0.164817 -0.081223 0.058412 0.000132 0.064160 0.055848 0.029776 -0.103420 -0.007541 -0.031742 0.082533 -0.061760 -0.038961 0.001754 -0.023977 0.069616 0.095920 0.017136 0.067126 -0.111310 0.053632 0.017633 -0.003875 -0.005236 0.063151 0.039729 -0.039158 0.001415 0.021754 -0.012540 0.015070 -0.062636 -0.013605 -0.031770 0.005296 -0.078119 -0.069303 -0.080634 -0.058377 0.024398 -0.028173 0.026353 0.088662 0.018755 -0.113538 0.055538 -0.086012 -0.027708 -0.028788 0.017759 0.029293 0.047674 -0.106734 -0.134380 0.048605 -0.089583 0.029426 0.030552 0.141916 -0.022653 0.017204 -0.036059 0.061045 -0.000077 -0.076579 0.066747 0.060884 -0.072817 -0.051195 0.017663 0.043462 0.027486 -0.040694 0.025904 -0.075665 -0.000057 -0.076601 0.006704 -0.078985 -0.027770 -0.038087 0.097482 -0.001861 0.003741 -0.010897 0.042828 -0.037804 0.041546 -0.018394 -0.092459 0.010917 -0.004262 -0.113903 -0.037155 0.066674 0.096078 -0.114286 0.027908 -0.003139 -0.007529 -0.076928 0.025825 -0.090934 -0.013763 -0.057434 0.071827 -0.031783 -0.052096 0.107292 0.001864 -0.020808 0.043721 -0.024951 -0.046789 0.092858 0.037771 -0.006570 0.018282 -0.013571 -0.069215 0.019530 -0.080015 -0.078925 0.003094 0.044550 -0.046577 0.004945 -0.010885 -0.098681 0.044861 0.001618 -0.077582 -0.013834 0.024985 0.008541 -0.011861 0.023718 -0.018038 0.004162 -0.005827 -0.036836 0.081241 -0.028473 0.043937 0.005622 -0.004714 -0.029995 0.002236 -0.044635 -0.100051 0.006926 0.012636 -0.132891 -0.097755 -0.118586 0.038355 -0.034691 0.027983 0.074292 0.075199 0.033331 0.067474 -0.023996 0.024614 -0.039520 -0.110454 0.046004 -0.047849 0.023945 -0.022695 -0.053563 0.035277 0.011309 0.044326 0.026382 0.043251 0.004535 0.112228 0.022841 -0.068083 -0.122575 -0.053305 -0.005031 -0.078522 -0.044147 0.083576 0.005531 -0.063187 -0.032841 -0.067989 0.111359 0.125724 0.074154 0.040301 0.082240 0.015494 -0.066648 0.091087 0.095067 -0.059386 0.003256 -0.006734 -0.058248 0.020567 -0.006784 -0.017885 0.146956 -0.014679 -0.019453 -0.009875 -0.031508 0.002070 -0.002830 0.060321 0.056237 -0.080740 0.017465 0.016851 -0.067723 -0.061582 0.028104 0.067970 -0.024162 0.027407 0.075006 0.084483 -0.011534 0.129151 -0.072387 0.083424 -0.009501 0.041553 0.016603 0.002965 -0.027677 -0.110295 0.033986 0.028290 0.049621 0.001125 -0.018187 -0.001404 -0.024074 0.025322 -0.023594 -0.076071 0.107616 0.091381 -0.116943 0.109416 -0.045990 0.024346 0.152548 -0.010692 0.120887 -0.012670 -0.044978 -0.050880 -0.012535 -0.080475 0.036055 -0.050770 0.040417 -0.030957 -0.013680 0.001236 0.010180 -0.040136 -0.118249 0.017540 0.107725 -0.118492 -0.032438 -0.009072 -0.081345 -0.022384 0.045453 -0.008754 -0.098392 -0.113199 0.023589 0.017172 0.108523 -0.029611 0.041029 0.005958 0.010155 -0.036815 0.073110 -0.048424 -0.029022 -0.016711 -0.126587 0.045923 0.018589 0.113195 -0.002896 -0.051350 -0.007355 
0.012278 0.093481 0.093676 -0.145230 -0.068279 -0.068407 0.008837 -0.012186 -0.136079 0.087961 0.041402 -0.058727 0.003030 0.008455 -0.062826 -0.139834 -0.014068 -0.115521 -0.117215 0.093502 0.026607 0.095726 -0.016339 0.033879 -0.022889 0.023565 0.028705
…
We test it in IPython as follows; note that wiki.en.text.vector is about 7 GB and takes a long time to load:
In [2]: import gensim

In [3]: model = gensim.models.Word2Vec.load_word2vec_format("wiki.en.text.vector", binary=False)

In [4]: model.most_similar("queen")
Out[4]:
[(u'princess', 0.5760838389396667),
 (u'hyoui', 0.5671186447143555),
 (u'janggyung', 0.5598698854446411),
 (u'king', 0.5556215047836304),
 (u'dollallolla', 0.5540223121643066),
 (u'loranella', 0.5522741079330444),
 (u'ramphaiphanni', 0.5310937166213989),
 (u'jeheon', 0.5298476219177246),
 (u'soheon', 0.5243583917617798),
 (u'coronation', 0.5217245221138)]

In [5]: model.most_similar("man")
Out[5]:
[(u'woman', 0.7120707035064697),
 (u'girl', 0.58659827709198),
 (u'handsome', 0.5637181997299194),
 (u'boy', 0.5425317287445068),
 (u'villager', 0.5084836483001709),
 (u'mustachioed', 0.49287813901901245),
 (u'mcgucket', 0.48355430364608765),
 (u'spider', 0.4804879426956177),
 (u'policeman', 0.4780033826828003),
 (u'stranger', 0.4750771224498749)]

In [6]: model.most_similar("woman")
Out[6]:
[(u'man', 0.7120705842971802),
 (u'girl', 0.6736541986465454),
 (u'prostitute', 0.5765659809112549),
 (u'divorcee', 0.5429972410202026),
 (u'person', 0.5276163816452026),
 (u'schoolgirl', 0.5102938413619995),
 (u'housewife', 0.48748138546943665),
 (u'lover', 0.4858251214027405),
 (u'handsome', 0.4773051142692566),
 (u'boy', 0.47445783019065857)]

In [8]: model.similarity("woman", "man")
Out[8]: 0.71207063453821218

In [10]: model.doesnt_match("breakfast cereal dinner lunch".split())
Out[10]: 'cereal'

In [11]: model.similarity("woman", "girl")
Out[11]: 0.67365416785207421

In [13]: model.most_similar("frog")
Out[13]:
[(u'toad', 0.6868536472320557),
 (u'barycragus', 0.6607867479324341),
 (u'grylio', 0.626731276512146),
 (u'heckscheri', 0.6208407878875732),
 (u'clamitans', 0.6150864362716675),
 (u'coplandi', 0.612680196762085),
 (u'pseudacris', 0.6108512878417969),
 (u'litoria', 0.6084023714065552),
 (u'raniformis', 0.6044802665710449),
 (u'watjulumensis', 0.6043726205825806)]
Everything is ok, but when we load the numpy-based model we still hit the “RuntimeWarning: invalid value encountered in divide” problem:
In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load("wiki.en.text.model")

In [3]: model.most_similar("man")
... RuntimeWarning: invalid value encountered in divide
  self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)

Out[3]:
[(u'ahsns', nan),
 (u'ny\xedl', nan),
 (u'indradeo', nan),
 (u'jaimovich', nan),
 (u'addlepate', nan),
 (u'jagello', nan),
 (u'festenburg', nan),
 (u'picatic', nan),
 (u'tolosanum', nan),
 (u'mithoo', nan)]
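To see how widespread the NaN problem is, we can inspect the saved embedding matrix directly. Here is a minimal sketch, assuming numpy is available and using the old attribute names (syn0, index2word) that the gensim 0.10.x version in this post exposes:

import gensim
import numpy as np

model = gensim.models.Word2Vec.load("wiki.en.text.model")

# indices of word vectors that contain NaN values
bad_rows = np.where(np.isnan(model.syn0).any(axis=1))[0]
print("vectors containing NaN: %d" % len(bad_rows))
print("example words: %s" % [model.index2word[i] for i in bad_rows[:10]])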
If you have read this far and have solved the problem I met, please point it out to me, thanks a lot.
Posted by TextMiner
Pingback: Applying GloVe Vectors using gensim word2vec | Nam Khanh Tran
hello
I am new to gensim and I can't install or use it. I want to do semantic clustering of text data, so I need doc2vec and then to extract the semantic words of documents. I have read the whole website about gensim, but I still don't know how to use it and I'm confused.
Please recommend a source that explains it clearly, step by step, from the basics.
You can use the Anaconda data science tools and install gensim inside it. Another way is to run the pip install gensim command in your terminal to install the gensim package.
did you figure out the issue?
Instead of using the model in text format (which is very slow to load), it is better to convert the model to binary:
model = word2vec.Word2Vec.load_word2vec_format('wiki.en.text.vector', binary=False)
model.save_word2vec_format('wiki.en.bin.vector', binary=True)
Best Regards
thanks a lot
Is the problem solved now?
Can you share the built model for use and testing?
Hi,
I get the error "print globals()['__doc__'] / locals()
TypeError: unsupported operand type(s) for /: 'NoneType' and 'dict'".
Please help me fix it.
Thanks
Hi,
I have the same error, have you fixed it?
Or does somebody else know how to solve this problem?
Thanks
You have to run it using specifically python2.7
Just add parentheses '()' to the print statement, like this:
print( globals()['__doc__'] % locals() )
Could you tell me how to select the right number of iterations? Or how do I know how many iterations are needed for the word vectors to stabilize? Many thanks.
Hi,
Is there a way to search for articles in the Wikipedia XML dump based on their categories, extract only the articles that belong to particular categories, and convert them to plain text? Thanks
sorry, cannot help you for that
#TextMiner @TextMiner
plz help
File "C:\Program Files\Python36-32\lib\site-packages\gensim\models\base_any2vec.py", line 601, in _check_training_sanity
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model
YES, you can use the Wikipedia API or alternative tools to extract the articles which belong to a specific category.
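For example, here is a minimal sketch that asks the MediaWiki API for the pages in one category via the categorymembers list; the category name is only illustrative, and the requests package is assumed to be installed:

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:Machine learning",  # illustrative category
    "cmlimit": "500",
    "format": "json",
}
data = requests.get(API, params=params).json()
# print the first ten page titles in the category
titles = [member["title"] for member in data["query"]["categorymembers"]]
print(titles[:10])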
How can I keep the punctuation in process_wiki.py?
You would have to write that yourself.
Can you share the .txt file with one Wikipedia article per line?
Thanks!
sorry, cannot help this
When I try to create the .text file from the Farsi Wikipedia it throws this exception:
Process InputQueue-8:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Library/Python/2.7/site-packages/gensim-0.13.2-py2.7-macosx-10.11-intel.egg/gensim/utils.py", line 822, in run
    wrapped_chunk = [list(chunk)]
  File "/Library/Python/2.7/site-packages/gensim-0.13.2-py2.7-macosx-10.11-intel.egg/gensim/corpora/wikicorpus.py", line 298, in <genexpr>
    texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
  File "/Library/Python/2.7/site-packages/gensim-0.13.2-py2.7-macosx-10.11-intel.egg/gensim/corpora/wikicorpus.py", line 212, in extract_pages
    for elem in elems:
  File "/Library/Python/2.7/site-packages/gensim-0.13.2-py2.7-macosx-10.11-intel.egg/gensim/corpora/wikicorpus.py", line 197, in <genexpr>
    elems = (elem for _, elem in iterparse(f, events=("end",)))
  File "", line 107, in next
ParseError: no element found: line 44, column 0
Did you find out what the issue was?
It seems to be a blank line in the text? Sorry, I can't help you more.
Hi, can anyone suggest how to train a word2vec model for Korean (kowiki)?
In [1]: import gensim
C:\ProgramData\Anaconda2\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

In [2]: model = gensim.models.word2vec.Word2Vec.load("F:\wikimedia\wiki.en.text.vector")
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
 in ()
----> 1 model = gensim.models.word2vec.Word2Vec.load("F:\wikimedia\wiki.en.text.vector")

C:\ProgramData\Anaconda2\lib\site-packages\gensim\models\word2vec.pyc in load(cls, *args, **kwargs)
   1408     @classmethod
   1409     def load(cls, *args, **kwargs):
-> 1410         model = super(Word2Vec, cls).load(*args, **kwargs)
   1411         # update older models
   1412         if hasattr(model, 'table'):

C:\ProgramData\Anaconda2\lib\site-packages\gensim\utils.pyc in load(cls, fname, mmap)
    269         compress, subname = SaveLoad._adapt_by_suffix(fname)
    270
--> 271         obj = unpickle(fname)
    272         obj._load_specials(fname, mmap, compress, subname)
    273         logger.info("loaded %s", fname)

C:\ProgramData\Anaconda2\lib\site-packages\gensim\utils.pyc in unpickle(fname)
    933             return _pickle.load(f, encoding='latin1')
    934         else:
--> 935             return _pickle.loads(f.read())
    936
    937

UnpicklingError: unpickling stack underflow
Please tell me how to solve this problem.
Actually, I have answered your question on 52nlp.cn, and cannot help you more.
Hey, first of all thank you for the tutorial.
I, as a non-programmer, already have problems at the beginning:
What exactly do I need to do for “Execute python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text” (under the first picture)? Do I just paste this into the Windows command line?
I think you should learn something about python first
Thank you so much !
Hi, I am using Anaconda 3. I am trying to run this program in the internal terminal of Anaconda, but every time I try to run the code it says there is no file or directory named process_wiki.py, even though I am running it from the same directory where the program is stored.
sorry that was a silly mistake…
this is the current problem…
Traceback (most recent call last):
  File "process_wiki.py", line 30, in <module>
    output.write(b' '.join(text).decode('utf-8') + '\n')
TypeError: sequence item 0: expected a bytes-like object, str found
replaced b' '.join(text).decode('utf-8')
with str(b'', 'utf-8')
You are writing an empty string with str(b'', 'utf-8') … how will that work? It will just produce an empty file……
If I replace it with the suggested one, it outputs an empty file, and if I keep the original code it shows the same error … Plz help
By adding this line I think I solved that problem: text = [x.encode('utf-8') for x in text]
for text in wiki.get_texts():
    text = [x.encode('utf-8') for x in text]
    if six.PY3:
        output.write(b' '.join(text).decode('utf-8') + '\n')
Hello, may I know why you do not share the extracted model?
I spent one day extracting the model but at the end I encountered an error. It would be very helpful for everyone if you shared your trained model. Many thanks.
Hi!
I need to process the text, so I was using your process_wiki.py code. But it is not working and I can see no logging information even after several hours. Any suggestions?
Can you tell me how you ran the script, or share any error info?
Thanks for the great sharing.
Some experience to share when I run the code:
1. encoding method: on my machine, I had to change line 1623 in word2vec.py (in class LineSentence) to `line = utils.to_unicode(line, errors='ignore').split()`, otherwise it always complains about a `utf-8` encoding problem
2. I manually split all the text into sentences of no more than 990 words, following the `MAX_SENTENCE_LEN = 1000` definition, and then `model.most_similar("queen")` works
Hope it can provide some help
Hi Radim,
I think I have a solution to your problem with the Wikipedia text.
I think what you are looking for is to convert special characters, like those in French words, to plain ASCII:
```
>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'
```
Reference:
https://stackoverflow.com/questions/2365411/python-convert-unicode-to-ascii-without-errors
Let me know if I can submit a PR to the gensim repo.
Hi
Thanks for your awesome tutorial.
I guess that to take bigrams and trigrams into account, we should change this part of the code:
model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
workers=multiprocessing.cpu_count())
to this:
common_terms = ["of", "with", "without", "and", "or", "the", "a"]
bigram_transformer = gensim.models.Phrases(LineSentence(inp), common_terms=common_terms)
trigram_transformer = gensim.models.Phrases(bigram_transformer[LineSentence(inp)], common_terms=common_terms)
model = Word2Vec(trigram_transformer[LineSentence(inp)], size=400, window=5, min_count=5,
                 workers=multiprocessing.cpu_count())
I will try and will let you know if it works.
yes, I tried, it works perfectly.
I was trying to process the Wikipedia corpus…
Tried both ways and variations of them to counter the errors shown, but nothing is working…
The errors are as follows…
Traceback (most recent call last):
  File "process_wiki.py", line 30, in <module>
    output.write(b' '.join(text).decode('utf-8') + '\n')
TypeError: sequence item 0: expected bytes, bytearray, or an object with the buffer interface, str found

Traceback (most recent call last):
  File "process_wiki.py", line 30, in <module>
    output.write(str(b''.decode('utf-8')).join(text) + '\n')
  File "D:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1266-1267: character maps to <undefined>

Traceback (most recent call last):
  File "process_wiki.py", line 32, in <module>
    output.write(space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
  File "process_wiki.py", line 32, in <lambda>
    output.write(space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
AttributeError: 'str' object has no attribute 'decode'

Traceback (most recent call last):
  File "process_wiki.py", line 32, in <module>
    output.write(space.join(map(lambda x:x, text)) + '\n')
  File "D:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1470-1471: character maps to <undefined>
Can you share the built model for use and testing?
Sorry, the model was trained a long time ago and I cannot find it. You can refer to: https://github.com/mmihaltz/word2vec-GoogleNews-vectors
File "C:\Program Files\Python36-32\lib\site-packages\gensim\models\base_any2vec.py", line 601, in _check_training_sanity
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model
I checked and the text file wiki.en.text is empty… but I am confused why it comes out empty… I used process_wiki.py.
Plz help
Sorry, I only used Python 2.7 under Ubuntu and Mac OS, not on Windows, so I cannot help you.
How can we save the list of words to a csv or excel file after getting the output from the model?
Why save into csv or excel? You can get the word list from the gensim model.
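If you do want a CSV, here is a minimal sketch that dumps the vocabulary and word counts of the model trained above; the attribute names (index2word, vocab) follow the old gensim API used in this post, while newer versions expose them under model.wv:

import csv
import gensim

model = gensim.models.Word2Vec.load("wiki.en.text.model")

with open("vocab.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "count"])
    # one row per vocabulary word with its corpus frequency
    for word in model.index2word:
        # encode for the Python 2 csv writer used in this post
        writer.writerow([word.encode("utf-8"), model.vocab[word].count])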
What is the difference between the models saved in outp1 and outp2 here:
```
model.save(outp1)
model.save_word2vec_format(outp2, binary=False)
```
Where can I find a Word2Vec model trained on Spanish Wikipedia?
https://github.com/aitoralmeida/spanish_word2vec
How can I create my own corpus and use it for Doc2Vec?
The train file and test file are available on my local drive.
How can I access these two files using gensim?
Where should I run process_wiki.py? Thanks
After downloading the wiki dump file, you can run it anywhere, as long as the process_wiki.py script and the wiki bz2 data are in the same directory: python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
Hi, first of all thank you for this wonderful initiative. I am trying to run the script “train_word2vec_model.py” (the modified one); after 20 minutes of running it gives me the error: MemoryError: unable to allocate array with shape and data type float32.