Training Word2Vec Model on English Wikipedia by Gensim

After learning about word2vec and GloVe, a natural next step is to train a model on a larger corpus, and English Wikipedia is an ideal choice for this task. After googling related keywords like "word2vec wikipedia" and "gensim word2vec wikipedia", I found that the discussion in the gensim Google Groups post "training word2vec on full Wikipedia" gives a proper solution for this task. Note that there is another option, wiki2vec, but I think the gensim word2vec approach is simpler and just as effective.

The English Wikipedia dump I downloaded (dated 2015-03-01, about 11 GB) is here:

https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
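
If you prefer to fetch the dump from a script, here is a minimal Python 3 sketch (my addition, not from the original post; the file is about 11 GB, so a resumable downloader such as wget -c may be more practical):

# Download the English Wikipedia dump with the Python 3 standard library only.
# Note: urlretrieve cannot resume, so an interrupted download starts over.
import urllib.request

URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
urllib.request.urlretrieve(URL, "enwiki-latest-pages-articles.xml.bz2")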

First, we need to convert the XML-format Wikipedia dump to plain text. Copy the code from the post into process_wiki.py:
Note: I have modified it into a new version that is compatible with both Python 2.x and Python 3.x (2017.05.01).

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017
 
from __future__ import print_function
 
import logging
import os.path
import six
import sys
 
from gensim.corpora import WikiCorpus
 
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
 
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
 
    # check and process input arguments
    if len(sys.argv) != 3:
        print("Using: python process_wiki.py enwiki.xxx.xml.bz2 wiki.en.text")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0
 
    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(b' '.join(text).decode('utf-8') + '\n')
        #   ###another method###
        #    output.write(
        #            space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
        else:
            output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")
 
    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

Note there is a small difference here:

wiki = WikiCorpus(inp, lemmatize=False, dictionary={})

The original is: wiki = WikiCorpus(inp, dictionary={})

We set lemmatize to False so that the pattern library is not used, because pattern slows the processing down considerably.
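
Before committing to the full multi-hour conversion, it can be worth streaming a single article to check that WikiCorpus produces what you expect. A minimal sanity-check sketch (my addition, assuming the same gensim version and WikiCorpus API as in process_wiki.py above; on old gensim versions the tokens come back as bytes under Python 3):

# Stream the first kept article from the dump and inspect its tokens.
from gensim.corpora import WikiCorpus

wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", lemmatize=False, dictionary={})
first_article = next(wiki.get_texts())  # a list of tokens for the first article
print(len(first_article), "tokens")
print(first_article[:20])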

Execute "python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text" and we get:

2015-03-07 15:08:39,181: INFO: running process_enwiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
2015-03-07 15:11:12,860: INFO: Saved 10000 articles
2015-03-07 15:13:25,369: INFO: Saved 20000 articles
2015-03-07 15:15:19,771: INFO: Saved 30000 articles
2015-03-07 15:16:58,424: INFO: Saved 40000 articles
2015-03-07 15:18:12,374: INFO: Saved 50000 articles
2015-03-07 15:19:03,213: INFO: Saved 60000 articles
2015-03-07 15:19:47,656: INFO: Saved 70000 articles
2015-03-07 15:20:29,135: INFO: Saved 80000 articles
2015-03-07 15:22:02,365: INFO: Saved 90000 articles
2015-03-07 15:23:40,141: INFO: Saved 100000 articles
.....
2015-03-07 19:33:16,549: INFO: Saved 3700000 articles
2015-03-07 19:33:49,493: INFO: Saved 3710000 articles
2015-03-07 19:34:23,442: INFO: Saved 3720000 articles
2015-03-07 19:34:57,984: INFO: Saved 3730000 articles
2015-03-07 19:35:31,976: INFO: Saved 3740000 articles
2015-03-07 19:36:05,790: INFO: Saved 3750000 articles
2015-03-07 19:36:32,392: INFO: finished iterating over Wikipedia corpus of 3758076 documents with 2018886604 positions (total 15271374 articles, 2075130438 positions before pruning articles shorter than 50 words)
2015-03-07 19:36:32,394: INFO: Finished Saved 3758076 articles

After about 5 hours of processing on my Mac Pro (4-core CPU, 16 GB RAM), we get a 12 GB wiki.en.text with one article per line, like the example below; the punctuation has been discarded:

anarchism is collection of movements and ideologies that hold the state to be undesirable unnecessary or harmful these movements advocate some form of stateless society instead often based on self governed voluntary institutions or non hierarchical free associations although anti statism is central to anarchism as political philosophy anarchism also entails rejection of and often hierarchical organisation in general as an anti dogmatic philosophy anarchism draws on many currents of thought and strategy anarchism does not offer fixed body of doctrine from single particular world view instead fluxing and flowing as philosophy there are many types and traditions of anarchism not all of which are mutually exclusive anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications anarchism is usually considered radical left wing ideology and much of anarchist economics and anarchist legal philosophy reflect anti authoritarian interpretations of communism collectivism syndicalism mutualism or participatory economics etymology and terminology the term anarchism is compound word composed from the word anarchy and the suffix ism themselves derived respectively from the greek anarchy from anarchos meaning one without rulers from the privative prefix ἀν an without and archos leader ruler cf archon or arkhē authority sovereignty realm magistracy and the suffix or ismos isma from the verbal infinitive suffix...

The second step is training the word2vec model from the text. You could use the original word2vec binary or the GloVe binary to train a model, as with the text8 file, but that seems very slow. Following the post, we use the gensim Word2Vec model to train on English Wikipedia. Copy the code from the post into train_word2vec_model.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import os.path
import sys
import multiprocessing
 
from gensim.corpora import  WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
 
 
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
 
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
 
    # check and process input arguments
 
    if len(sys.argv) < 3:
        print("Using: python train_word2vec_model.py wiki.en.text wiki.en.word2vec.model")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
 
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count())
 
    # trim unneeded model memory = use (much) less RAM
    model.init_sims(replace=True)
 
    model.save(outp)

Execute “python train_word2vec_model.py wiki.en.text wiki.en.word2vec.model”:

2015-03-07 20:11:47,796 : INFO :  running train_word2vec_model.py wiki.en.texx wiki.en.word2vec.model
2015-03-07 20:11:47,801 : INFO :  collecting all words and their counts
2015-03-07 20:11:47,823 : INFO :  PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-07 20:12:09,816 : INFO :  PROGRESS: at sentence #10000, processed 29353579 words and 430650 word types
2015-03-07 20:12:29,920 : INFO :  PROGRESS: at sentence #20000, processed 54695775 words and 610833 word types
2015-03-07 20:12:45,654 : INFO :  PROGRESS: at sentence #30000, processed 75344844 words and 742274 word types
2015-03-07 20:13:02,623 : INFO :  PROGRESS: at sentence #40000, processed 93430415 words and 859131 word types
2015-03-07 20:13:13,613 : INFO :  PROGRESS: at sentence #50000, processed 106057188 words and 935606 word types
2015-03-07 20:13:20,383 : INFO :  PROGRESS: at sentence #60000, processed 114319016 words and 952771 word types
2015-03-07 20:13:25,511 : INFO :  PROGRESS: at sentence #70000, processed 121263134 words and 969526 word types
2015-03-07 20:13:30,756 : INFO :  PROGRESS: at sentence #80000, processed 127773799 words and 984130 word types
2015-03-07 20:13:42,144 : INFO :  PROGRESS: at sentence #90000, processed 142688762 words and 1062932 word types
2015-03-07 20:13:54,513 : INFO :  PROGRESS: at sentence #100000, processed 159550824 words and 1157644 word types
......
2015-03-07 20:36:02,246 : INFO :  PROGRESS: at sentence #3700000, processed 1999452503 words and 7990138 word types
2015-03-07 20:36:04,786 : INFO :  PROGRESS: at sentence #3710000, processed 2002777270 words and 8002903 word types
2015-03-07 20:36:07,423 : INFO :  PROGRESS: at sentence #3720000, processed 2006213923 words and 8019620 word types
2015-03-07 20:36:10,115 : INFO :  PROGRESS: at sentence #3730000, processed 2009762733 words and 8035408 word types
2015-03-07 20:36:12,595 : INFO :  PROGRESS: at sentence #3740000, processed 2013066196 words and 8045218 word types
2015-03-07 20:36:15,120 : INFO :  PROGRESS: at sentence #3750000, processed 2016363087 words and 8057784 word types
2015-03-07 20:36:17,057 : INFO :  collected 8069236 word types from a corpus of 2018886604 words and 3758076 sentences
2015-03-07 20:36:22,710 : INFO :  total 1969354 word types after removing those with count<5
2015-03-07 20:36:22,710 : INFO :  constructing a huffman tree from 1969354 words
2015-03-07 20:38:20,767 : INFO :  built huffman tree with maximum node depth 29
2015-03-07 20:38:23,219 : INFO :  resetting layer weights
2015-03-07 20:39:18,277 : INFO :  training model with 4 workers on 1969354 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-07 20:39:33,141 : INFO :  PROGRESS: at 0.01% words, alpha 0.02500, 18766 words/s
2015-03-07 20:39:34,874 : INFO :  PROGRESS: at 0.05% words, alpha 0.02500, 56782 words/s
2015-03-07 20:39:35,886 : INFO :  PROGRESS: at 0.07% words, alpha 0.02500, 76206 words/s
2015-03-07 20:39:41,163 : INFO :  PROGRESS: at 0.08% words, alpha 0.02499, 66533 words/s
2015-03-07 20:39:43,442 : INFO :  PROGRESS: at 0.09% words, alpha 0.02500, 70345 words/s
2015-03-07 20:39:47,604 : INFO :  PROGRESS: at 0.11% words, alpha 0.02498, 77893 words/s
......
2015-03-08 02:33:26,624 : INFO :  PROGRESS: at 99.19% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:27,976 : INFO :  PROGRESS: at 99.20% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:29,097 : INFO :  PROGRESS: at 99.21% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:30,465 : INFO :  PROGRESS: at 99.21% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:31,768 : INFO :  PROGRESS: at 99.22% words, alpha 0.00020, 93813 words/s
2015-03-08 02:33:32,839 : INFO :  PROGRESS: at 99.22% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:32,839 : INFO :  PROGRESS: at 99.22% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:33,535 : INFO :  reached the end of input; waiting to finish 8 outstanding jobs
2015-03-08 02:33:33,939 : INFO :  PROGRESS: at 99.23% words, alpha 0.00019, 93815 words/s
2015-03-08 02:33:34,998 : INFO :  PROGRESS: at 99.23% words, alpha 0.00019, 93815 words/s
2015-03-08 02:33:36,127 : INFO :  PROGRESS: at 99.24% words, alpha 0.00019, 93815 words/s
2015-03-08 02:33:36,961 : INFO :  training on 1994415728 words took 21258.7s, 93817 words/s
2015-03-08 02:33:36,996 : INFO :  precomputing L2-norms of word weight vectors
2015-03-08 02:33:58,490 : INFO :  saving Word2Vec object under wiki.en.word2vec.model, separately None
2015-03-08 02:33:58,666 : INFO :  not storing attribute syn0norm
2015-03-08 02:33:58,666 : INFO :  storing numpy array 'syn0' to wiki.en.word2vec.model.syn0.npy

After about 7 hours, we get the English Wikipedia model "wiki.en.word2vec.model", but something strange turns up when we query it:

In [1]: import gensim
 
In [2]: model = gensim.models.Word2Vec.load("wiki.en.word2vec.model")
 
In [3]: model.most_similar("queen")
...python2.7/site-packages/gensim/models/word2vec.py:827: RuntimeWarning: invalid value encountered in divide
  self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
Out[3]: 
[(u'ahsns', nan),
 (u'ny\xedl', nan),
 (u'indradeo', nan),
 (u'jaimovich', nan),
 (u'addlepate', nan),
 (u'jagello', nan),
 (u'festenburg', nan),
 (u'picatic', nan),
 (u'tolosanum', nan),
 (u'mithoo', nan)]

I thought the problem might be caused by what the gensim author Radim Řehůřek described:

Thanks h3im.

Both numbers are identical, so there’s no problem with the dictionary/input.

I had another idea — inside the cython code, the maximum sentence length is clipped to 1,000 words. Any sentence longer than that will only consider the first 1,000 words.

In your case, you’re storing entire documents as a single sentence (1 wiki doc = 1 sentence). So this restriction may be kicking in.

Can you try increasing `DEF MAX_SENTENCE_LEN = 1000` to 10k for example, in word2vec_inner.pyx?

Or, alternatively, split documents into sentences, so each sentence is < 1,000 words long. Let me know, Radim
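
For reference, a minimal sketch of that second suggestion (my addition, not from the original thread): split each one-article line of wiki.en.text into pseudo-sentences of at most 1,000 tokens before training.

# Split every article line into chunks of at most MAX_TOKENS tokens, so that
# no "sentence" exceeds word2vec's MAX_SENTENCE_LEN clipping limit.
MAX_TOKENS = 1000

with open("wiki.en.text") as fin, open("wiki.en.split.text", "w") as fout:
    for line in fin:
        tokens = line.split()
        for start in range(0, len(tokens), MAX_TOKENS):
            fout.write(" ".join(tokens[start:start + MAX_TOKENS]) + "\n")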

But I found that my gensim (version '0.10.3') already has MAX_SENTENCE_LEN set to 10k. I tried a small text file "wiki.en.10w" containing the top 100,000 lines of wiki.en.text, trained a word2vec model "wiki.en.10w.model" with the same train_word2vec_model.py script, and found that everything is OK:

In [1]: import gensim
 
In [2]: model = gensim.models.Word2Vec.load("wiki.en.10w.model")
 
In [3]: model.most_similar("queen")
Out[3]: 
[(u'princess', 0.5976558327674866),
 (u'elizabeth', 0.591829776763916),
 (u'consort', 0.5514105558395386),
 (u'drottningens', 0.5454206466674805),
 (u'regnant', 0.5419434309005737),
 (u'f\xf6delsedag', 0.5259706974029541),
 (u'saovabha', 0.5250850915908813),
 (u'margrethe', 0.5195728540420532),
 (u'mary', 0.5035395622253418),
 (u'armgard', 0.5028442144393921)]
 
In [4]: model.most_similar("man")
Out[4]: 
[(u'woman', 0.6305292844772339),
 (u'boy', 0.5495858788490295),
 (u'girl', 0.5382533073425293),
 (u'bespectacled', 0.44303444027900696),
 (u'eutychus', 0.43531811237335205),
 (u'coochie', 0.42641448974609375),
 (u'soldier', 0.4228038191795349),
 (u'hater', 0.4212420582771301),
 (u'mannish', 0.4139400124549866),
 (u'bellybutton', 0.4139178991317749)]
 
In [5]: model.similarity("man", "woman")
Out[5]: 0.63052930788363182
 
In [6]: model.similarity("girl", "woman")
Out[6]: 0.59083314898425321

I decided to also save the model in the original word2vec text format for debugging, so I modified train_word2vec_model.py like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
 
import logging
import os.path
import sys
import multiprocessing
 
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
 
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
 
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
 
    # check and process input arguments
    if len(sys.argv) < 4:
        print("Using: python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector")
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]
 
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
            workers=multiprocessing.cpu_count())
 
    # trim unneeded model memory = use(much) less RAM
    #model.init_sims(replace=True)
    model.save(outp1)
    model.save_word2vec_format(outp2, binary=False)

Then execute “python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector”:

2015-03-09 22:48:29,588: INFO: running train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector
2015-03-09 22:48:29,593: INFO: collecting all words and their counts
2015-03-09 22:48:29,607: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-09 22:48:50,686: INFO: PROGRESS: at sentence #10000, processed 29353579 words and 430650 word types
2015-03-09 22:49:08,476: INFO: PROGRESS: at sentence #20000, processed 54695775 words and 610833 word types
2015-03-09 22:49:22,985: INFO: PROGRESS: at sentence #30000, processed 75344844 words and 742274 word types
2015-03-09 22:49:35,607: INFO: PROGRESS: at sentence #40000, processed 93430415 words and 859131 word types
2015-03-09 22:49:44,125: INFO: PROGRESS: at sentence #50000, processed 106057188 words and 935606 word types
2015-03-09 22:49:49,185: INFO: PROGRESS: at sentence #60000, processed 114319016 words and 952771 word types
2015-03-09 22:49:53,316: INFO: PROGRESS: at sentence #70000, processed 121263134 words and 969526 word types
2015-03-09 22:49:57,268: INFO: PROGRESS: at sentence #80000, processed 127773799 words and 984130 word types
2015-03-09 22:50:07,593: INFO: PROGRESS: at sentence #90000, processed 142688762 words and 1062932 word types
2015-03-09 22:50:19,162: INFO: PROGRESS: at sentence #100000, processed 159550824 words and 1157644 word 
types
......
2015-03-09 23:11:52,977: INFO: PROGRESS: at sentence #3700000, processed 1999452503 words and 7990138 word types
2015-03-09 23:11:55,367: INFO: PROGRESS: at sentence #3710000, processed 2002777270 words and 8002903 word types
2015-03-09 23:11:57,842: INFO: PROGRESS: at sentence #3720000, processed 2006213923 words and 8019620 word types
2015-03-09 23:12:00,439: INFO: PROGRESS: at sentence #3730000, processed 2009762733 words and 8035408 word types
2015-03-09 23:12:02,793: INFO: PROGRESS: at sentence #3740000, processed 2013066196 words and 8045218 word types
2015-03-09 23:12:05,178: INFO: PROGRESS: at sentence #3750000, processed 2016363087 words and 8057784 word types
2015-03-09 23:12:07,013: INFO: collected 8069236 word types from a corpus of 2018886604 words and 3758076 sentences
2015-03-09 23:12:12,230: INFO: total 1969354 word types after removing those with count<5
2015-03-09 23:12:12,230: INFO: constructing a huffman tree from 1969354 words
2015-03-09 23:14:07,415: INFO: built huffman tree with maximum node depth 29
2015-03-09 23:14:09,790: INFO: resetting layer weights
2015-03-09 23:15:04,506: INFO: training model with 4 workers on 1969354 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-09 23:15:19,112: INFO: PROGRESS: at 0.01% words, alpha 0.02500, 19098 words/s
2015-03-09 23:15:20,224: INFO: PROGRESS: at 0.03% words, alpha 0.02500, 37671 words/s
2015-03-09 23:15:22,305: INFO: PROGRESS: at 0.07% words, alpha 0.02500, 75393 words/s
2015-03-09 23:15:27,712: INFO: PROGRESS: at 0.08% words, alpha 0.02499, 65618 words/s
2015-03-09 23:15:29,452: INFO: PROGRESS: at 0.09% words, alpha 0.02500, 70966 words/s
2015-03-09 23:15:34,032: INFO: PROGRESS: at 0.11% words, alpha 0.02498, 77369 words/s
2015-03-09 23:15:37,249: INFO: PROGRESS: at 0.12% words, alpha 0.02498, 74935 words/s
2015-03-09 23:15:40,618: INFO: PROGRESS: at 0.14% words, alpha 0.02498, 75399 words/s
2015-03-09 23:15:42,301: INFO: PROGRESS: at 0.16% words, alpha 0.02497, 86029 words/s
2015-03-09 23:15:46,283: INFO: PROGRESS: at 0.17% words, alpha 0.02497, 83033 words/s
2015-03-09 23:15:48,374: INFO: PROGRESS: at 0.18% words, alpha 0.02497, 83370 words/s
2015-03-09 23:15:51,398: INFO: PROGRESS: at 0.19% words, alpha 0.02496, 82794 words/s
2015-03-09 23:15:55,069: INFO: PROGRESS: at 0.21% words, alpha 0.02496, 83753 words/s
2015-03-09 23:15:57,718: INFO: PROGRESS: at 0.23% words, alpha 0.02496, 85031 words/s
2015-03-09 23:16:00,106: INFO: PROGRESS: at 0.24% words, alpha 0.02495, 86567 words/s
2015-03-09 23:16:05,523: INFO: PROGRESS: at 0.26% words, alpha 0.02495, 84850 words/s
2015-03-09 23:16:06,596: INFO: PROGRESS: at 0.27% words, alpha 0.02495, 87926 words/s
2015-03-09 23:16:09,500: INFO: PROGRESS: at 0.29% words, alpha 0.02494, 88618 words/s
2015-03-09 23:16:10,714: INFO: PROGRESS: at 0.30% words, alpha 0.02494, 91023 words/s
2015-03-09 23:16:18,467: INFO: PROGRESS: at 0.32% words, alpha 0.02494, 85960 words/s
2015-03-09 23:16:19,547: INFO: PROGRESS: at 0.33% words, alpha 0.02493, 89140 words/s
2015-03-09 23:16:23,500: INFO: PROGRESS: at 0.36% words, alpha 0.02493, 92026 words/s
2015-03-09 23:16:29,738: INFO: PROGRESS: at 0.37% words, alpha 0.02491, 88180 words/s
2015-03-09 23:16:32,000: INFO: PROGRESS: at 0.40% words, alpha 0.02492, 92734 words/s
2015-03-09 23:16:34,392: INFO: PROGRESS: at 0.42% words, alpha 0.02491, 93300 words/s
2015-03-09 23:16:41,018: INFO: PROGRESS: at 0.43% words, alpha 0.02490, 89727 words/s
.......
2015-03-10 05:03:31,849: INFO: PROGRESS: at 99.20% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:32,901: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:34,296: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:35,635: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95349 words/s
2015-03-10 05:03:36,730: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:37,489: INFO: reached the end of input; waiting to finish 8 outstanding jobs
2015-03-10 05:03:37,908: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:39,028: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:40,127: INFO: PROGRESS: at 99.24% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:40,910: INFO: training on 1994415728 words took 20916.4s, 95352 words/s
2015-03-10 05:03:41,058: INFO: saving Word2Vec object under wiki.en.text.model, separately None
2015-03-10 05:03:41,209: INFO: not storing attribute syn0norm
2015-03-10 05:03:41,209: INFO: storing numpy array 'syn0' to wiki.en.text.model.syn0.npy
2015-03-10 05:04:35,199: INFO: storing numpy array 'syn1' to wiki.en.text.model.syn1.npy
2015-03-10 05:11:25,400: INFO: storing 1969354x400 projection weights into wiki.en.text.vector

After another 7 hours, we get the word2vec model in text format, wiki.en.text.vector:

1969354 400
the 0.129255 0.015725 0.049174 -0.016438 -0.018912 0.032752 0.079885 0.033669 -0.077722 -0.025709 0.012775 0.044153 0.134307 0.070499 -0.002243 0.105198 -0.016832 -0.028631 -0.124312 -0.123064 -0.116838 0.051181 -0.096058 -0.049734 0.017380 -0.101221 0.058945 0.013669 -0.012755 0.061053 0.061813 0.083655 -0.069382 -0.069868 0.066529 -0.037156 -0.072935 -0.009470 0.037412 -0.004406 0.047011 0.005033 -0.066270 -0.031815 0.023160 -0.080117 0.172918 0.065486 -0.072161 0.062875 0.019939 -0.048380 0.198152 -0.098525 0.023434 0.079439 0.045150 -0.079479 -0.051441 -0.021556 -0.024981 -0.045291 0.040284 -0.082500 0.014618 -0.071998 0.031887 0.043916 0.115783 -0.174898 0.086603 -0.023124 0.007293 -0.066576 -0.164817 -0.081223 0.058412 0.000132 0.064160 0.055848 0.029776 -0.103420 -0.007541 -0.031742 0.082533 -0.061760 -0.038961 0.001754 -0.023977 0.069616 0.095920 0.017136 0.067126 -0.111310 0.053632 0.017633 -0.003875 -0.005236 0.063151 0.039729 -0.039158 0.001415 0.021754 -0.012540 0.015070 -0.062636 -0.013605 -0.031770 0.005296 -0.078119 -0.069303 -0.080634 -0.058377 0.024398 -0.028173 0.026353 0.088662 0.018755 -0.113538 0.055538 -0.086012 -0.027708 -0.028788 0.017759 0.029293 0.047674 -0.106734 -0.134380 0.048605 -0.089583 0.029426 0.030552 0.141916 -0.022653 0.017204 -0.036059 0.061045 -0.000077 -0.076579 0.066747 0.060884 -0.072817 -0.051195 0.017663 0.043462 0.027486 -0.040694 0.025904 -0.075665 -0.000057 -0.076601 0.006704 -0.078985 -0.027770 -0.038087 0.097482 -0.001861 0.003741 -0.010897 0.042828 -0.037804 0.041546 -0.018394 -0.092459 0.010917 -0.004262 -0.113903 -0.037155 0.066674 0.096078 -0.114286 0.027908 -0.003139 -0.007529 -0.076928 0.025825 -0.090934 -0.013763 -0.057434 0.071827 -0.031783 -0.052096 0.107292 0.001864 -0.020808 0.043721 -0.024951 -0.046789 0.092858 0.037771 -0.006570 0.018282 -0.013571 -0.069215 0.019530 -0.080015 -0.078925 0.003094 0.044550 -0.046577 0.004945 -0.010885 -0.098681 0.044861 0.001618 -0.077582 -0.013834 0.024985 0.008541 -0.011861 0.023718 -0.018038 0.004162 -0.005827 -0.036836 0.081241 -0.028473 0.043937 0.005622 -0.004714 -0.029995 0.002236 -0.044635 -0.100051 0.006926 0.012636 -0.132891 -0.097755 -0.118586 0.038355 -0.034691 0.027983 0.074292 0.075199 0.033331 0.067474 -0.023996 0.024614 -0.039520 -0.110454 0.046004 -0.047849 0.023945 -0.022695 -0.053563 0.035277 0.011309 0.044326 0.026382 0.043251 0.004535 0.112228 0.022841 -0.068083 -0.122575 -0.053305 -0.005031 -0.078522 -0.044147 0.083576 0.005531 -0.063187 -0.032841 -0.067989 0.111359 0.125724 0.074154 0.040301 0.082240 0.015494 -0.066648 0.091087 0.095067 -0.059386 0.003256 -0.006734 -0.058248 0.020567 -0.006784 -0.017885 0.146956 -0.014679 -0.019453 -0.009875 -0.031508 0.002070 -0.002830 0.060321 0.056237 -0.080740 0.017465 0.016851 -0.067723 -0.061582 0.028104 0.067970 -0.024162 0.027407 0.075006 0.084483 -0.011534 0.129151 -0.072387 0.083424 -0.009501 0.041553 0.016603 0.002965 -0.027677 -0.110295 0.033986 0.028290 0.049621 0.001125 -0.018187 -0.001404 -0.024074 0.025322 -0.023594 -0.076071 0.107616 0.091381 -0.116943 0.109416 -0.045990 0.024346 0.152548 -0.010692 0.120887 -0.012670 -0.044978 -0.050880 -0.012535 -0.080475 0.036055 -0.050770 0.040417 -0.030957 -0.013680 0.001236 0.010180 -0.040136 -0.118249 0.017540 0.107725 -0.118492 -0.032438 -0.009072 -0.081345 -0.022384 0.045453 -0.008754 -0.098392 -0.113199 0.023589 0.017172 0.108523 -0.029611 0.041029 0.005958 0.010155 -0.036815 0.073110 -0.048424 -0.029022 -0.016711 -0.126587 0.045923 0.018589 0.113195 -0.002896 -0.051350 -0.007355 
0.012278 0.093481 0.093676 -0.145230 -0.068279 -0.068407 0.008837 -0.012186 -0.136079 0.087961 0.041402 -0.058727 0.003030 0.008455 -0.062826 -0.139834 -0.014068 -0.115521 -0.117215 0.093502 0.026607 0.095726 -0.016339 0.033879 -0.022889 0.023565 0.028705

In IPython we test it like this; note that wiki.en.text.vector is about 7 GB, so loading it takes a long time:

In [2]: import gensim
 
In [3]: model = gensim.models.Word2Vec.load_word2vec_format("wiki.en.text.vector", binary=False)
 
In [4]: model.most_similar("queen")
Out[4]: 
[(u'princess', 0.5760838389396667),
 (u'hyoui', 0.5671186447143555),
 (u'janggyung', 0.5598698854446411),
 (u'king', 0.5556215047836304),
 (u'dollallolla', 0.5540223121643066),
 (u'loranella', 0.5522741079330444),
 (u'ramphaiphanni', 0.5310937166213989),
 (u'jeheon', 0.5298476219177246),
 (u'soheon', 0.5243583917617798),
 (u'coronation', 0.5217245221138)]
 
In [5]: model.most_similar("man")
Out[5]: 
[(u'woman', 0.7120707035064697),
 (u'girl', 0.58659827709198),
 (u'handsome', 0.5637181997299194),
 (u'boy', 0.5425317287445068),
 (u'villager', 0.5084836483001709),
 (u'mustachioed', 0.49287813901901245),
 (u'mcgucket', 0.48355430364608765),
 (u'spider', 0.4804879426956177),
 (u'policeman', 0.4780033826828003),
 (u'stranger', 0.4750771224498749)]
 
In [6]: model.most_similar("woman")
Out[6]: 
[(u'man', 0.7120705842971802),
 (u'girl', 0.6736541986465454),
 (u'prostitute', 0.5765659809112549),
 (u'divorcee', 0.5429972410202026),
 (u'person', 0.5276163816452026),
 (u'schoolgirl', 0.5102938413619995),
 (u'housewife', 0.48748138546943665),
 (u'lover', 0.4858251214027405),
 (u'handsome', 0.4773051142692566),
 (u'boy', 0.47445783019065857)]
 
In [8]: model.similarity("woman", "man")
Out[8]: 0.71207063453821218
 
In [10]: model.doesnt_match("breakfast cereal dinner lunch".split())
Out[10]: 'cereal'
 
In [11]: model.similarity("woman", "girl")
Out[11]: 0.67365416785207421
 
In [13]: model.most_similar("frog")
Out[13]: 
[(u'toad', 0.6868536472320557),
 (u'barycragus', 0.6607867479324341),
 (u'grylio', 0.626731276512146),
 (u'heckscheri', 0.6208407878875732),
 (u'clamitans', 0.6150864362716675),
 (u'coplandi', 0.612680196762085),
 (u'pseudacris', 0.6108512878417969),
 (u'litoria', 0.6084023714065552),
 (u'raniformis', 0.6044802665710449),
 (u'watjulumensis', 0.6043726205825806)]
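
As an extra check (my addition, not part of the original session), the classic analogy query goes through the same most_similar API of the loaded model:

# king - man + woman should land near "queen"; topn limits the output.
print(model.most_similar(positive=["woman", "king"], negative=["man"], topn=5))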

Everything is OK, but when we load the model saved in gensim's own (numpy-backed) format, we still hit the "RuntimeWarning: invalid value encountered in divide" problem:

In [1]: import gensim 
 
In [2]: model = gensim.models.Word2Vec.load("wiki.en.text.model")
 
In [3]: model.most_similar("man")
... RuntimeWarning: invalid value encountered in divide
  self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
 
Out[3]: 
[(u'ahsns', nan),
 (u'ny\xedl', nan),
 (u'indradeo', nan),
 (u'jaimovich', nan),
 (u'addlepate', nan),
 (u'jagello', nan),
 (u'festenburg', nan),
 (u'picatic', nan),
 (u'tolosanum', nan),
 (u'mithoo', nan)]
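
One way to narrow the problem down (my addition, a diagnostic sketch assuming the gensim 0.10.x attributes syn0 and index2word) is to check which rows of the weight matrix are not finite:

# Count and list vocabulary entries whose vectors contain nan/inf values.
import numpy as np
import gensim

model = gensim.models.Word2Vec.load("wiki.en.text.model")
bad_rows = np.where(~np.isfinite(model.syn0).all(axis=1))[0]
print(len(bad_rows), "of", model.syn0.shape[0], "vectors contain nan/inf")
print([model.index2word[i] for i in bad_rows[:10]])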

If you have read this far and know how to resolve the problem I met, please point it out for me. Thanks a lot.

Posted by TextMiner


Comments

  1. Pingback: Applying GloVe Vectors using gensim word2vec | Nam Khanh Tran

  2. Hello,
    I am new to gensim and can't install or use it. I want to do semantic clustering of text data, so I need doc2vec and then to extract the semantic words of documents. I have read everything on the gensim website, but I still don't know how to use it and I'm confused.
    Please recommend a source that explains it clearly, step by step.

    • You can use the Anaconda data science distribution and install gensim inside it. Another way is to run the pip install gensim command in your terminal to install the gensim package.

  3. Instead of using the model in text format (which is very slow to load) it is better to convert the model to binary:

    model = word2vec.Word2Vec.load_word2vec_format('wiki.en.text.vector', binary=False)
    model.save_word2vec_format('wiki.en.bin.vector', binary=True)

    Best Regards

  4. Hi,
    I get the error "print globals()['__doc__'] / locals()
    TypeError: unsupported operand type(s) for /: 'NoneType' and 'dict'".
    Please help me fix it.
    Thanks

  5. Could you tell me how to select the right number of iterations? Or how do I know how many iterations are needed for the word vectors to stabilize? Many thanks.

  6. Hi,

    Is there a way to search for articles in the Wikipedia XML dump based on their categories, extract only the articles that belong to particular categories, and convert them to plain text? Thanks

      • #TextMiner @TextMiner
        plz help

        File "C:\Program Files\Python36-32\lib\site-packages\gensim\models\base_any2vec.py", line 601, in _check_training_sanity
        raise RuntimeError("you must first build vocabulary before training the model")
        RuntimeError: you must first build vocabulary before training the model

  7. When I try to create a .text file from the Farsi Wikipedia it throws this exception:

    Process InputQueue-8:
    Traceback (most recent call last):
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
    File "/Library/Python/2.7/site-packages/gensim-0.13.2-py2.7-macosx-10.11-intel.egg/gensim/utils.py", line 822, in run
    wrapped_chunk = [list(chunk)]
    File "/Library/Python/2.7/site-packages/gensim-0.13.2-py2.7-macosx-10.11-intel.egg/gensim/corpora/wikicorpus.py", line 298, in
    texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
    File "/Library/Python/2.7/site-packages/gensim-0.13.2-py2.7-macosx-10.11-intel.egg/gensim/corpora/wikicorpus.py", line 212, in extract_pages
    for elem in elems:
    File "/Library/Python/2.7/site-packages/gensim-0.13.2-py2.7-macosx-10.11-intel.egg/gensim/corpora/wikicorpus.py", line 197, in
    elems = (elem for _, elem in iterparse(f, events=("end",)))
    File "", line 107, in next
    ParseError: no element found: line 44, column 0

    Do you know what the issue is?

  8. In [1]: import gensim
    C:\ProgramData\Anaconda2\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
    warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

    In [2]: model = gensim.models.word2vec.Word2Vec.load("F:\wikimedia\wiki.en.text.vector")
    ---------------------------------------------------------------------------
    UnpicklingError Traceback (most recent call last)
    in ()
    ----> 1 model = gensim.models.word2vec.Word2Vec.load("F:\wikimedia\wiki.en.text.vector")

    C:\ProgramData\Anaconda2\lib\site-packages\gensim\models\word2vec.pyc in load(cls, *args, **kwargs)
    1408 @classmethod
    1409 def load(cls, *args, **kwargs):
    -> 1410 model = super(Word2Vec, cls).load(*args, **kwargs)
    1411 # update older models
    1412 if hasattr(model, 'table'):

    C:\ProgramData\Anaconda2\lib\site-packages\gensim\utils.pyc in load(cls, fname, mmap)
    269 compress, subname = SaveLoad._adapt_by_suffix(fname)
    270
    --> 271 obj = unpickle(fname)
    272 obj._load_specials(fname, mmap, compress, subname)
    273 logger.info("loaded %s", fname)

    C:\ProgramData\Anaconda2\lib\site-packages\gensim\utils.pyc in unpickle(fname)
    933 return _pickle.load(f, encoding='latin1')
    934 else:
    --> 935 return _pickle.loads(f.read())
    936
    937

    UnpicklingError: unpickling stack underflow
    Please tell me how to solve this problem?

  9. Hey, first of all thank you for the tutorial.

    I, as a non-programmer, already have problems at the beginning:

    What exactly do I need to do to "Execute 'python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text'" (under the first picture)? Do I just paste this into the Windows command line?

  10. Hi, I am using Anaconda 3. I am trying to run this program in Anaconda's internal terminal, but every time I try to run the code it says there is no file or directory named process_wiki.py, even though I am running it from the same directory where the program is stored.

    • Sorry, that was a silly mistake...
      this is the current problem...

      Traceback (most recent call last):
      File "process_wiki.py", line 30, in
      output.write(b' '.join(text).decode('utf-8') + '\n')
      TypeError: sequence item 0: expected a bytes-like object, str found

          • By adding this line I think I solved that problem: text = [x.encode('utf-8') for x in text]

            for text in wiki.get_texts():
                text = [x.encode('utf-8') for x in text]
                if six.PY3:
                    output.write(b' '.join(text).decode('utf-8') + '\n')

  11. Hello, may I know why you do not share the extracted model?
    I spent one day extracting the model, but at the end I encountered an error. It would be very helpful for everyone if you shared your extracted model. Many thanks.

  12. Hi!

    I need to process the text, so I was using your process_wiki.py code. But it is not working and I can see no logging information even after several hours. Any suggestions?

  13. Thanks for the great sharing.

    Some experience to share from when I ran the code:
    1. Encoding method: on my machine, I needed to change line 1623 in word2vec.py (in class LineSentence) to `line = utils.to_unicode(line, errors='ignore').split()`, otherwise it kept complaining about a `utf-8` encoding problem.

    2. I manually split all the text into sentences of no more than 990 words, following the `MAX_SENTENCE_LEN = 1000` definition, and it works with `model.most_similar("queen")`.

    Hope it can provide some help

  14. Hi Radim,

    I think I have the solution to your problem with wikipedia text.
    I think what you are looking for is to decode special characters like french words.
    ```
    >>> from unidecode import unidecode
    >>> unidecode(u'北京')
    'Bei Jing'
    >>> unidecode(u'Škoda')
    'Skoda'
    ```
    Reference:
    https://stackoverflow.com/questions/2365411/python-convert-unicode-to-ascii-without-errors

    Let me know if I can provide a pr at the gensim repo

  15. Hi
    Thanks for your awesome tutorial.
    I guess to consider bigrams and trigrams, we should change this part of code:

    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    to this:

    common_terms = ["of", "with", "without", "and", "or", "the", "a"]
    bigram_transformer = gensim.models.Phrases(LineSentence(inp), common_terms=common_terms)
    trigram_transformer = gensim.models.Phrases(bigram_transformer[LineSentence(inp)], common_terms=common_terms)

    model = Word2Vec(trigram_transformer[LineSentence(inp)], size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    I will try and will let you know if it works.

  16. I was trying to process the Wikipedia corpus...

    Tried both ways and variations of them to counter the errors shown, but nothing is working...

    The errors are the following:

    Traceback (most recent call last):
    File "process_wiki.py", line 30, in
    output.write(b' '.join(text).decode('utf-8') + '\n')
    TypeError: sequence item 0: expected bytes, bytearray, or an object with the buffer interface, str found

    Traceback (most recent call last):
    File "process_wiki.py", line 30, in
    output.write(str(b''.decode('utf-8')).join(text) + '\n')
    File "D:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 1266-1267: character maps to

    Traceback (most recent call last):
    File "process_wiki.py", line 32, in
    output.write(space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
    File "process_wiki.py", line 32, in
    output.write(space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
    AttributeError: 'str' object has no attribute 'decode'

    Traceback (most recent call last):
    File "process_wiki.py", line 32, in
    output.write(space.join(map(lambda x:x, text)) + '\n')
    File "D:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 1470-1471: character maps to

  17. File "C:\Program Files\Python36-32\lib\site-packages\gensim\models\base_any2vec.py", line 601, in _check_training_sanity
    raise RuntimeError("you must first build vocabulary before training the model")
    RuntimeError: you must first build vocabulary before training the model

  18. What is the difference between the models saved to outp1 and outp2 here:

    model.save(outp1)
    model.save_word2vec_format(outp2, binary=False)

  19. How do I create my own corpus and use it for Doc2Vec?
    The train file and test file are available on my local drive.
    How do I access these two files using gensim?

    • After downloading the wiki dump file, you can run it from anywhere, as long as the process_wiki.py script and the wiki bz2 data are in the same directory: python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text

  20. Hi, first of all thank you for this wonderful initiative. I am trying to run the script "train_word2vec_model.py" (the modified one); after 20 minutes of running it gives me the error: MemoryError: unable to allocate array with shape and data type float32.
