Exploiting Wikipedia Word Similarity by Word2Vec
We previously wrote “Training Word2Vec Model on English Wikipedia by Gensim”, which received a lot of attention. Recently I reviewed Word2Vec-related materials again and tested a new way to process the English Wikipedia data and train a Word2Vec model on it with gensim; the model is then used to compute word similarity. If you are new to word2vec, I recommend reading “Getting started with Word2Vec” first.
This time we again use the latest English Wikipedia XML dump data from https://dumps.wikimedia.org/enwiki/latest/ and download “enwiki-latest-pages-articles.xml.bz2”; the data used here was dumped on April 4, 2017.
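For example, the dump can be fetched from the command line (the file name is as above; note that the “latest” link is replaced as new dumps are published):

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2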
Unlike in “Training Word2Vec Model on English Wikipedia by Gensim”, this time we use WikiExtractor to preprocess the Wikipedia data:
“WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.
The tool is written in Python and requires Python 2.7 or Python 3.3+ but no additional library.”
Install WikiExtractor:
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor/
sudo python setup.py install
Run WikiExtractor on the dump:
WikiExtractor.py -o enwiki enwiki-latest-pages-articles.xml.bz2
......
INFO: 53665431  Pampapaul
INFO: 53665433  Charles Frederick Zimpel
INFO: Finished 11-process extraction of 5375019 articles in 8363.5s (642.7 art/s)
It took about 2.5 hours to process the dumped Wikipedia data, and the processed wiki text is split into many files under subdirectories:
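An illustrative listing (the exact directory and file names depend on the dump size and WikiExtractor's options; by default the output files are grouped under AA, AB, ... subdirectories):

enwiki/AA/wiki_00
enwiki/AA/wiki_01
......
enwiki/AA/wiki_99
enwiki/AB/wiki_00
......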
The content of wiki_00 looks like this:
<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism"> Anarchism Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful. ... Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evaluated as unfeasible or utopian by its critics. </doc> <doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism"> Autism Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress. The diagnostic criteria require that symptoms become apparent in early childhood, typically before age three. ... </doc> ... |
I have written a Python script, “train_word2vec_with_gensim.py”, which follows the gensim word2vec “memory-friendly iterator” approach:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017 @ Yu Zhen

import gensim
import logging
import multiprocessing
import os
import re
import sys

from pattern.en import tokenize
from time import time

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)


def cleanhtml(raw_html):
    # Strip anything that looks like an HTML/XML tag, e.g. the <doc> wrappers.
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', raw_html)
    return cleantext


class MySentences(object):
    """Memory-friendly iterator: streams tokenized sentences from every file
    under dirname, one line at a time, so the corpus never has to fit in RAM."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for root, dirs, files in os.walk(self.dirname):
            for filename in files:
                file_path = root + '/' + filename
                for line in open(file_path):
                    sline = line.strip()
                    if sline == "":
                        continue
                    rline = cleanhtml(sline)
                    tokenized_line = ' '.join(tokenize(rline))
                    # Keep only purely alphabetic, lowercased tokens.
                    is_alpha_word_line = [word for word in
                                          tokenized_line.lower().split()
                                          if word.isalpha()]
                    yield is_alpha_word_line


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print "Please use python train_word2vec_with_gensim.py data_path"
        exit()
    data_path = sys.argv[1]
    begin = time()

    sentences = MySentences(data_path)
    model = gensim.models.Word2Vec(sentences,
                                   size=200,
                                   window=10,
                                   min_count=10,
                                   workers=multiprocessing.cpu_count())
    model.save("data/model/word2vec_gensim")
    model.wv.save_word2vec_format("data/model/word2vec_org",
                                  "data/model/vocabulary",
                                  binary=False)

    end = time()
    print "Total processing time: %d seconds" % (end - begin)
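The script relies on pattern's English tokenizer, which at the time only supported Python 2. If pattern is unavailable, a drop-in replacement can be sketched with NLTK (an assumption: nltk is installed and the “punkt” sentence model has been downloaded); like pattern.en.tokenize, it returns a list of sentence strings with tokens separated by spaces:

from nltk.tokenize import sent_tokenize, word_tokenize

def tokenize(text):
    # Mimic pattern.en.tokenize: one space-joined token string per sentence.
    return [' '.join(word_tokenize(sentence)) for sentence in sent_tokenize(text)]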
Note that tokenization uses pattern's English tokenize module; you can of course substitute NLTK's word tokenizer (as sketched above) or any other English tokenizer. Now we can use this script to train a word2vec model on the full English Wikipedia data:
python train_word2vec_with_gensim.py enwiki
2017-04-22 14:31:04,703 : INFO : collecting all words and their counts
2017-04-22 14:31:04,704 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-04-22 14:31:06,442 : INFO : PROGRESS: at sentence #10000, processed 480546 words, keeping 33925 word types
2017-04-22 14:31:08,104 : INFO : PROGRESS: at sentence #20000, processed 983240 words, keeping 51765 word types
2017-04-22 14:31:09,685 : INFO : PROGRESS: at sentence #30000, processed 1455218 words, keeping 64982 word types
2017-04-22 14:31:11,349 : INFO : PROGRESS: at sentence #40000, processed 1957479 words, keeping 76112 word types
......
2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-04-23 02:50:59,854 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-04-23 02:50:59,854 : INFO : training on 8903084745 raw words (6742578791 effective words) took 37805.2s, 178351 effective words/s
2017-04-23 02:50:59,855 : INFO : saving Word2Vec object under data/model/word2vec_gensim, separately None
2017-04-23 02:50:59,855 : INFO : not storing attribute syn0norm
2017-04-23 02:50:59,855 : INFO : storing np array 'syn0' to data/model/word2vec_gensim.wv.syn0.npy
2017-04-23 02:51:00,241 : INFO : storing np array 'syn1neg' to data/model/word2vec_gensim.syn1neg.npy
2017-04-23 02:51:00,574 : INFO : not storing attribute cum_table
2017-04-23 02:51:13,886 : INFO : saved data/model/word2vec_gensim
2017-04-23 02:51:13,886 : INFO : storing vocabulary in data/model/vocabulary
2017-04-23 02:51:17,480 : INFO : storing 868777x200 projection weights into data/model/word2vec_org
Total processing time: 44476 seconds
This step takes more than 12 hours, and the trained model and vocabulary are saved under the data/model directory. Let's test them in IPython:
textminer@textminer:/opt/wiki/data$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from gensim.models import Word2Vec

In [2]: word2vec_model_4_en_wiki = Word2Vec.load('data/model/word2vec_gensim')
queen:

In [3]: word2vec_model_4_en_wiki.most_similar('queen')
Out[3]:
[('princess', 0.7473609447479248),
 ('king', 0.6810176372528076),
 ('empress', 0.6455642580986023),
 ('duchess', 0.6144236326217651),
 ('prince', 0.6126301288604736),
 ('monarch', 0.6115533709526062),
 ('coronation', 0.603646457195282),
 ('isabella', 0.5997427701950073),
 ('regent', 0.5933823585510254),
 ('consort', 0.5885361433029175)]
king:
In [4]: word2vec_model_4_en_wiki.most_similar('king')
Out[4]:
[('prince', 0.7479317784309387),
 ('queen', 0.6810176372528076),
 ('throne', 0.6729530096054077),
 ('monarch', 0.6692918539047241),
 ('ruler', 0.6659754514694214),
 ('emperor', 0.6649527549743652),
 ('ethelred', 0.661082923412323),
 ('duke', 0.6508696675300598),
 ('conqueror', 0.6439772844314575),
 ('regent', 0.6397967338562012)]
frog:
In [5]: word2vec_model_4_en_wiki.most_similar('frog')
Out[5]:
[('toad', 0.7864628434181213),
 ('frogs', 0.7199467420578003),
 ('treefrog', 0.7113776206970215),
 ('snake', 0.6911556720733643),
 ('salamander', 0.6829583048820496),
 ('lizard', 0.672086238861084),
 ('toads', 0.6502566337585449),
 ('keelback', 0.6469823718070984),
 ('toadlet', 0.6440475583076477),
 ('limnodynastes', 0.6439393758773804)]
miner:

In [6]: word2vec_model_4_en_wiki.most_similar('miner')
Out[6]:
[('prospector', 0.6998875141143799),
 ('steelworker', 0.644964337348938),
 ('coalminer', 0.6333279609680176),
 ('miners', 0.6274839639663696),
 ('farmer', 0.6165066957473755),
 ('heaver', 0.6061303615570068),
 ('laborer', 0.600059986114502),
 ('labourer', 0.5963433980941772),
 ('shoemaker', 0.5935906171798706),
 ('farmhand', 0.5898569226264954)]
mining:

In [7]: word2vec_model_4_en_wiki.most_similar('mining')
Out[7]:
[('mines', 0.7691237926483154),
 ('prospecting', 0.7258275151252747),
 ('smelting', 0.721427321434021),
 ('bauxite', 0.6960649490356445),
 ('fluorspar', 0.6956771016120911),
 ('kennecott', 0.6944512128829956),
 ('mine', 0.6942962408065796),
 ('coal', 0.6893351078033447),
 ('ore', 0.6886317729949951),
 ('quarrying', 0.6798128485679626)]
text:
In [8]: word2vec_model_4_en_wiki.most_similar('text')
Out[8]:
[('texts', 0.7537323236465454),
 ('annotations', 0.7100197672843933),
 ('footnotes', 0.675033688545227),
 ('quotations', 0.6624652743339539),
 ('manuscript', 0.6616512537002563),
 ('quotation', 0.6500091552734375),
 ('verse', 0.647144615650177),
 ('translation', 0.6458534002304077),
 ('postscript', 0.6436001658439636),
 ('scripture', 0.6365185976028442)]
In [9]: word2vec_model_4_en_wiki.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
Out[9]: [('queen', 0.7752252817153931)]

In [10]: word2vec_model_4_en_wiki.similarity('man', 'woman')
Out[10]: 0.72430799548282099

In [11]: word2vec_model_4_en_wiki.doesnt_match("breakfast cereal dinner lunch".split())
Out[11]: 'cereal'
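Besides Word2Vec.load, the plain-text vectors written by save_word2vec_format can be loaded on their own; a minimal sketch, assuming a gensim version (>= 1.0) that exposes KeyedVectors:

from gensim.models import KeyedVectors

# Load the text-format vectors saved by the training script.
word_vectors = KeyedVectors.load_word2vec_format('data/model/word2vec_org', binary=False)
print(word_vectors.most_similar('queen', topn=3))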
I have launched a GitHub project, “Wikipedia_Word2vec“, which hosts the code for this article and for “Training Word2Vec Model on English Wikipedia by Gensim“. Just enjoy it.
Posted by TextMiner
Amazing and productive work!
Does your code save all the parsed output before doing the training? I have an issue with gensim in Python 2.7: it only runs in slow mode. I tried to port the code to Python 3, using NLTK instead of pattern and changing the print statements, but it doesn't run. So I am thinking of running this in two steps: the parsing in Python 2 and the training in Python 3.
Thanks
Have you installed Cython? If Cython is not installed, training may be slow.
Yeah, I have installed it. I tried everything I found online: installing gensim with pip, uninstalling it again, installing with conda. I don't know why it doesn't work.
May I know how you plotted the result images? They are very cool, I think!
d3.js