Exploiting Wikipedia Word Similarity by Word2Vec

We have written “Training Word2Vec Model on English Wikipedia by Gensim” before, and it got a lot of attention. Recently I reviewed Word2Vec-related materials again and tested a new method to preprocess the English Wikipedia data and train a Word2Vec model on it with gensim; the model is then used to compute word similarity. If you are new to word2vec, I recommend reading “Getting started with Word2Vec” first.

This time we still use the latest English Wikipedia dump XML data from https://dumps.wikimedia.org/enwiki/latest/ . Download “enwiki-latest-pages-articles.xml.bz2”; the data used here was dumped on April 4, 2017.

Different from “Training Word2Vec Model on English Wikipedia by Gensim”, here we use WikiExtractor to preprocess the Wikipedia data:

“WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.
The tool is written in Python and requires Python 2.7 or Python 3.3+ but no additional library.”

Install WikiExtractor:

git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor/
sudo python setup.py install

Using WikiExtractor:
WikiExtractor.py -o enwiki enwiki-latest-pages-articles.xml.bz2

......
INFO: 53665431  Pampapaul
INFO: 53665433  Charles Frederick Zimpel
INFO: Finished 11-process extraction of 5375019 articles in 8363.5s (642.7 art/s)

It took about 2.5 hours to process the dumped Wikipedia data, and the extracted text is split into many files (wiki_00, wiki_01, ...) under subdirectories.
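To get a quick feel for the layout, here is a small sketch that prints a few of the extracted files (the “enwiki” path matches the -o option above; WikiExtractor names its output files wiki_00, wiki_01, ... under subdirectories such as AA, AB):

# print up to three extracted files from each output subdirectory
import os

for root, dirs, files in os.walk('enwiki'):
    for name in sorted(files)[:3]:
        print os.path.join(root, name)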

The content of wiki_00 looks like this:

<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism
 
Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.
...
Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evaluated as unfeasible or utopian by its critics.
 
 
</doc>
<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">
Autism
 
Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress. The diagnostic criteria require that symptoms become apparent in early childhood, typically before age three.
...
</doc>
...

I have written a Python script, “train_word2vec_with_gensim.py”, which follows the gensim word2vec “memory-friendly iterator” approach:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017 @ Yu Zhen
 
import gensim
import logging
import multiprocessing
import os
import re
import sys
 
from pattern.en import tokenize
from time import time
 
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)
 
 
def cleanhtml(raw_html):
    # strip the <doc ...> and </doc> markup that WikiExtractor leaves behind
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', raw_html)
    return cleantext
 
 
class MySentences(object):
    """Memory-friendly iterator: yields one tokenized sentence (one line of
    the extracted text) at a time, so the corpus never sits in memory."""

    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        for root, dirs, files in os.walk(self.dirname):
            for filename in files:
                file_path = os.path.join(root, filename)
                for line in open(file_path):
                    sline = line.strip()
                    if sline == "":
                        continue
                    rline = cleanhtml(sline)
                    tokenized_line = ' '.join(tokenize(rline))
                    # keep only purely alphabetic tokens, lowercased
                    is_alpha_word_line = [word for word in
                                          tokenized_line.lower().split()
                                          if word.isalpha()]
                    yield is_alpha_word_line
 
 
if __name__ == '__main__':
    if len(sys.argv) != 2:
        print "Usage: python train_word2vec_with_gensim.py data_path"
        exit()
    data_path = sys.argv[1]
    begin = time()
 
    sentences = MySentences(data_path)
    # 200-dimensional vectors, 10-word context window, drop words seen
    # fewer than 10 times, and train with one worker per CPU core
    model = gensim.models.Word2Vec(sentences,
                                   size=200,
                                   window=10,
                                   min_count=10,
                                   workers=multiprocessing.cpu_count())
    # make sure the output directory exists before saving
    if not os.path.exists("data/model"):
        os.makedirs("data/model")
    model.save("data/model/word2vec_gensim")
    model.wv.save_word2vec_format("data/model/word2vec_org",
                                  "data/model/vocabulary",
                                  binary=False)
 
    end = time()
    print "Total processing time: %d seconds" % (end - begin)

Note that word tokenization here uses the pattern English tokenize module; you can of course use the NLTK word tokenizer or any other English word tokenizer instead.
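For example, here is a minimal sketch of the same per-line preprocessing using NLTK instead of pattern.en (a hypothetical drop-in, assuming nltk and its 'punkt' tokenizer data are installed):

# hypothetical NLTK-based replacement for the pattern.en tokenize step above
import re

from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')


def clean_and_tokenize(raw_line):
    # same markup cleanup as cleanhtml() in the script
    text = re.sub('<.*?>', ' ', raw_line)
    # keep only purely alphabetic tokens, lowercased, as MySentences does
    return [w for w in word_tokenize(text.lower()) if w.isalpha()]


print clean_and_tokenize('<doc id="12">Anarchism is a political philosophy.</doc>')
# ['anarchism', 'is', 'a', 'political', 'philosophy']

Now we can use this script to train a word2vec model on the full English Wikipedia data: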

python train_word2vec_with_gensim.py enwiki

2017-04-22 14:31:04,703 : INFO : collecting all words and their counts
2017-04-22 14:31:04,704 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-04-22 14:31:06,442 : INFO : PROGRESS: at sentence #10000, processed 480546 words, keeping 33925 word types
2017-04-22 14:31:08,104 : INFO : PROGRESS: at sentence #20000, processed 983240 words, keeping 51765 word types
2017-04-22 14:31:09,685 : INFO : PROGRESS: at sentence #30000, processed 1455218 words, keeping 64982 word types
2017-04-22 14:31:11,349 : INFO : PROGRESS: at sentence #40000, processed 1957479 words, keeping 76112 word types
......
2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-04-23 02:50:59,854 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-04-23 02:50:59,854 : INFO : training on 8903084745 raw words (6742578791 effective words) took 37805.2s, 178351 effective words/s
2017-04-23 02:50:59,855 : INFO : saving Word2Vec object under data/model/word2vec_gensim, separately None
2017-04-23 02:50:59,855 : INFO : not storing attribute syn0norm
2017-04-23 02:50:59,855 : INFO : storing np array 'syn0' to data/model/word2vec_gensim.wv.syn0.npy
2017-04-23 02:51:00,241 : INFO : storing np array 'syn1neg' to data/model/word2vec_gensim.syn1neg.npy
2017-04-23 02:51:00,574 : INFO : not storing attribute cum_table
2017-04-23 02:51:13,886 : INFO : saved data/model/word2vec_gensim
2017-04-23 02:51:13,886 : INFO : storing vocabulary in data/model/vocabulary
2017-04-23 02:51:17,480 : INFO : storing 868777x200 projection weights into data/model/word2vec_org
Total processing time: 44476 seconds

This step took more than 12 hours (44,476 seconds), and the trained model and vocabulary are saved in the data/model directory. Let's test them in IPython:

textminer@textminer:/opt/wiki/data$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
Type "copyright", "credits" or "license" for more information.
 
IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
In [1]: from gensim.models import Word2Vec
 
In [2]: word2vec_model_4_en_wiki = Word2Vec.load('data/model/word2vec_gensim')

Word Similarity:

queen:

In [3]: word2vec_model_4_en_wiki.most_similar('queen')
Out[3]: 
[('princess', 0.7473609447479248),
 ('king', 0.6810176372528076),
 ('empress', 0.6455642580986023),
 ('duchess', 0.6144236326217651),
 ('prince', 0.6126301288604736),
 ('monarch', 0.6115533709526062),
 ('coronation', 0.603646457195282),
 ('isabella', 0.5997427701950073),
 ('regent', 0.5933823585510254),
 ('consort', 0.5885361433029175)]

king:

In [4]: word2vec_model_4_en_wiki.most_similar('king')
Out[4]: 
[('prince', 0.7479317784309387),
 ('queen', 0.6810176372528076),
 ('throne', 0.6729530096054077),
 ('monarch', 0.6692918539047241),
 ('ruler', 0.6659754514694214),
 ('emperor', 0.6649527549743652),
 ('ethelred', 0.661082923412323),
 ('duke', 0.6508696675300598),
 ('conqueror', 0.6439772844314575),
 ('regent', 0.6397967338562012)]

frog:

In [5]: word2vec_model_4_en_wiki.most_similar('frog')
Out[5]: 
[('toad', 0.7864628434181213),
 ('frogs', 0.7199467420578003),
 ('treefrog', 0.7113776206970215),
 ('snake', 0.6911556720733643),
 ('salamander', 0.6829583048820496),
 ('lizard', 0.672086238861084),
 ('toads', 0.6502566337585449),
 ('keelback', 0.6469823718070984),
 ('toadlet', 0.6440475583076477),
 ('limnodynastes', 0.6439393758773804)]

miner:

In [6]: word2vec_model_4_en_wiki.most_similar('miner')
Out[6]: 
[('prospector', 0.6998875141143799),
 ('steelworker', 0.644964337348938),
 ('coalminer', 0.6333279609680176),
 ('miners', 0.6274839639663696),
 ('farmer', 0.6165066957473755),
 ('heaver', 0.6061303615570068),
 ('laborer', 0.600059986114502),
 ('labourer', 0.5963433980941772),
 ('shoemaker', 0.5935906171798706),
 ('farmhand', 0.5898569226264954)]

mining:

In [7]: word2vec_model_4_en_wiki.most_similar('mining')
Out[7]: 
[('mines', 0.7691237926483154),
 ('prospecting', 0.7258275151252747),
 ('smelting', 0.721427321434021),
 ('bauxite', 0.6960649490356445),
 ('fluorspar', 0.6956771016120911),
 ('kennecott', 0.6944512128829956),
 ('mine', 0.6942962408065796),
 ('coal', 0.6893351078033447),
 ('ore', 0.6886317729949951),
 ('quarrying', 0.6798128485679626)]

text:

In [8]: word2vec_model_4_en_wiki.most_similar('text')
Out[8]: 
[('texts', 0.7537323236465454),
 ('annotations', 0.7100197672843933),
 ('footnotes', 0.675033688545227),
 ('quotations', 0.6624652743339539),
 ('manuscript', 0.6616512537002563),
 ('quotation', 0.6500091552734375),
 ('verse', 0.647144615650177),
 ('translation', 0.6458534002304077),
 ('postscript', 0.6436001658439636),
 ('scripture', 0.6365185976028442)]

In [9]: word2vec_model_4_en_wiki.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
Out[9]: [('queen', 0.7752252817153931)]
 
In [10]: word2vec_model_4_en_wiki.similarity('man', 'woman')
Out[10]: 0.72430799548282099
 
In [11]: word2vec_model_4_en_wiki.doesnt_match("breakfast cereal dinner lunch".split())
Out[11]: 'cereal'
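Because the vectors were also saved in the plain-text word2vec format, you can reload just the word vectors without the full training state. A minimal sketch, assuming gensim's KeyedVectors API (gensim 1.0+); note that similarity() above is simply the cosine of the two word vectors, which we can recompute by hand:

# reload only the word vectors from the plain-text file saved by the script
from gensim.models import KeyedVectors

import numpy

wv = KeyedVectors.load_word2vec_format('data/model/word2vec_org', binary=False)
print wv.most_similar('queen', topn=3)

# similarity() is just cosine similarity between the two vectors
v1, v2 = wv['man'], wv['woman']
print numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))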

I have launched a GitHub project, “Wikipedia_Word2vec”, which hosts the code for this article and for “Training Word2Vec Model on English Wikipedia by Gensim”. Just enjoy it.

Posted by TextMiner

