Dive Into NLTK, Part X: Play with Word2Vec Models based on NLTK Corpus

This is the tenth article in the series “Dive Into NLTK”. Here is an index of all the articles in the series that have been published to date:

Part I: Getting Started with NLTK
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
Part IX: From Text Classification to Sentiment Analysis
Part X: Play With Word2Vec Models based on NLTK Corpus (This article)

NLTK Corpus

Accessing text corpora in NLTK is very easy. NLTK provides a corpus package (nltk.corpus) to read and manage corpus data. For example, NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, called the Gutenberg Corpus. About Project Gutenberg:

Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works, to “encourage the creation and distribution of eBooks”. It was founded in 1971 by Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of public domain books. The project tries to make these as free as possible, in long-lasting, open formats that can be used on almost any computer. As of 3 October 2015, Project Gutenberg reached 50,000 items in its collection.

We can list the file names of the ebooks in the NLTK Gutenberg Corpus, and load the words and sentences of one of them, like this:

Python 2.7.6 (default, Jun  3 2014, 07:43:23) 
Type "copyright", "credits" or "license" for more information.
 
IPython 3.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: from nltk.corpus import gutenberg
 
In [2]: gutenberg.fileids()
Out[2]: 
[u'austen-emma.txt',
 u'austen-persuasion.txt',
 u'austen-sense.txt',
 u'bible-kjv.txt',
 u'blake-poems.txt',
 u'bryant-stories.txt',
 u'burgess-busterbrown.txt',
 u'carroll-alice.txt',
 u'chesterton-ball.txt',
 u'chesterton-brown.txt',
 u'chesterton-thursday.txt',
 u'edgeworth-parents.txt',
 u'melville-moby_dick.txt',
 u'milton-paradise.txt',
 u'shakespeare-caesar.txt',
 u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt',
 u'whitman-leaves.txt']
 
In [3]: austen_emma_words = gutenberg.words('austen-emma.txt')
 
In [4]: len(austen_emma_words)
Out[4]: 192427
 
In [5]: austen_emma_sents = gutenberg.sents('austen-emma.txt')
 
In [6]: len(austen_emma_sents)
Out[6]: 9111
 
In [7]: austen_emma_sents[0]
Out[7]: [u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', u']']
 
In [8]: austen_emma_sents[5000]
Out[8]: 
[u'I',
 u'know',
 u'the',
 u'danger',
 u'of',
 u'indulging',
 u'such',
 u'speculations',
 u'.']

Word2Vec

Strictly speaking, Word2Vec here refers to the Google Word2Vec project, first proposed by Tomas Mikolov et al. in 2013. For more Word2Vec papers, tutorials, and coding examples, we recommend “Getting started with Word2Vec” by TextProcessing.

Word2Vec in Python

A widely used Python implementation of Word2Vec is the Gensim word2vec module: models.word2vec – Deep learning with word2vec. We have written an article about Word2Vec in Python; if you are new to the gensim word2vec model, read it first: Getting Started with Word2Vec and GloVe in Python.

Bible Word2Vec Model

The Gutenberg Corpus in NLTK includes a file named ‘bible-kjv.txt’, which is the King James Version of the Bible:

The King James Version (KJV), also known as Authorized [sic] Version (AV) or simply King James Bible (KJB), is an English translation of the Christian Bible for the Church of England begun in 1604 and completed in 1611. The books of the King James Version include the 39 books of the Old Testament, an intertestamental section containing 14 books of the Apocrypha, and the 27 books of the New Testament.

Let’s train a Bible Word2Vec model from the bible-kjv.txt corpus with the gensim word2vec module:

In [14]: import logging
 
In [15]: from gensim.models import word2vec
 
In [16]: logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
 
In [17]: bible_kjv_words = gutenberg.words('bible-kjv.txt')
 
In [18]: len(bible_kjv_words)
Out[18]: 1010654
 
In [19]: bible_kjv_sents = gutenberg.sents('bible-kjv.txt')  
 
In [20]: len(bible_kjv_sents)
Out[20]: 30103
 
In [21]: bible_kjv_sents[0]
Out[21]: [u'[', u'The', u'King', u'James', u'Bible', u']']
 
In [22]: bible_kjv_sents[1]
Out[22]: [u'The', u'Old', u'Testament', u'of', u'the', u'King', u'James', u'Bible']
 
In [23]: bible_kjv_sents[2]
Out[23]: [u'The', u'First', u'Book', u'of', u'Moses', u':', u'Called', u'Genesis']
 
In [24]: bible_kjv_sents[3]
Out[24]: 
[u'1',
 u':',
 u'1',
 u'In',
 u'the',
 u'beginning',
 u'God',
 u'created',
 u'the',
 u'heaven',
 u'and',
 u'the',
 u'earth',
 u'.']
 
In [25]: from string import punctuation
 
In [27]: punctuation
Out[27]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
 
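# The words in the NLTK corpus are already tokenized, so we just discard the
# punctuation tokens and lowercase the remaining words.
# Note: `word not in punctuation` is a substring test against the punctuation
# string, so a multi-character token such as '--' is not discarded.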
In [29]: discard_punctuation_and_lowercased_sents = [[word.lower() for word in sent if word not in punctuation] for sent in bible_kjv_sents]
 
In [30]: discard_punctuation_and_lowercased_sents[0]
Out[30]: [u'the', u'king', u'james', u'bible']
 
In [31]: discard_punctuation_and_lowercased_sents[1]
Out[31]: [u'the', u'old', u'testament', u'of', u'the', u'king', u'james', u'bible']
 
In [32]: discard_punctuation_and_lowercased_sents[2]
Out[32]: [u'the', u'first', u'book', u'of', u'moses', u'called', u'genesis']
 
In [33]: discard_punctuation_and_lowercased_sents[3]
Out[33]: 
[u'1',
 u'1',
 u'in',
 u'the',
 u'beginning',
 u'god',
 u'created',
 u'the',
 u'heaven',
 u'and',
 u'the',
 u'earth']
 
In [34]: bible_kjv_word2vec_model = word2vec.Word2Vec(discard_punctuation_and_lowercased_sents, min_count=5, size=200)
2017-03-26 21:05:20,811 : INFO : collecting all words and their counts
2017-03-26 21:05:20,811 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-03-26 21:05:20,972 : INFO : PROGRESS: at sentence #10000, processed 315237 words, keeping 7112 word types
2017-03-26 21:05:21,103 : INFO : PROGRESS: at sentence #20000, processed 572536 words, keeping 10326 word types
2017-03-26 21:05:21,247 : INFO : PROGRESS: at sentence #30000, processed 851126 words, keeping 12738 word types
2017-03-26 21:05:21,249 : INFO : collected 12752 word types from a corpus of 854209 raw words and 30103 sentences
2017-03-26 21:05:21,249 : INFO : Loading a fresh vocabulary
2017-03-26 21:05:21,441 : INFO : min_count=5 retains 5428 unique words (42% of original 12752, drops 7324)
2017-03-26 21:05:21,441 : INFO : min_count=5 leaves 841306 word corpus (98% of original 854209, drops 12903)
2017-03-26 21:05:21,484 : INFO : deleting the raw counts dictionary of 12752 items
2017-03-26 21:05:21,485 : INFO : sample=0.001 downsamples 62 most-common words
2017-03-26 21:05:21,485 : INFO : downsampling leaves estimated 583788 word corpus (69.4% of prior 841306)
2017-03-26 21:05:21,485 : INFO : estimated required memory for 5428 words and 200 dimensions: 11398800 bytes
2017-03-26 21:05:21,520 : INFO : resetting layer weights
2017-03-26 21:05:21,708 : INFO : training model with 3 workers on 5428 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-03-26 21:05:21,708 : INFO : expecting 30103 sentences, matching count from corpus used for vocabulary survey
2017-03-26 21:05:22,721 : INFO : PROGRESS: at 16.10% examples, 474025 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:23,728 : INFO : PROGRESS: at 34.20% examples, 500893 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:24,734 : INFO : PROGRESS: at 49.48% examples, 482782 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:25,742 : INFO : PROGRESS: at 60.97% examples, 442365 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:26,758 : INFO : PROGRESS: at 76.39% examples, 443206 words/s, in_qsize 6, out_qsize 0
2017-03-26 21:05:27,770 : INFO : PROGRESS: at 95.14% examples, 460213 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:28,002 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-03-26 21:05:28,007 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-03-26 21:05:28,013 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-03-26 21:05:28,013 : INFO : training on 4271045 raw words (2918320 effective words) took 6.3s, 463711 effective words/s 
 
In [35]: bible_kjv_word2vec_model.save("bible_word2vec_gensim")
2017-03-26 21:06:03,500 : INFO : saving Word2Vec object under bible_word2vec_gensim, separately None
2017-03-26 21:06:03,501 : INFO : not storing attribute syn0norm
2017-03-26 21:06:03,501 : INFO : not storing attribute cum_table
2017-03-26 21:06:03,646 : INFO : saved bible_word2vec_gensim
 
In [36]: bible_kjv_word2vec_model.wv.save_word2vec_format("bible_word2vec_org", "bible_word2vec_vocabulary")
2017-03-26 21:06:51,136 : INFO : storing vocabulary in bible_word2vec_vocabulary
2017-03-26 21:06:51,186 : INFO : storing 5428x200 projection weights into bible_word2vec_org
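
The figures in the training log above can be cross-checked with a little arithmetic. The sketch below copies its numbers from the log; the epoch count of 5 and the memory formula (roughly 500 bytes of bookkeeping per vocabulary word plus two float32 weight matrices) are assumptions about this gensim version's defaults and internals, chosen because they reproduce the logged values.

```python
# Numbers copied from the training log above.
raw_words = 854209                     # raw words after punctuation removal
kept_words = 841306                    # words left after min_count=5
downsampled = 583788                   # estimated words left after sample=0.001
vocab_total, vocab_kept = 12752, 5428  # word types before / after min_count=5
dims = 200                             # vector size passed to Word2Vec

# Assumption: 5 training epochs (the log does not state this explicitly).
epochs = 5
trained_raw = raw_words * epochs

# Assumption: about 500 bytes per vocabulary entry, plus two float32 matrices
# (word vectors and negative-sampling output weights) of vocab_kept x dims.
estimated_bytes = vocab_kept * 500 + 2 * vocab_kept * dims * 4

print(trained_raw)                                  # 4271045, as logged
print(estimated_bytes)                              # 11398800, as logged
print(int(100.0 * vocab_kept / vocab_total))        # 42 (% of word types kept)
print(int(100.0 * kept_words / raw_words))          # 98 (% of corpus kept)
print(round(100.0 * downsampled / kept_words, 1))   # 69.4 (% after downsampling)
```

In other words, min_count=5 drops more than half of the distinct word types but only about 1.5% of the running text, because the dropped words are rare.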

We have now trained a Bible word2vec model from the NLTK bible-kjv corpus, so we can test it with a few words.

First, “God”:

In [37]: bible_kjv_word2vec_model.most_similar(["god"])
2017-03-26 21:14:27,320 : INFO : precomputing L2-norms of word weight vectors
Out[37]: 
[(u'christ', 0.7863791584968567),
 (u'lord', 0.7807695865631104),
 (u'salvation', 0.772181510925293),
 (u'truth', 0.7689207792282104),
 (u'spirit', 0.7437840700149536),
 (u'faith', 0.7283919453620911),
 (u'glory', 0.7281145453453064),
 (u'mercy', 0.7187720537185669),
 (u'hosts', 0.7179254293441772),
 (u'gospel', 0.7167999148368835)]
 
In [38]: bible_kjv_word2vec_model.most_similar(["god"], topn=30)
Out[38]: 
[(u'christ', 0.7863791584968567),
 (u'lord', 0.7807695865631104),
 (u'salvation', 0.772181510925293),
 (u'truth', 0.7689207792282104),
 (u'spirit', 0.7437840700149536),
 (u'faith', 0.7283919453620911),
 (u'glory', 0.7281145453453064),
 (u'mercy', 0.7187720537185669),
 (u'hosts', 0.7179254293441772),
 (u'gospel', 0.7167999148368835),
 (u'grace', 0.6984926462173462),
 (u'kingdom', 0.6883569359779358),
 (u'word', 0.6729788184165955),
 (u'wisdom', 0.6717872023582458),
 (u'righteousness', 0.6678392291069031),
 (u'judgment', 0.6650925874710083),
 (u'hope', 0.6614011526107788),
 (u'fear', 0.6607920527458191),
 (u'power', 0.6554194092750549),
 (u'who', 0.6502907276153564),
 (u'law', 0.6491219401359558),
 (u'name', 0.6448863744735718),
 (u'commandment', 0.6375595331192017),
 (u'covenant', 0.6254858374595642),
 (u'thus', 0.618947446346283),
 (u'servant', 0.6186059713363647),
 (u'supplication', 0.6185135841369629),
 (u'prayer', 0.6138496398925781),
 (u'world', 0.6129024028778076),
 (u'strength', 0.6128018498420715)]
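
The scores in these lists are cosine similarities between word vectors (the log line about precomputing L2-norms is the normalization step). As a rough illustration, here is a minimal pure-Python sketch of the same ranking idea. The three-dimensional toy vectors and the helper names cosine and most_similar_toy are made up for this example; they are not taken from the trained model or from gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_toy(vectors, word, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors, NOT values from the Bible model.
toy_vectors = {
    "god":    [1.0, 0.0, 0.0],
    "lord":   [0.9, 0.1, 0.0],
    "moses":  [0.1, 1.0, 0.0],
    "joshua": [0.0, 0.9, 0.2],
}

ranking = most_similar_toy(toy_vectors, "god")
print([word for word, _ in ranking])  # ['lord', 'moses']
```

Because cosine similarity is symmetric, the score of one word in another word's list should match the reverse lookup up to floating-point rounding.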

The neighbor list for “god” is hard to read in the Python interpreter, so we can also explore it with the Word2vec visualization demo:

Second, “Jesus”:

In [39]: bible_kjv_word2vec_model.most_similar(["jesus"], topn=30)
Out[39]: 
[(u'david', 0.7681761980056763),
 (u'moses', 0.7144877910614014),
 (u'peter', 0.6529564261436462),
 (u'paul', 0.6443019509315491),
 (u'saul', 0.6364974975585938),
 (u'jeremiah', 0.627094030380249),
 (u'who', 0.614865243434906),
 (u'joshua', 0.6089344024658203),
 (u'abraham', 0.6082783937454224),
 (u'esther', 0.6062946319580078),
 (u'john', 0.6049948930740356),
 (u'king', 0.6048218011856079),
 (u'balaam', 0.597400963306427),
 (u'christ', 0.597069501876831),
 (u'word', 0.5932905673980713),
 (u'samuel', 0.5922203063964844),
 (u'mordecai', 0.5911144614219666),
 (u'him', 0.5875434279441833),
 (u'prophet', 0.5794544219970703),
 (u'pharaoh', 0.5698455572128296),
 (u'messengers', 0.5664196014404297),
 (u'jacob', 0.5617039203643799),
 (u'daniel', 0.5585237741470337),
 (u'saying', 0.5573699474334717),
 (u'god', 0.5562971830368042),
 (u'thus', 0.5508617758750916),
 (u'sworn', 0.5429663062095642),
 (u'master', 0.5384684801101685),
 (u'esaias', 0.5353941321372986),
 (u'he', 0.5342475175857544)]

Word2vec visualization demo for “Jesus”:

Third, “Abraham”:

In [40]: bible_kjv_word2vec_model.most_similar(["abraham"], topn=30)
Out[40]: 
[(u'isaac', 0.9007743000984192),
 (u'jacob', 0.8846864700317383),
 (u'esau', 0.7379217743873596),
 (u'joseph', 0.7284103035926819),
 (u'solomon', 0.7238803505897522),
 (u'daniel', 0.7140511274337769),
 (u'david', 0.7065653800964355),
 (u'moses', 0.6977373957633972),
 (u'attend', 0.6717492341995239),
 (u'jonadab', 0.6707518696784973),
 (u'hezekiah', 0.6678087711334229),
 (u'timothy', 0.6653313636779785),
 (u'jesse', 0.6586748361587524),
 (u'joshua', 0.6527853012084961),
 (u'pharaoh', 0.6472733020782471),
 (u'aaron', 0.6444283127784729),
 (u'church', 0.6429852247238159),
 (u'hamor', 0.6401337385177612),
 (u'jeremiah', 0.6318649649620056),
 (u'john', 0.6243793964385986),
 (u'nun', 0.6216053366661072),
 (u'jephunneh', 0.6153846979141235),
 (u'amoz', 0.6135494709014893),
 (u'praises', 0.6104753017425537),
 (u'joab', 0.609726071357727),
 (u'caleb', 0.6083548069000244),
 (u'jesus', 0.6082783341407776),
 (u'belteshazzar', 0.6075813174247742),
 (u'letters', 0.606890082359314),
 (u'confirmed', 0.606576681137085)]

Word2vec visualization demo for “Abraham”:

Finally, “Moses”:

In [41]: bible_kjv_word2vec_model.most_similar(["moses"], topn=30)
Out[41]: 
[(u'joshua', 0.8120298385620117),
 (u'jeremiah', 0.7481369972229004),
 (u'david', 0.7417373657226562),
 (u'jesus', 0.7144876718521118),
 (u'samuel', 0.7133205533027649),
 (u'abraham', 0.6977373957633972),
 (u'daniel', 0.6943730115890503),
 (u'paul', 0.6868774890899658),
 (u'balaam', 0.6845300793647766),
 (u'john', 0.6638209819793701),
 (u'hezekiah', 0.6563856601715088),
 (u'solomon', 0.6481155157089233),
 (u'letters', 0.6409181952476501),
 (u'messengers', 0.6316184997558594),
 (u'joseph', 0.6288775205612183),
 (u'esaias', 0.616001546382904),
 (u'joab', 0.6061952710151672),
 (u'pharaoh', 0.5786489844322205),
 (u'jacob', 0.5751779079437256),
 (u'church', 0.5716862678527832),
 (u'spake', 0.570705771446228),
 (u'balak', 0.5679874420166016),
 (u'peter', 0.5658782720565796),
 (u'nebuchadnezzar', 0.5637866258621216),
 (u'saul', 0.5635569095611572),
 (u'prophesied', 0.5563102960586548),
 (u'esther', 0.5491665601730347),
 (u'prayed', 0.5470476150512695),
 (u'isaac', 0.5445208549499512),
 (u'aaron', 0.5425761938095093)]

Word2vec visualization demo for “Moses”:

You can play with other word2vec models based on the NLTK corpora in the same way. Enjoy!
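
If you try this with a recent gensim release, note that the session above was recorded with an older one: in gensim 4.x the size argument is called vector_size, and similarity queries go through model.wv. The following is a minimal sketch under those assumptions, written for Python 3. The tiny toy_sents corpus and the helper names clean_sentences and train_toy_model are hypothetical, and the gensim part is skipped when gensim is not installed. It also filters punctuation with a per-character test, which is slightly stricter than the substring test used in the session above, so token counts on a real corpus may differ from the logged ones.

```python
from string import punctuation

PUNCT = set(punctuation)

def clean_sentences(sents):
    """Lowercase tokens and drop tokens made up only of punctuation."""
    return [[w.lower() for w in sent if not all(ch in PUNCT for ch in w)]
            for sent in sents]

def train_toy_model(sents, dim=20):
    """Train a small Word2Vec model; returns None if gensim is missing."""
    try:
        from gensim.models import Word2Vec
    except ImportError:
        return None
    try:
        # gensim 4.x spelling of the vector size argument
        return Word2Vec(sents, vector_size=dim, min_count=1, workers=1, seed=1)
    except TypeError:
        # older gensim releases, as used in the session above
        return Word2Vec(sents, size=dim, min_count=1, workers=1, seed=1)

# Hypothetical stand-in for gutenberg.sents(...); already tokenized.
toy_sents = [
    ["In", "the", "beginning", "God", "created", "the", "heaven", "."],
    ["And", "God", "said", ",", "Let", "there", "be", "light", "--"],
] * 50

cleaned = clean_sentences(toy_sents)
print(cleaned[1])

model = train_toy_model(cleaned)
if model is not None:
    # Neighbors on such a tiny corpus are not meaningful; this only shows the call.
    print(model.wv.most_similar(positive=["god"], topn=3))
```

To run it on a real corpus, replace toy_sents with the sentences of any file listed by gutenberg.fileids().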

Posted by TextMiner
