Dive Into NLTK, Part X: Play with Word2Vec Models based on NLTK Corpus
This is the tenth article in the series “Dive Into NLTK”. Here is an index of all the articles in the series that have been published to date:
Part I: Getting Started with NLTK
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
Part IX: From Text Classification to Sentiment Analysis
Part X: Play With Word2Vec Models based on NLTK Corpus (This article)
NLTK Corpus
Accessing text corpora in NLTK is very easy. NLTK provides a corpus package to read and manage corpus data. For example, NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, known as the Gutenberg Corpus. About Project Gutenberg:
Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works, to “encourage the creation and distribution of eBooks”. It was founded in 1971 by Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of public domain books. The project tries to make these as free as possible, in long-lasting, open formats that can be used on almost any computer. As of 3 October 2015, Project Gutenberg reached 50,000 items in its collection.
We can list the ebook file names of the Gutenberg Corpus in NLTK like this:
Python 2.7.6 (default, Jun 3 2014, 07:43:23)
IPython 3.1.0 -- An enhanced Interactive Python.

In [1]: from nltk.corpus import gutenberg

In [2]: gutenberg.fileids()
Out[2]:
[u'austen-emma.txt',
 u'austen-persuasion.txt',
 u'austen-sense.txt',
 u'bible-kjv.txt',
 u'blake-poems.txt',
 u'bryant-stories.txt',
 u'burgess-busterbrown.txt',
 u'carroll-alice.txt',
 u'chesterton-ball.txt',
 u'chesterton-brown.txt',
 u'chesterton-thursday.txt',
 u'edgeworth-parents.txt',
 u'melville-moby_dick.txt',
 u'milton-paradise.txt',
 u'shakespeare-caesar.txt',
 u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt',
 u'whitman-leaves.txt']

In [3]: austen_emma_words = gutenberg.words('austen-emma.txt')

In [4]: len(austen_emma_words)
Out[4]: 192427

In [5]: austen_emma_sents = gutenberg.sents('austen-emma.txt')

In [6]: len(austen_emma_sents)
Out[6]: 9111

In [7]: austen_emma_sents[0]
Out[7]: [u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', u']']

In [8]: austen_emma_sents[5000]
Out[8]: [u'I', u'know', u'the', u'danger', u'of', u'indulging', u'such', u'speculations', u'.']
Word2Vec
Narrowly speaking, the Word2Vec we refer to here is Google’s Word2Vec project, first proposed by Tomas Mikolov et al. in 2013. For more Word2Vec related papers, tutorials, and coding examples, we recommend “Getting started with Word2Vec” by TextProcessing.
Word2Vec in Python
The most popular Python implementation of Word2Vec is Gensim’s word2vec module: models.word2vec – Deep learning with word2vec. We have written an article about Word2Vec in Python; you can read it first if you are not familiar with the gensim word2vec model: Getting Started with Word2Vec and GloVe in Python.
Bible Word2Vec Model
The Gutenberg Corpus in NLTK includes a file named ‘bible-kjv.txt’, which is “The King James Version of the Bible”:
The King James Version (KJV), also known as Authorized [sic] Version (AV) or simply King James Bible (KJB), is an English translation of the Christian Bible for the Church of England begun in 1604 and completed in 1611.[a] The books of the King James Version include the 39 books of the Old Testament, an intertestamental section containing 14 books of the Apocrypha, and the 27 books of the New Testament.
Let’s train a Bible Word2Vec model on the bible-kjv.txt corpus with the gensim word2vec module:
In [14]: import logging

In [15]: from gensim.models import word2vec

In [16]: logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [17]: bible_kjv_words = gutenberg.words('bible-kjv.txt')

In [18]: len(bible_kjv_words)
Out[18]: 1010654

In [19]: bible_kjv_sents = gutenberg.sents('bible-kjv.txt')

In [20]: len(bible_kjv_sents)
Out[20]: 30103

In [21]: bible_kjv_sents[0]
Out[21]: [u'[', u'The', u'King', u'James', u'Bible', u']']

In [22]: bible_kjv_sents[1]
Out[22]: [u'The', u'Old', u'Testament', u'of', u'the', u'King', u'James', u'Bible']

In [23]: bible_kjv_sents[2]
Out[23]: [u'The', u'First', u'Book', u'of', u'Moses', u':', u'Called', u'Genesis']

In [24]: bible_kjv_sents[3]
Out[24]: [u'1', u':', u'1', u'In', u'the', u'beginning', u'God', u'created', u'the', u'heaven', u'and', u'the', u'earth', u'.']

In [25]: from string import punctuation

In [27]: punctuation
Out[27]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# The words in the NLTK corpus are already tokenized, so we just discard
# the punctuation tokens and lowercase the remaining words.
In [29]: discard_punctuation_and_lowercased_sents = [[word.lower() for word in sent if word not in punctuation] for sent in bible_kjv_sents]

In [30]: discard_punctuation_and_lowercased_sents[0]
Out[30]: [u'the', u'king', u'james', u'bible']

In [31]: discard_punctuation_and_lowercased_sents[1]
Out[31]: [u'the', u'old', u'testament', u'of', u'the', u'king', u'james', u'bible']

In [32]: discard_punctuation_and_lowercased_sents[2]
Out[32]: [u'the', u'first', u'book', u'of', u'moses', u'called', u'genesis']

In [33]: discard_punctuation_and_lowercased_sents[3]
Out[33]: [u'1', u'1', u'in', u'the', u'beginning', u'god', u'created', u'the', u'heaven', u'and', u'the', u'earth']

In [34]: bible_kjv_word2vec_model = word2vec.Word2Vec(discard_punctuation_and_lowercased_sents, min_count=5, size=200)
2017-03-26 21:05:20,811 : INFO : collecting all words and their counts
2017-03-26 21:05:20,811 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-03-26 21:05:20,972 : INFO : PROGRESS: at sentence #10000, processed 315237 words, keeping 7112 word types
2017-03-26 21:05:21,103 : INFO : PROGRESS: at sentence #20000, processed 572536 words, keeping 10326 word types
2017-03-26 21:05:21,247 : INFO : PROGRESS: at sentence #30000, processed 851126 words, keeping 12738 word types
2017-03-26 21:05:21,249 : INFO : collected 12752 word types from a corpus of 854209 raw words and 30103 sentences
2017-03-26 21:05:21,249 : INFO : Loading a fresh vocabulary
2017-03-26 21:05:21,441 : INFO : min_count=5 retains 5428 unique words (42% of original 12752, drops 7324)
2017-03-26 21:05:21,441 : INFO : min_count=5 leaves 841306 word corpus (98% of original 854209, drops 12903)
2017-03-26 21:05:21,484 : INFO : deleting the raw counts dictionary of 12752 items
2017-03-26 21:05:21,485 : INFO : sample=0.001 downsamples 62 most-common words
2017-03-26 21:05:21,485 : INFO : downsampling leaves estimated 583788 word corpus (69.4% of prior 841306)
2017-03-26 21:05:21,485 : INFO : estimated required memory for 5428 words and 200 dimensions: 11398800 bytes
2017-03-26 21:05:21,520 : INFO : resetting layer weights
2017-03-26 21:05:21,708 : INFO : training model with 3 workers on 5428 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-03-26 21:05:21,708 : INFO : expecting 30103 sentences, matching count from corpus used for vocabulary survey
2017-03-26 21:05:22,721 : INFO : PROGRESS: at 16.10% examples, 474025 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:23,728 : INFO : PROGRESS: at 34.20% examples, 500893 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:24,734 : INFO : PROGRESS: at 49.48% examples, 482782 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:25,742 : INFO : PROGRESS: at 60.97% examples, 442365 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:26,758 : INFO : PROGRESS: at 76.39% examples, 443206 words/s, in_qsize 6, out_qsize 0
2017-03-26 21:05:27,770 : INFO : PROGRESS: at 95.14% examples, 460213 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:28,002 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-03-26 21:05:28,007 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-03-26 21:05:28,013 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-03-26 21:05:28,013 : INFO : training on 4271045 raw words (2918320 effective words) took 6.3s, 463711 effective words/s

In [35]: bible_kjv_word2vec_model.save("bible_word2vec_gensim")
2017-03-26 21:06:03,500 : INFO : saving Word2Vec object under bible_word2vec_gensim, separately None
2017-03-26 21:06:03,501 : INFO : not storing attribute syn0norm
2017-03-26 21:06:03,501 : INFO : not storing attribute cum_table
2017-03-26 21:06:03,646 : INFO : saved bible_word2vec_gensim

In [36]: bible_kjv_word2vec_model.wv.save_word2vec_format("bible_word2vec_org", "bible_word2vec_vocabulary")
2017-03-26 21:06:51,136 : INFO : storing vocabulary in bible_word2vec_vocabulary
2017-03-26 21:06:51,186 : INFO : storing 5428x200 projection weights into bible_word2vec_org
We have trained the Bible Word2Vec model from the NLTK bible-kjv corpus; now we can test it with some words.
First, “God”:
In [37]: bible_kjv_word2vec_model.most_similar(["god"])
2017-03-26 21:14:27,320 : INFO : precomputing L2-norms of word weight vectors
Out[37]:
[(u'christ', 0.7863791584968567),
 (u'lord', 0.7807695865631104),
 (u'salvation', 0.772181510925293),
 (u'truth', 0.7689207792282104),
 (u'spirit', 0.7437840700149536),
 (u'faith', 0.7283919453620911),
 (u'glory', 0.7281145453453064),
 (u'mercy', 0.7187720537185669),
 (u'hosts', 0.7179254293441772),
 (u'gospel', 0.7167999148368835)]

In [38]: bible_kjv_word2vec_model.most_similar(["god"], topn=30)
Out[38]:
[(u'christ', 0.7863791584968567),
 (u'lord', 0.7807695865631104),
 (u'salvation', 0.772181510925293),
 (u'truth', 0.7689207792282104),
 (u'spirit', 0.7437840700149536),
 (u'faith', 0.7283919453620911),
 (u'glory', 0.7281145453453064),
 (u'mercy', 0.7187720537185669),
 (u'hosts', 0.7179254293441772),
 (u'gospel', 0.7167999148368835),
 (u'grace', 0.6984926462173462),
 (u'kingdom', 0.6883569359779358),
 (u'word', 0.6729788184165955),
 (u'wisdom', 0.6717872023582458),
 (u'righteousness', 0.6678392291069031),
 (u'judgment', 0.6650925874710083),
 (u'hope', 0.6614011526107788),
 (u'fear', 0.6607920527458191),
 (u'power', 0.6554194092750549),
 (u'who', 0.6502907276153564),
 (u'law', 0.6491219401359558),
 (u'name', 0.6448863744735718),
 (u'commandment', 0.6375595331192017),
 (u'covenant', 0.6254858374595642),
 (u'thus', 0.618947446346283),
 (u'servant', 0.6186059713363647),
 (u'supplication', 0.6185135841369629),
 (u'prayer', 0.6138496398925781),
 (u'world', 0.6129024028778076),
 (u'strength', 0.6128018498420715)]
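The scores returned by most_similar are cosine similarities between the 200-dimensional word vectors. As a reminder of what is being computed, here is the calculation in plain Python, a sketch using small hand-made 3-dimensional vectors rather than the actual model vectors:

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors: dot(v1, v2) / (|v1| * |v2|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Hand-made vectors purely for illustration, not taken from the model.
god = [0.9, 0.8, 0.1]
christ = [0.8, 0.9, 0.2]
mat = [0.1, 0.0, 0.9]

print(cosine_similarity(god, christ))  # near 1.0: the vectors point the same way
print(cosine_similarity(god, mat))     # much smaller: nearly orthogonal vectors
```

A score of 1.0 means identical direction and scores near 0 mean unrelated directions, which is why ‘christ’ at 0.786 is ranked as the nearest neighbor of ‘god’ above.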
The output is not easy to read in the Python interpreter; we can also explore it with the Word2Vec visualization demo:
Second, “Jesus”:
In [39]: bible_kjv_word2vec_model.most_similar(["jesus"], topn=30)
Out[39]:
[(u'david', 0.7681761980056763),
 (u'moses', 0.7144877910614014),
 (u'peter', 0.6529564261436462),
 (u'paul', 0.6443019509315491),
 (u'saul', 0.6364974975585938),
 (u'jeremiah', 0.627094030380249),
 (u'who', 0.614865243434906),
 (u'joshua', 0.6089344024658203),
 (u'abraham', 0.6082783937454224),
 (u'esther', 0.6062946319580078),
 (u'john', 0.6049948930740356),
 (u'king', 0.6048218011856079),
 (u'balaam', 0.597400963306427),
 (u'christ', 0.597069501876831),
 (u'word', 0.5932905673980713),
 (u'samuel', 0.5922203063964844),
 (u'mordecai', 0.5911144614219666),
 (u'him', 0.5875434279441833),
 (u'prophet', 0.5794544219970703),
 (u'pharaoh', 0.5698455572128296),
 (u'messengers', 0.5664196014404297),
 (u'jacob', 0.5617039203643799),
 (u'daniel', 0.5585237741470337),
 (u'saying', 0.5573699474334717),
 (u'god', 0.5562971830368042),
 (u'thus', 0.5508617758750916),
 (u'sworn', 0.5429663062095642),
 (u'master', 0.5384684801101685),
 (u'esaias', 0.5353941321372986),
 (u'he', 0.5342475175857544)]
Word2Vec visualization demo for “Jesus”:
Third, “Abraham”:
In [40]: bible_kjv_word2vec_model.most_similar(["abraham"], topn=30)
Out[40]:
[(u'isaac', 0.9007743000984192),
 (u'jacob', 0.8846864700317383),
 (u'esau', 0.7379217743873596),
 (u'joseph', 0.7284103035926819),
 (u'solomon', 0.7238803505897522),
 (u'daniel', 0.7140511274337769),
 (u'david', 0.7065653800964355),
 (u'moses', 0.6977373957633972),
 (u'attend', 0.6717492341995239),
 (u'jonadab', 0.6707518696784973),
 (u'hezekiah', 0.6678087711334229),
 (u'timothy', 0.6653313636779785),
 (u'jesse', 0.6586748361587524),
 (u'joshua', 0.6527853012084961),
 (u'pharaoh', 0.6472733020782471),
 (u'aaron', 0.6444283127784729),
 (u'church', 0.6429852247238159),
 (u'hamor', 0.6401337385177612),
 (u'jeremiah', 0.6318649649620056),
 (u'john', 0.6243793964385986),
 (u'nun', 0.6216053366661072),
 (u'jephunneh', 0.6153846979141235),
 (u'amoz', 0.6135494709014893),
 (u'praises', 0.6104753017425537),
 (u'joab', 0.609726071357727),
 (u'caleb', 0.6083548069000244),
 (u'jesus', 0.6082783341407776),
 (u'belteshazzar', 0.6075813174247742),
 (u'letters', 0.606890082359314),
 (u'confirmed', 0.606576681137085)]
Word2vec visualization demo for “Abraham”:
Finally, “Moses”:
In [41]: bible_kjv_word2vec_model.most_similar(["moses"], topn=30)
Out[41]:
[(u'joshua', 0.8120298385620117),
 (u'jeremiah', 0.7481369972229004),
 (u'david', 0.7417373657226562),
 (u'jesus', 0.7144876718521118),
 (u'samuel', 0.7133205533027649),
 (u'abraham', 0.6977373957633972),
 (u'daniel', 0.6943730115890503),
 (u'paul', 0.6868774890899658),
 (u'balaam', 0.6845300793647766),
 (u'john', 0.6638209819793701),
 (u'hezekiah', 0.6563856601715088),
 (u'solomon', 0.6481155157089233),
 (u'letters', 0.6409181952476501),
 (u'messengers', 0.6316184997558594),
 (u'joseph', 0.6288775205612183),
 (u'esaias', 0.616001546382904),
 (u'joab', 0.6061952710151672),
 (u'pharaoh', 0.5786489844322205),
 (u'jacob', 0.5751779079437256),
 (u'church', 0.5716862678527832),
 (u'spake', 0.570705771446228),
 (u'balak', 0.5679874420166016),
 (u'peter', 0.5658782720565796),
 (u'nebuchadnezzar', 0.5637866258621216),
 (u'saul', 0.5635569095611572),
 (u'prophesied', 0.5563102960586548),
 (u'esther', 0.5491665601730347),
 (u'prayed', 0.5470476150512695),
 (u'isaac', 0.5445208549499512),
 (u'aaron', 0.5425761938095093)]
Word2vec visualization demo for “Moses”:
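The visualization demo code itself is not included with this article, but a common way to build such a view is to project the word vectors down to two dimensions with t-SNE and plot them. The following is only a minimal sketch of that idea, assuming scikit-learn and matplotlib are installed; it uses random stand-in vectors so it runs on its own, where in practice you would look up `bible_kjv_word2vec_model.wv[word]` for each word you want to plot.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical stand-in vectors: replace with model.wv[word] lookups
# from a trained model to visualize real neighborhoods.
words = ["god", "lord", "christ", "moses", "david", "abraham", "isaac",
         "jacob", "joshua", "samuel", "paul", "peter", "john", "esther",
         "egypt", "israel", "jerusalem", "temple", "prophet", "king"]
rng = np.random.RandomState(0)
vectors = rng.normal(size=(len(words), 200))

# Project the 200-dimensional vectors down to 2-D for plotting.
# Perplexity must be smaller than the number of points.
coords = TSNE(n_components=2, perplexity=5, init="random",
              random_state=0).fit_transform(vectors)

plt.figure(figsize=(8, 6))
plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.savefig("bible_word2vec_tsne.png")
```

With real trained vectors, related names such as ‘abraham’, ‘isaac’, and ‘jacob’ tend to land close together in the 2-D plot, which is what the visualization demos above illustrate.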
You can play with other Word2Vec models based on the NLTK corpora in the same way. Enjoy!
Posted by TextMiner