Update Korean, Russian, French, German, Spanish Wikipedia Word2Vec Model for Word Similarity

I launched WordSimilarity in April, which focuses on computing the similarity between two words with word2vec models trained on Wikipedia data. The website already has an English word2vec model for English word similarity (Exploiting Wikipedia Word Similarity by Word2Vec), a Chinese word2vec model for Chinese word similarity (Training a Chinese Wikipedia Word2Vec Model by Gensim and Jieba), and a Japanese word2vec model for Japanese word similarity (Training a Japanese Wikipedia Word2Vec Model by Gensim and MeCab).

Recently, I added Korean, Russian, French, German, and Spanish Wikipedia word2vec models for word similarity, so the site now supports word similarity search and computation in eight languages. You can search for any word on the WordSimilarity website, or compute the similarity of any two words via the Word Similarity API.

Training a Wikipedia word2vec model is easy: you can use the “Wikipedia_Word2vec” scripts to train a model for any language edition of Wikipedia. For example, you can download the latest Korean Wikipedia dump:

https://dumps.wikimedia.org/kowiki/latest/kowiki-latest-pages-articles.xml.bz2

Then use the process_wiki.py script to get the Korean Wiki text:

python ../v1/process_wiki.py kowiki-latest-pages-articles.xml.bz2 wiki.ko.text
2017-08-07 23:16:44,927: INFO: running ../v1/process_wiki.py kowiki-latest-pages-articles.xml.bz2 wiki.ko.text
2017-08-07 23:16:56,500: INFO: Saved 10000 articles
2017-08-07 23:17:05,333: INFO: Saved 20000 articles
2017-08-07 23:17:13,959: INFO: Saved 30000 articles
2017-08-07 23:17:21,510: INFO: Saved 40000 articles
2017-08-07 23:17:28,125: INFO: Saved 50000 articles
2017-08-07 23:17:35,814: INFO: Saved 60000 articles
2017-08-07 23:17:42,009: INFO: Saved 70000 articles
2017-08-07 23:17:47,201: INFO: Saved 80000 articles
2017-08-07 23:17:54,440: INFO: Saved 90000 articles
2017-08-07 23:18:01,793: INFO: Saved 100000 articles
2017-08-07 23:18:09,626: INFO: Saved 110000 articles
2017-08-07 23:18:17,144: INFO: Saved 120000 articles
2017-08-07 23:18:24,624: INFO: Saved 130000 articles
2017-08-07 23:18:30,092: INFO: Saved 140000 articles
2017-08-07 23:18:37,414: INFO: Saved 150000 articles
2017-08-07 23:18:44,703: INFO: Saved 160000 articles
2017-08-07 23:18:55,223: INFO: Saved 170000 articles
2017-08-07 23:19:06,440: INFO: Saved 180000 articles
2017-08-07 23:19:13,733: INFO: Saved 190000 articles
2017-08-07 23:19:21,682: INFO: Saved 200000 articles
2017-08-07 23:19:28,754: INFO: Saved 210000 articles
2017-08-07 23:19:38,376: INFO: Saved 220000 articles
2017-08-07 23:19:45,860: INFO: Saved 230000 articles
2017-08-07 23:19:55,526: INFO: Saved 240000 articles
2017-08-07 23:20:02,927: INFO: Saved 250000 articles
2017-08-07 23:20:08,585: INFO: finished iterating over Wikipedia corpus of 257403 documents with 73517435 positions (total 1008924 articles, 78407073 positions before pruning articles shorter than 50 words)
2017-08-07 23:20:08,585: INFO: Finished Saved 257403 articles

The extracted Korean text is written one article per line, with tokens separated by spaces:

음계 音階 음악에서 음높이 pitch 순서로 음의 집합을 말한다 악곡을 주로 구성하는 음을 나타낸 것이며 음계의 종류에 따라 곡의 분위기가 달라진다 음계의 각각의 음에는 위치에 따라 도수가 붙는다 음계의 종류 음계는 음계가 포함하고 있는 음정 interval 따라서 이름을 붙일 있다 예시 온음계 반음계 온음음계 또는 음계가 포함하고 있는 서로 다른 피치 클래스의 수에 따라서 이름을 붙일 있다 팔음 음계 칠음 음계 육음 음계와 오음 음계 사음 음계 삼음 음계와 이음 음계 모노토닉 음계 음계의 음정 interval 뿐만 아니라 음계를 만드는 note 수가 문화권의 음악에 독특한 음악적 특징을 지니게 한다 어떤 음계의 음의 수보다 음의 거리 interval pitch distance 음악의 소리에 대해서 많은 것을 알려준다 온음계와 반음계 온음계와 반음계 半音階 서양 음악에서 쓰이는 용어이다 자체로는 음계에 관한 말이지만 온음계적 반음계적인 선율 화음 화성 진행 등의 표현으로도 쓰인다 대부분의 경우 온음계는 음으로 이루어진 장음계를 말한다 세기 음악론에서는 반음계가 아닌 모든 음계 이를테면 팔음음계 말할 쓰이기도 한다 반음계는 개의 반음으로 이루어진 음계를 말한다 계이름 계이름은 음계를 기준으로 음의 이름이다 장음계를 이루는 음의 계이름은 으뜸음부터 위로 올라가면서 각각 도가 된다 한국과 중국의 전통 음계 서양 음악에서는 시로 음계가 많이 쓰이지만 한국 전통 음악에는 황종 黃鍾 태주 太蔟 중려 仲呂 임종 林鍾 무역 無射 으로 음계가 많이 쓰이고 중국 전통 음악에는 변치 變徵 올림화 fa 변궁 變宮 시로 음계를 많이 쓴다 한국 전통 음악에서는 음계 외에도 음계 또는 악계통에서는 음계 등이 쓰인다 각주

Now you can train a Korean word2vec model with train_word2vec_model.py:

python ../v1/train_word2vec_model.py wiki.ko.text ko_word2vec_gensim ko_word2vec_org
2017-08-07 23:27:18,800: INFO: running ../v1/train_word2vec_model.py wiki.ko.text ko_word2vec_gensim ko_word2vec_org
2017-08-07 23:27:18,800: INFO: collecting all words and their counts
2017-08-07 23:27:18,800: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-08-07 23:27:20,811: INFO: PROGRESS: at sentence #10000, processed 6190288 words, keeping 943409 word types
2017-08-07 23:27:22,675: INFO: PROGRESS: at sentence #20000, processed 11027345 words, keeping 1430915 word types
2017-08-07 23:27:24,123: INFO: PROGRESS: at sentence #30000, processed 15581115 words, keeping 1819605 word types
2017-08-07 23:27:25,371: INFO: PROGRESS: at sentence #40000, processed 19478048 words, keeping 2113958 word types
2017-08-07 23:27:26,488: INFO: PROGRESS: at sentence #50000, processed 22883106 words, keeping 2355537 word types
2017-08-07 23:27:27,613: INFO: PROGRESS: at sentence #60000, processed 26274833 words, keeping 2587852 word types
2017-08-07 23:27:29,547: INFO: PROGRESS: at sentence #70000, processed 29494682 words, keeping 2826484 word types
2017-08-07 23:27:30,524: INFO: PROGRESS: at sentence #80000, processed 32357490 words, keeping 3024456 word types
2017-08-07 23:27:31,432: INFO: PROGRESS: at sentence #90000, processed 35020390 words, keeping 3197121 word types
2017-08-07 23:27:32,367: INFO: PROGRESS: at sentence #100000, processed 37775556 words, keeping 3363529 word types
2017-08-07 23:27:33,324: INFO: PROGRESS: at sentence #110000, processed 40561155 words, keeping 3532934 word types
2017-08-07 23:27:34,236: INFO: PROGRESS: at sentence #120000, processed 43216779 words, keeping 3686615 word types
......
2017-08-07 23:27:40,064: INFO: PROGRESS: at sentence #190000, processed 59806298 words, keeping 4598084 word types
2017-08-07 23:27:40,851: INFO: PROGRESS: at sentence #200000, processed 62026306 words, keeping 4719727 word types
2017-08-07 23:27:41,546: INFO: PROGRESS: at sentence #210000, processed 63989571 words, keeping 4816019 word types
2017-08-07 23:27:42,308: INFO: PROGRESS: at sentence #220000, processed 66165193 words, keeping 4926818 word types
2017-08-07 23:27:43,021: INFO: PROGRESS: at sentence #230000, processed 68174316 words, keeping 5021299 word types
2017-08-07 23:27:43,711: INFO: PROGRESS: at sentence #240000, processed 70105137 words, keeping 5111127 word types
2017-08-07 23:27:44,408: INFO: PROGRESS: at sentence #250000, processed 72026084 words, keeping 5200955 word types
2017-08-07 23:27:44,955: INFO: collected 5269999 word types from a corpus of 73517435 raw words and 257511 sentences
2017-08-07 23:27:44,956: INFO: Loading a fresh vocabulary
2017-08-07 23:27:55,680: INFO: min_count=5 retains 912014 unique words (17% of original 5269999, drops 4357985)
2017-08-07 23:27:55,680: INFO: min_count=5 leaves 67141678 word corpus (91% of original 73517435, drops 6375757)
2017-08-07 23:27:57,597: INFO: deleting the raw counts dictionary of 5269999 items
2017-08-07 23:27:58,733: INFO: sample=0.001 downsamples 3 most-common words
2017-08-07 23:27:58,734: INFO: downsampling leaves estimated 66065873 word corpus (98.4% of prior 67141678)
2017-08-07 23:27:58,734: INFO: estimated required memory for 912014 words and 200 dimensions: 1915229400 bytes
2017-08-07 23:28:01,357: INFO: resetting layer weights
2017-08-07 23:28:11,028: INFO: training model with 12 workers on 912014 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-08-07 23:28:11,029: INFO: expecting 257511 sentences, matching count from corpus used for vocabulary survey
2017-08-07 23:28:12,046: INFO: PROGRESS: at 0.07% examples, 531447 words/s, in_qsize 0, out_qsize 0
2017-08-07 23:28:13,050: INFO: PROGRESS: at 0.14% examples, 597438 words/s, in_qsize 0, out_qsize 0
2017-08-07 23:28:14,055: INFO: PROGRESS: at 0.21% examples, 614817 words/s, in_qsize 0, out_qsize 1
2017-08-07 23:28:15,055: INFO: PROGRESS: at 0.29% examples, 613850 words/s, in_qsize 0, out_qsize 1
2017-08-07 23:28:16,055: INFO: PROGRESS: at 0.36% examples, 608353 words/s, in_qsize 0, out_qsize 0
2017-08-07 23:28:17,060: INFO: PROGRESS: at 0.50% examples, 610351 words/s, in_qsize 0, out_qsize 1
2017-08-07 23:28:18,075: INFO: PROGRESS: at 0.60% examples, 619027 words/s, in_qsize 0, out_qsize 0
2017-08-07 23:28:19,083: INFO: PROGRESS: at 0.68% examples, 618513 words/s, in_qsize 0, out_qsize 1
2017-08-07 23:28:20,086: INFO: PROGRESS: at 0.77% examples, 615396 words/s, in_qsize 0, out_qsize 0
2017-08-07 23:28:21,101: INFO: PROGRESS: at 0.86% examples, 616795 words/s, in_qsize 0, out_qsize 0
2017-08-07 23:28:22,117: INFO: PROGRESS: at 0.97% examples, 612711 words/s, in_qsize 0, out_qsize 1
2017-08-07 23:28:23,128: INFO: PROGRESS: at 1.09% examples, 618301 words/s, in_qsize 0, out_qsize 0
2017-08-07 23:28:24,157: INFO: PROGRESS: at 1.21% examples, 612963 words/s, in_qsize 0, out_qsize 0
2017-08-07 23:28:25,162: INFO: PROGRESS: at 1.29% examples, 607414 words/s, in_qsize 0, out_qsize 0
......
2017-08-07 23:36:48,694: INFO: PROGRESS: at 98.88% examples, 633305 words/s, in_qsize 0, out_qsize 0
2017-08-07 23:36:49,711: INFO: PROGRESS: at 99.20% examples, 633379 words/s, in_qsize 0, out_qsize 0
2017-08-07 23:36:50,715: INFO: PROGRESS: at 99.48% examples, 633352 words/s, in_qsize 0, out_qsize 0
2017-08-07 23:36:51,734: INFO: PROGRESS: at 99.76% examples, 633423 words/s, in_qsize 0, out_qsize 0
2017-08-07 23:36:52,495: INFO: worker thread finished; awaiting finish of 11 more threads
2017-08-07 23:36:52,498: INFO: worker thread finished; awaiting finish of 10 more threads
2017-08-07 23:36:52,500: INFO: worker thread finished; awaiting finish of 9 more threads
2017-08-07 23:36:52,500: INFO: worker thread finished; awaiting finish of 8 more threads
2017-08-07 23:36:52,500: INFO: worker thread finished; awaiting finish of 7 more threads
2017-08-07 23:36:52,501: INFO: worker thread finished; awaiting finish of 6 more threads
2017-08-07 23:36:52,501: INFO: worker thread finished; awaiting finish of 5 more threads
2017-08-07 23:36:52,501: INFO: worker thread finished; awaiting finish of 4 more threads
2017-08-07 23:36:52,507: INFO: worker thread finished; awaiting finish of 3 more threads
2017-08-07 23:36:52,512: INFO: worker thread finished; awaiting finish of 2 more threads
2017-08-07 23:36:52,513: INFO: worker thread finished; awaiting finish of 1 more threads
2017-08-07 23:36:52,522: INFO: worker thread finished; awaiting finish of 0 more threads
2017-08-07 23:36:52,522: INFO: training on 367587175 raw words (330330593 effective words) took 521.5s, 633443 effective words/s
2017-08-07 23:36:52,523: INFO: saving Word2Vec object under ko_word2vec_gensim, separately None
2017-08-07 23:36:52,523: INFO: not storing attribute syn0norm
2017-08-07 23:36:52,523: INFO: storing np array 'syn0' to ko_word2vec_gensim.wv.syn0.npy
2017-08-07 23:36:52,874: INFO: storing np array 'syn1neg' to ko_word2vec_gensim.syn1neg.npy
2017-08-07 23:36:53,206: INFO: not storing attribute cum_table
2017-08-07 23:37:02,385: INFO: saved ko_word2vec_gensim
2017-08-07 23:37:05,919: INFO: storing 912014x200 projection weights into ko_word2vec_org

In the same way you trained the Korean word similarity model, you can train models for the other languages, such as the Russian, French, German, and Spanish word similarity models. Just enjoy it.

Posted by TextMiner

