Dive Into NLTK, Part I: Getting Started with NLTK

NLTK is the most famous Python Natural Language Processing toolkit, and here I will give a detailed tutorial about it, especially for text mining and text analysis.

This is the first article in the series “Dive Into NLTK”; here is an index of all the articles in the series that have been published to date:

Part I: Getting Started with NLTK (this article)
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
Part IX: From Text Classification to Sentiment Analysis
Part X: Play With Word2Vec Models based on NLTK Corpus

About NLTK

Here is a description from the NLTK official site:

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Installing NLTK

The following steps were tested on my Mac OS machine and on a VPS running Ubuntu 12.04; they only require Python 2.6 or 2.7, but I didn’t test them on a Windows computer. I assume you can write some Python code; familiarity with Python modules and packages is also recommended. Here are the steps to install NLTK on Mac/Unix:

Install Setuptools: http://pypi.python.org/pypi/setuptools
Install Pip: run sudo easy_install pip
Install Numpy (optional): run sudo pip install -U numpy
Install PyYAML and NLTK: run sudo pip install -U pyyaml nltk
Test installation: run python then type import nltk
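
If all of the above succeeded, a quick shell one-liner should import NLTK and print its version (the exact version number depends on when you install):

python -c "import nltk; print(nltk.__version__)"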

Installing NLTK Data
After installing NLTK, you need to install NLTK Data, which includes a lot of corpora, grammars, models, and so on. Without NLTK Data, NLTK is nothing. You can find the complete NLTK Data list here: http://nltk.org/nltk_data/

The simplest way to install NLTK Data is to run the Python interpreter and type the following commands; this example was run on Mac OS:
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()

A new window should open, showing the NLTK Downloader (on Mac; it is probably the same on Windows):

[Figure: the NLTK Downloader window on Mac]

Click on the File menu and select Change Download Directory if you want a custom location. Next, select the packages or collections you want to download; I suggest selecting “all” and downloading everything NLTK needs.
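
If you don’t want everything, nltk.download() also accepts a package or collection identifier directly, so you can fetch just what you need; for example (identifiers as shown in the downloader’s list):

>>> import nltk
>>> nltk.download('brown')   # the Brown Corpus used in the tests below
>>> nltk.download('punkt')   # Punkt models used by sent_tokenize/word_tokenize
>>> nltk.download('book')    # everything used in the NLTK book examples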

If you are installing NLTK Data on a Linux VPS with no graphical interface, no window will open, but you can still use the nltk.download() command above. Follow these steps to download all of nltk_data:

Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> l

Packages:
[*] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
[*] abc................. Australian Broadcasting Commission 2006
[*] alpino.............. Alpino Dutch Treebank
[*] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
Extraction Systems in Biology)
[*] brown............... Brown Corpus
[*] brown_tei........... Brown Corpus (TEI XML Version)
[*] cess_cat............ CESS-CAT Treebank
[*] cess_esp............ CESS-ESP Treebank
[*] chat80.............. Chat-80 Data Files
[*] city_database....... City Database
[*] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
[*] comtrans............ ComTrans Corpus Sample
[*] conll2000........... CONLL 2000 Chunking Corpus
[*] conll2002........... CONLL 2002 Named Entity Recognition Corpus
[*] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan
and Basque Subset)
[*] dependency_treebank. Dependency Parsed Treebank
[*] europarl_raw........ Sample European Parliament Proceedings Parallel
Corpus
Hit Enter to continue:
....
Downloader> d

Download which package (l=list; x=cancel)?
Identifier> all

If you have downloaded everything (corpora, models, grammars) NLTK needs, you can test it by running:

Downloader> u

If it shows “Nothing to update”, everything is OK.

Another way to install NLTK Data is from the command line. I didn’t test this way; the following is from the official site:

Python 2.5-2.7: Run the command python -m nltk.downloader all. To ensure central installation, run the command sudo python -m nltk.downloader -d /usr/share/nltk_data all.
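
NLTK also reads the NLTK_DATA environment variable when searching for data, so if you want the data in a custom, non-root location, a sketch like the following should work (the directory path here is just an example):

export NLTK_DATA=/home/yourname/nltk_data
python -m nltk.downloader -d $NLTK_DATA all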

If you run into problems when downloading NLTK Data, such as download timeouts or other strange errors, I suggest you download the NLTK Data directly from the nltk_data GitHub page:

https://github.com/nltk/nltk_data

It says that “NLTK Data lives in the gh-pages branch of this repository.”, so you can visit that branch:

https://github.com/nltk/nltk_data/tree/master

Download the zip file and unzip it, then copy the six sub-directories under “packages” into your nltk_data directory: chunkers, corpora, help, stemmers, taggers, tokenizers.

This is perhaps the best unofficial way to install NLTK Data.
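
Whichever way you installed the data, you can verify that NLTK actually finds it: nltk.data.path lists the directories NLTK searches, and nltk.data.find() raises a LookupError when a resource is missing. A minimal check:

>>> import nltk.data
>>> nltk.data.path                    # directories NLTK searches for nltk_data
>>> nltk.data.find('corpora/brown')   # raises LookupError if the Brown Corpus is missing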

Test NLTK

1) Test Brown Corpus:

>>> from nltk.corpus import brown
>>> brown.words()[0:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
>>> brown.tagged_words()[0:10]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
>>> len(brown.words())
1161192
>>> dir(brown)
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_add', '_c2f', '_delimiter', '_encoding', '_f2c', '_file', '_fileids', '_get_root', '_init', '_map', '_para_block_reader', '_pattern', '_resolve', '_root', '_sent_tokenizer', '_sep', '_tag_mapping_function', '_word_tokenizer', 'abspath', 'abspaths', 'categories', 'encoding', 'fileids', 'open', 'paras', 'raw', 'readme', 'root', 'sents', 'tagged_paras', 'tagged_sents', 'tagged_words', 'words']
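
With the corpus available you can already compute simple statistics; for example, a frequency distribution over the Brown Corpus with nltk.FreqDist (on recent NLTK versions FreqDist behaves like a Counter, so most_common() is available):

>>> from nltk import FreqDist
>>> from nltk.corpus import brown
>>> fdist = FreqDist(w.lower() for w in brown.words())
>>> fdist.most_common(5)   # the five most frequent word forms in the corpus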

2) Test NLTK Book Resources:

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

>>> dir(text1)
['_CONTEXT_RE', '_COPY_TOKENS', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__len__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_collocations', '_context', '_num', '_vocab', '_window_size', 'collocations', 'common_contexts', 'concordance', 'count', 'dispersion_plot', 'findall', 'generate', 'index', 'name', 'plot', 'readability', 'similar', 'tokens', 'vocab']
>>> len(text1)
260819
>>> text1.collocations()
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
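
The other methods listed by dir(text1) work the same way; for instance, concordance() prints each occurrence of a word with its surrounding context, and similar() lists words that appear in similar contexts:

>>> text1.concordance("whale")   # every occurrence of "whale" in context
>>> text1.similar("whale")       # words distributionally similar to "whale"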

3) Sentence Tokenization (sentence boundary detection / sentence segmentation), Word Tokenization, and POS Tagging:

>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = "Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you'll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you'll learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI."
>>> sents = sent_tokenize(text)
>>> sents
['Machine learning is the science of getting computers to act without being explicitly programmed.', 'In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome.', 'Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it.', 'Many researchers also think it is the best way to make progress towards human-level AI.', 'In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself.', "More importantly, you'll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems.", "Finally, you'll learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI."]
>>> len(sents)
7
>>> tokens = word_tokenize(text)
>>> tokens
['Machine', 'learning', 'is', 'the', 'science', 'of', 'getting', 'computers', 'to', 'act', 'without', 'being', 'explicitly', 'programmed.', 'In', 'the', 'past', 'decade', ',', 'machine', 'learning', 'has', 'given', 'us', 'self-driving', 'cars', ',', 'practical', 'speech', 'recognition', ',', 'effective', 'web', 'search', ',', 'and', 'a', 'vastly', 'improved', 'understanding', 'of', 'the', 'human', 'genome.', 'Machine', 'learning', 'is', 'so', 'pervasive', 'today', 'that', 'you', 'probably', 'use', 'it', 'dozens', 'of', 'times', 'a', 'day', 'without', 'knowing', 'it.', 'Many', 'researchers', 'also', 'think', 'it', 'is', 'the', 'best', 'way', 'to', 'make', 'progress', 'towards', 'human-level', 'AI.', 'In', 'this', 'class', ',', 'you', 'will', 'learn', 'about', 'the', 'most', 'effective', 'machine', 'learning', 'techniques', ',', 'and', 'gain', 'practice', 'implementing', 'them', 'and', 'getting', 'them', 'to', 'work', 'for', 'yourself.', 'More', 'importantly', ',', 'you', "'ll", 'learn', 'about', 'not', 'only', 'the', 'theoretical', 'underpinnings', 'of', 'learning', ',', 'but', 'also', 'gain', 'the', 'practical', 'know-how', 'needed', 'to', 'quickly', 'and', 'powerfully', 'apply', 'these', 'techniques', 'to', 'new', 'problems.', 'Finally', ',', 'you', "'ll", 'learn', 'about', 'some', 'of', 'Silicon', 'Valley', "'s", 'best', 'practices', 'in', 'innovation', 'as', 'it', 'pertains', 'to', 'machine', 'learning', 'and', 'AI', '.']
>>> len(tokens)
161
>>> tagged_tokens = pos_tag(tokens)
>>> tagged_tokens
[('Machine', 'NN'), ('learning', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('science', 'NN'), ('of', 'IN'), ('getting', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('act', 'VB'), ('without', 'IN'), ('being', 'VBG'), ('explicitly', 'RB'), ('programmed.', 'NNP'), ('In', 'NNP'), ('the', 'DT'), ('past', 'JJ'), ('decade', 'NN'), (',', ','), ('machine', 'NN'), ('learning', 'NN'), ('has', 'VBZ'), ('given', 'VBN'), ('us', 'PRP'), ('self-driving', 'JJ'), ('cars', 'NNS'), (',', ','), ('practical', 'JJ'), ('speech', 'NN'), ('recognition', 'NN'), (',', ','), ('effective', 'JJ'), ('web', 'NN'), ('search', 'NN'), (',', ','), ('and', 'CC'), ('a', 'DT'), ('vastly', 'RB'), ('improved', 'VBN'), ('understanding', 'NN'), ('of', 'IN'), ('the', 'DT'), ('human', 'JJ'), ('genome.', 'NNP'), ('Machine', 'NNP'), ('learning', 'NN'), ('is', 'VBZ'), ('so', 'RB'), ('pervasive', 'JJ'), ('today', 'NN'), ('that', 'WDT'), ('you', 'PRP'), ('probably', 'RB'), ('use', 'VBP'), ('it', 'PRP'), ('dozens', 'VBZ'), ('of', 'IN'), ('times', 'NNS'), ('a', 'DT'), ('day', 'NN'), ('without', 'IN'), ('knowing', 'NN'), ('it.', 'NNP'), ('Many', 'NNP'), ('researchers', 'NNS'), ('also', 'RB'), ('think', 'VBP'), ('it', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('best', 'JJS'), ('way', 'NN'), ('to', 'TO'), ('make', 'VB'), ('progress', 'NN'), ('towards', 'NNS'), ('human-level', 'JJ'), ('AI.', 'NNP'), ('In', 'NNP'), ('this', 'DT'), ('class', 'NN'), (',', ','), ('you', 'PRP'), ('will', 'MD'), ('learn', 'VB'), ('about', 'IN'), ('the', 'DT'), ('most', 'RBS'), ('effective', 'JJ'), ('machine', 'NN'), ('learning', 'NN'), ('techniques', 'NNS'), (',', ','), ('and', 'CC'), ('gain', 'NN'), ('practice', 'NN'), ('implementing', 'VBG'), ('them', 'PRP'), ('and', 'CC'), ('getting', 'VBG'), ('them', 'PRP'), ('to', 'TO'), ('work', 'VB'), ('for', 'IN'), ('yourself.', 'NNP'), ('More', 'NNP'), ('importantly', 'RB'), (',', ','), ('you', 'PRP'), ("'ll", 'MD'), ('learn', 'VB'), ('about', 'IN'), ('not', 'RB'), ('only', 'RB'), ('the', 'DT'), ('theoretical', 'JJ'), ('underpinnings', 'NNS'), ('of', 'IN'), ('learning', 'VBG'), (',', ','), ('but', 'CC'), ('also', 'RB'), ('gain', 'VBP'), ('the', 'DT'), ('practical', 'JJ'), ('know-how', 'NN'), ('needed', 'VBN'), ('to', 'TO'), ('quickly', 'RB'), ('and', 'CC'), ('powerfully', 'RB'), ('apply', 'RB'), ('these', 'DT'), ('techniques', 'NNS'), ('to', 'TO'), ('new', 'JJ'), ('problems.', 'NNP'), ('Finally', 'NNP'), (',', ','), ('you', 'PRP'), ("'ll", 'MD'), ('learn', 'VB'), ('about', 'IN'), ('some', 'DT'), ('of', 'IN'), ('Silicon', 'NNP'), ('Valley', 'NNP'), ("'s", 'POS'), ('best', 'JJS'), ('practices', 'NNS'), ('in', 'IN'), ('innovation', 'NN'), ('as', 'IN'), ('it', 'PRP'), ('pertains', 'VBZ'), ('to', 'TO'), ('machine', 'NN'), ('learning', 'NN'), ('and', 'CC'), ('AI', 'NNP'), ('.', '.')]
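
The tagged tokens can feed directly into further NLTK steps. For example, nltk.ne_chunk() runs a named-entity chunker over POS-tagged tokens; it needs the maxent_ne_chunker and words packages from NLTK Data, and returns a Tree in which named entities appear as labeled subtrees (PERSON, ORGANIZATION, GPE, and so on):

>>> from nltk import ne_chunk
>>> tree = ne_chunk(tagged_tokens)   # a Tree; named entities are labeled subtrees
>>> print(tree)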

There are a lot of text mining and text analysis tasks NLTK can do; we will introduce them in the following articles.

Posted by TextMiner
