Dive Into NLTK, Part V: Using Stanford Text Analysis Tools in Python

This is the fifth article in the series "Dive Into NLTK". Here is an index of all the articles in the series published to date:

Part I: Getting Started with NLTK
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python (this article)
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification

We have already discussed "How to Use Stanford Named Entity Recognizer (NER) in Python NLTK and Other Programming Languages", and recently we also tested the Stanford POS Tagger and Stanford Parser through NLTK in Python. If you want to use these Stanford text analysis tools from other languages, you can use our Text Analysis API, which also integrates the Stanford NLP tools. You can try it in our online text analysis demo: Text Analysis Online. In this article we show how to use these Java NLP tools from Python NLTK. You can also follow the official NLTK guide, Installing Third Party Software: How NLTK Discovers Third Party Software. Our tests were run on an Ubuntu 12.04 VPS.

First, set up a Java environment for these Java text analysis tools before using them in NLTK:

sudo apt-get install default-jre
This will install the Java Runtime Environment (JRE). If you instead need the Java Development Kit (JDK), which is usually needed to compile Java applications (for example with Apache Ant, Apache Maven, Eclipse or IntelliJ IDEA), execute the following command:

sudo apt-get install default-jdk
That is everything that is needed to install Java.
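
Because NLTK drives the Stanford tools by launching a `java` process, it can help to confirm from Python that the executable is reachable before going further. The following is a minimal sketch using only the standard library and assuming Python 3.5 or later; the helper names `find_java` and `java_version` are ours and are not part of NLTK.

```python
import shutil
import subprocess


def find_java(executable="java"):
    """Return the full path of the Java launcher, or None if it is not on PATH."""
    return shutil.which(executable)


def java_version(java_path):
    """Return the text printed by `java -version` (Java writes it to stderr)."""
    result = subprocess.run(
        [java_path, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout so we capture it
        universal_newlines=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    path = find_java()
    if path is None:
        print("java not found on PATH; install default-jre first")
    else:
        print(path)
        print(java_version(path))
```

If this reports that `java` is missing, the NLTK wrappers below are likely to fail when they try to start the tagger or parser.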

NLTK now provides three interfaces, for the Stanford Log-linear Part-Of-Speech Tagger, the Stanford Named Entity Recognizer (NER) and the Stanford Parser. The following sections show how to use each of them in NLTK.

1) Stanford POS Tagger

The following is from the official Stanford POS Tagger website:

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like ‘noun-plural’.

Part-of-speech name abbreviations: The English taggers use the Penn Treebank tag set. Here are some links to documentation of the Penn Treebank English POS tag set: 1993 Computational Linguistics article in PDF, AMALGAM page, Aoife Cahill’s list. See the included README-Models.txt in the models directory for more information about the tagsets for the other languages.

We assume you have installed a recent version of NLTK. Here we use ipython, and the NLTK version is 3.0.0b1:

In [1]: import nltk

In [2]: nltk.__version__
Out[2]: '3.0.0b1'

The official Stanford POS Tagger site provides two versions of the POS Tagger:

Download basic English Stanford Tagger version 3.4.1 [21 MB]

Download full Stanford Tagger version 3.4.1 [124 MB]

We suggest downloading the full version, which contains many models.

After downloading the full version, unzip it and copy the relevant files into a test directory:

mkdir postagger
cd postagger/
cp ../stanford-postagger-full-2014-08-27/stanford-postagger.jar .
cp -r ../stanford-postagger-full-2014-08-27/models .
ipython --pylab
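
The tagger constructor used below takes a model path and a jar path, and a wrong path tends to surface only later as a Java error. As a hedged sketch, a small pre-flight check can fail earlier with a clearer message; `check_stanford_files` is a hypothetical helper written for this article, not an NLTK function.

```python
import os


def check_stanford_files(model_path, jar_path):
    """Raise IOError early if either file is missing, before Java is launched.

    Returns the absolute paths of the model and the jar, in that order.
    """
    missing = [p for p in (model_path, jar_path) if not os.path.isfile(p)]
    if missing:
        raise IOError("missing Stanford files: " + ", ".join(missing))
    return os.path.abspath(model_path), os.path.abspath(jar_path)
```

With the directory layout created above, it could be called as `check_stanford_files('models/english-bidirectional-distsim.tagger', 'stanford-postagger.jar')` and the two returned paths passed on to the tagger.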

First, test the English POS tagging result:

In [1]: from nltk.tag.stanford import POSTagger

In [2]: english_postagger = POSTagger('models/english-bidirectional-distsim.tagger', 'stanford-postagger.jar')

In [3]: english_postagger.tag('this is stanford postagger in nltk for python users'.split())
Out[3]:
[(u'this', u'DT'),
(u'is', u'VBZ'),
(u'stanford', u'JJ'),
(u'postagger', u'NN'),
(u'in', u'IN'),
(u'nltk', u'NN'),
(u'for', u'IN'),
(u'python', u'NN'),
(u'users', u'NNS')]
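
Later NLTK releases renamed this class to `StanfordPOSTagger` (a reader reports the same in the comments), so `from nltk.tag.stanford import POSTagger` can fail on a newer installation. One hedged way to keep the session above working under either name is an import fallback. The alias `POSTagger` and the `None` sentinel are choices made for this sketch, which assumes Python 3 for the `print` calls.

```python
try:
    # Newer NLTK releases expose the tagger under this name.
    from nltk.tag.stanford import StanfordPOSTagger as POSTagger
except ImportError:
    try:
        # The name used in this article (NLTK 3.0.0b1).
        from nltk.tag.stanford import POSTagger
    except ImportError:
        # NLTK is missing, or neither class name exists in this version.
        POSTagger = None

if POSTagger is None:
    print("No Stanford POS tagger class found in this NLTK installation")
else:
    print("Using", POSTagger.__name__)
```

The same pattern should apply to `NERTagger`, which a reader in the comments uses as `StanfordNERTagger`.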

Then test the Chinese POS Tagging result:

In [4]: chinese_postagger = POSTagger('models/chinese-distsim.tagger', 'stanford-postagger.jar', encoding='utf-8')

In [5]: chinese_postagger.tag('这 是 在 Python 环境 中 使用 斯坦福 词性 标注器'.split())
Out[5]:
[('', u'\u8fd9#PN'),
('', u'\u662f#VC'),
('', u'\u5728#P'),
('', u'Python#NN'),
('', u'\u73af\u5883#NN'),
('', u'\u4e2d#LC'),
('', u'\u4f7f\u7528#VV'),
('', u'\u65af\u5766\u798f#NR'),
('', u'\u8bcd\u6027#JJ'),
('', u'\u6807\u6ce8\u5668#NN')]
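
In the Chinese output above, each word and its tag come back fused in the second element as `word#TAG`, with an empty first element. Assuming that shape holds for your run, a small post-processing step recovers ordinary `(word, tag)` pairs. `split_fused_tags` is a hypothetical helper, not an NLTK function.

```python
def split_fused_tags(tagged, sep="#"):
    """Turn ('', 'word#TAG') pairs into (word, tag) pairs.

    Pairs that already carry a non-empty word are passed through unchanged.
    """
    fixed = []
    for word, tag in tagged:
        if word == "" and sep in tag:
            # Split on the last separator so a '#' inside the word survives.
            word, tag = tag.rsplit(sep, 1)
        fixed.append((word, tag))
    return fixed


sample = [("", "这#PN"), ("", "是#VC"), ("", "Python#NN")]
print(split_fused_tags(sample))
```

Splitting on the last `#` rather than the first is a deliberate choice here, since a token may itself contain the separator while a tag name is unlikely to.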

The models directory contains many POS tagger models. You can find the details in README-Models.txt:

English taggers
---------------------------
wsj-0-18-bidirectional-distsim.tagger
Trained on WSJ sections 0-18 using a bidirectional architecture and
including word shape and distributional similarity features.
Penn Treebank tagset.
Performance:
97.28% correct on WSJ 19-21
(90.46% correct on unknown words)

wsj-0-18-left3words.tagger
Trained on WSJ sections 0-18 using the left3words architecture and
includes word shape features. Penn tagset.
Performance:
96.97% correct on WSJ 19-21
(88.85% correct on unknown words)

wsj-0-18-left3words-distsim.tagger
Trained on WSJ sections 0-18 using the left3words architecture and
includes word shape and distributional similarity features. Penn tagset.
Performance:
97.01% correct on WSJ 19-21
(89.81% correct on unknown words)

english-left3words-distsim.tagger
Trained on WSJ sections 0-18 and extra parser training data using the
left3words architecture and includes word shape and distributional
similarity features. Penn tagset.

english-bidirectional-distsim.tagger
Trained on WSJ sections 0-18 using a bidirectional architecture and
including word shape and distributional similarity features.
Penn Treebank tagset.

wsj-0-18-caseless-left3words-distsim.tagger
Trained on WSJ sections 0-18 left3words architecture and includes word
shape and distributional similarity features. Penn tagset. Ignores case.

english-caseless-left3words-distsim.tagger
Trained on WSJ sections 0-18 and extra parser training data using the
left3words architecture and includes word shape and distributional
similarity features. Penn tagset. Ignores case.

Chinese tagger
---------------------------
chinese-nodistsim.tagger
Trained on a combination of CTB7 texts from Chinese and Hong Kong
sources.
LDC Chinese Treebank POS tag set.
Performance:
93.46% on a combination of Chinese and Hong Kong texts
(79.40% on unknown words)

chinese-distsim.tagger
Trained on a combination of CTB7 texts from Chinese and Hong Kong
sources with distributional similarity clusters.
LDC Chinese Treebank POS tag set.
Performance:
93.99% on a combination of Chinese and Hong Kong texts
(84.60% on unknown words)

Arabic tagger
---------------------------
arabic.tagger
Trained on the *entire* ATB p1-3.
When trained on the train part of the ATB p1-3 split done for the 2005
JHU Summer Workshop (Diab split), using (augmented) Bies tags, it gets
the following performance:
96.26% on test portion according to Diab split
(80.14% on unknown words)

French tagger
---------------------------
french.tagger
Trained on the French treebank.

German tagger
---------------------------
german-hgc.tagger
Trained on the first 80% of the Negra corpus, which uses the STTS tagset.
The Stuttgart-Tübingen Tagset (STTS) is a set of 54 tags for annotating
German text corpora with part-of-speech labels, which was jointly
developed by the Institut für maschinelle Sprachverarbeitung of the
University of Stuttgart and the Seminar für Sprachwissenschaft of the
University of Tübingen. See:
http://www.ims.uni-stuttgart.de/projekte/CQPDemos/Bundestag/help-tagset.html
This model uses features from the distributional similarity clusters
built over the HGC.
Performance:
96.90% on the first half of the remaining 20% of the Negra corpus (dev set)
(90.33% on unknown words)

german-dewac.tagger
This model uses features from the distributional similarity clusters
built from the deWac web corpus.

german-fast.tagger
Lacks distributional similarity features, but is several times faster
than the other alternatives.
Performance:
96.61% overall / 86.72% unknown.

2) Stanford Named Entity Recognizer (NER)

The following introduction is from the official Stanford NER website:

Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION), and we also make available on this page various other models for different languages and circumstances, including models trained on just the CoNLL 2003 English training data. The distributional similarity features in some models improve performance but the models require considerably more memory.

The website provides a download version of Stanford NER:

Download Stanford Named Entity Recognizer version 3.4.1

It contains stanford-ner.jar and the models in the classifiers directory. As with the Stanford POS Tagger, you can use it in NLTK like this:

In [1]: from nltk.tag.stanford import NERTagger

In [2]: english_nertagger = NERTagger('classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner.jar')

In [3]: english_nertagger.tag('Rami Eid is studying at Stony Brook University in NY'.split())
Out[3]:
[(u'Rami', u'PERSON'),
(u'Eid', u'PERSON'),
(u'is', u'O'),
(u'studying', u'O'),
(u'at', u'O'),
(u'Stony', u'ORGANIZATION'),
(u'Brook', u'ORGANIZATION'),
(u'University', u'ORGANIZATION'),
(u'in', u'O'),
(u'NY', u'O')]
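
The tagger labels one token at a time, so a multi-word name such as "Stony Brook University" arrives as three separate pairs. A minimal sketch for merging adjacent tokens that share a label is shown below. `group_entities` is our own helper name, and the input is the output above with the `u` prefixes dropped.

```python
from itertools import groupby


def group_entities(tagged, outside="O"):
    """Merge runs of adjacent tokens that share a label into (phrase, label) pairs."""
    entities = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label != outside:
            entities.append((" ".join(word for word, _ in run), label))
    return entities


tagged = [("Rami", "PERSON"), ("Eid", "PERSON"), ("is", "O"), ("studying", "O"),
          ("at", "O"), ("Stony", "ORGANIZATION"), ("Brook", "ORGANIZATION"),
          ("University", "ORGANIZATION"), ("in", "O"), ("NY", "O")]
print(group_entities(tagged))
# [('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION')]
```

One limitation of this approach: two different entities of the same type that sit directly next to each other are merged into one phrase, because the plain labels carry no begin/inside markers.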

The Models Included with Stanford NER are a 4 class model trained for CoNLL, a 7 class model trained for MUC, and a 3 class model trained on both data sets for the intersection of those class sets.

3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Time, Location, Organization, Person, Money, Percent, Date

-rw-r--r--@ 1 textminer staff 24732086 9 7 11:43 english.all.3class.distsim.crf.ser.gz
-rw-r--r--@ 1 textminer staff 1274 9 7 11:43 english.all.3class.distsim.prop
-rw-r--r--@ 1 textminer staff 18350357 9 7 11:43 english.conll.4class.distsim.crf.ser.gz
-rw-r--r--@ 1 textminer staff 1421 9 7 11:43 english.conll.4class.distsim.prop
-rw-r--r--@ 1 textminer staff 17824631 9 7 11:43 english.muc.7class.distsim.crf.ser.gz
-rw-r--r--@ 1 textminer staff 1087 9 7 11:43 english.muc.7class.distsim.prop
-rw-r--r--@ 1 textminer staff 18954462 9 7 11:43 english.nowiki.3class.distsim.crf.ser.gz
-rw-r--r--@ 1 textminer staff 1218 9 7 11:43 english.nowiki.3class.distsim.prop
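
Which classifier file to pass to the tagger depends on the labels you need. As an illustration only, the class lists and file names above can be put in a small lookup that returns the first listed model covering a requested set of labels. The upper-case label spellings are an assumption based on the `PERSON` and `ORGANIZATION` tags in the output shown earlier, and `pick_classifier` is a hypothetical helper.

```python
# Label sets as listed above; file names as they appear in the classifiers directory.
MODELS = [
    ("english.all.3class.distsim.crf.ser.gz",
     {"LOCATION", "PERSON", "ORGANIZATION"}),
    ("english.conll.4class.distsim.crf.ser.gz",
     {"LOCATION", "PERSON", "ORGANIZATION", "MISC"}),
    ("english.muc.7class.distsim.crf.ser.gz",
     {"TIME", "LOCATION", "ORGANIZATION", "PERSON", "MONEY", "PERCENT", "DATE"}),
]


def pick_classifier(wanted):
    """Return the first listed model whose label set covers every wanted label."""
    wanted = {label.upper() for label in wanted}
    for filename, labels in MODELS:
        if wanted <= labels:
            return filename
    raise ValueError("no single model covers: " + ", ".join(sorted(wanted)))


print(pick_classifier(["person", "date"]))
# english.muc.7class.distsim.crf.ser.gz
```

No single model in the list covers both `MISC` and the MUC classes such as `DATE`, so a request for both raises `ValueError` in this sketch.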

You can test the 7 class Stanford NER on our Text Analysis Online Demo: NLTK Stanford Named Entity Recognizer for 7Class

3) Stanford Parser

From the official Stanford Parser introduction:

A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as “phrases”) and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the 1990s.

You should download the Stanford Parser first (Download Stanford Parser version 3.4.1), then use it from Python through NLTK:

In [1]: from nltk.parse.stanford import StanfordParser

In [3]: english_parser = StanfordParser('stanford-parser.jar', 'stanford-parser-3.4-models.jar')

In [4]: english_parser.raw_parse_sents(("this is the english parser test", "the parser is from stanford parser"))
Out[4]:
[[u'this/DT is/VBZ the/DT english/JJ parser/NN test/NN'],
[u'(ROOT',
u' (S',
u' (NP (DT this))',
u' (VP (VBZ is)',
u' (NP (DT the) (JJ english) (NN parser) (NN test)))))'],
[u'nsubj(test-6, this-1)',
u'cop(test-6, is-2)',
u'det(test-6, the-3)',
u'amod(test-6, english-4)',
u'nn(test-6, parser-5)',
u'root(ROOT-0, test-6)'],
[u'the/DT parser/NN is/VBZ from/IN stanford/JJ parser/NN'],
[u'(ROOT',
u' (S',
u' (NP (DT the) (NN parser))',
u' (VP (VBZ is)',
u' (PP (IN from)',
u' (NP (JJ stanford) (NN parser))))))'],
[u'det(parser-2, the-1)',
u'nsubj(is-3, parser-2)',
u'root(ROOT-0, is-3)',
u'amod(parser-6, stanford-5)',
u'prep_from(is-3, parser-6)']]
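
The typed dependencies in this output are plain strings of the form `relation(governor-index, dependent-index)`. If you want them as structured data, a regular expression is one simple option. This is a sketch that covers the strings shown above; it has not been checked against every form the parser can emit (for example, indices carrying a trailing prime mark), and `parse_dependency` is our own helper name.

```python
import re

# relation(governor-index, dependent-index), e.g. nsubj(test-6, this-1)
DEP_RE = re.compile(r"^([^(]+)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dependency(line):
    """Split one typed-dependency string into (relation, governor, dependent)."""
    match = DEP_RE.match(line)
    if match is None:
        raise ValueError("not a typed dependency: %r" % line)
    rel, gov, gov_i, dep, dep_i = match.groups()
    return rel, (gov, int(gov_i)), (dep, int(dep_i))


print(parse_dependency("nsubj(test-6, this-1)"))
# ('nsubj', ('test', 6), ('this', 1))
```

The word indices are 1-based positions in the sentence, with `ROOT-0` standing for the artificial root node.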

Note that this output differs from what the default NLTK nltk/parse/stanford.py produces: we modified the code so that it outputs the tagging, parse, and dependency results:

#'-outputFormat', 'penn', # original
'-outputFormat', 'wordsAndTags,penn,typedDependencies', # modified

Now you can use Stanford NLP tools such as the POS Tagger, NER, and Parser from Python through NLTK. Enjoy!

Posted by TextMiner


Comments


  1. Have you figured out how to use the caseless version of the entity recognition? I downloaded http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip and placed it in the site-packages folder of python. Then I downloaded http://nlp.stanford.edu/software/stanford-corenlp-caseless-2015-04-20-models.jar and placed it in the folder. Then I ran this code in NLTK

    from nltk.tag.stanford import NERTagger
    english_nertagger = NERTagger('/home/anaconda/lib/python2.7/site-packages/stanford-ner-2015-04-20/classifiers/english.conll.4class.distsim.crf.ser.gz', '/home/anaconda/lib/python2.7/site-packages/stanford-ner-2015-04-20/stanford-corenlp-caseless-2015-04-20-models.jar')

    But when I run this:

    english_nertagger.tag('Rami Eid is studying at stony brook university in NY'.split())

    I get an error:
    Error: Could not find or load main class edu.stanford.nlp.ie.crf.CRFClassifier

    Any help if you have experience is appreciated!

    Thanks!

    • I had some luck with this:
      stanPath = 'C:\\Users\\x\\stanford-ner-2014-08-27'
      StanfordNERTagger(os.path.join(stanPath,'classifiers\\english.all.3class.distsim.crf.ser.gz'),
      path_to_jar = os.path.join(stanPath,'stanford-ner.jar'))

      after unzipping to the stanPath folder

  2. there seems to have been some renaming over the time.

    from nltk.tag.stanford import POSTagger
    may not work as it happened with me.

    Instead Try

    from nltk.tag.stanford import StanfordPOSTagger
