Dive Into NLTK, Part VII: A Preliminary Study on Text Classification

This is the seventh article in the series “Dive Into NLTK”. Here is an index of all the articles in the series that have been published to date:

Part I: Getting Started with NLTK
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
Part IX: From Text Classification to Sentiment Analysis
Part X: Play With Word2Vec Models based on NLTK Corpus

Text classification is a very useful technique in text analysis: it can be used for spam filtering, language identification, sentiment analysis, genre classification, and so on. According to Wikipedia, text classification is also referred to as document classification:

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done “manually” (or “intellectually”) or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is used mainly in information science and computer science. The problems are overlapping, however, and there is therefore also interdisciplinary research on document classification.

In this article we focus on automatic text (document) classification. If you are not familiar with this technique, we strongly recommend that you first take the Stanford NLP course on Coursera, where the week 3 lectures cover text classification and the Naive Bayes model, and the week 4 lectures cover discriminative models with Maximum Entropy classifiers: Natural Language Processing by Dan Jurafsky, Christopher Manning

Here we will dive directly into NLTK and look at everything in it related to text classification. You can find the NLTK classifier code in the nltk/nltk/classify directory. From the __init__.py file, we can learn something about the NLTK classifier interfaces:

# Natural Language Toolkit: Classifiers
#
# Copyright (C) 2001-2014 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
 
"""
Classes and interfaces for labeling tokens with category labels (or
"class labels").  Typically, labels are represented with strings
(such as ``'health'`` or ``'sports'``).  Classifiers can be used to
perform a wide range of classification tasks.  For example,
classifiers can be used...
 
- to classify documents by topic
- to classify ambiguous words by which word sense is intended
- to classify acoustic signals by which phoneme they represent
- to classify sentences by their author
 
Features
========
In order to decide which category label is appropriate for a given
token, classifiers examine one or more 'features' of the token.  These
"features" are typically chosen by hand, and indicate which aspects
of the token are relevant to the classification decision.  For
example, a document classifier might use a separate feature for each
word, recording how often that word occurred in the document.
 
Featuresets
===========
The features describing a token are encoded using a "featureset",
which is a dictionary that maps from "feature names" to "feature
values".  Feature names are unique strings that indicate what aspect
of the token is encoded by the feature.  Examples include
``'prevword'``, for a feature whose value is the previous word; and
``'contains-word(library)'`` for a feature that is true when a document
contains the word ``'library'``.  Feature values are typically
booleans, numbers, or strings, depending on which feature they
describe.
 
Featuresets are typically constructed using a "feature detector"
(also known as a "feature extractor").  A feature detector is a
function that takes a token (and sometimes information about its
context) as its input, and returns a featureset describing that token.
For example, the following feature detector converts a document
(stored as a list of words) to a featureset describing the set of
words included in the document:
 
    >>> # Define a feature detector function.
    >>> def document_features(document):
    ...     return dict([('contains-word(%s)' % w, True) for w in document])
 
Feature detectors are typically applied to each token before it is fed
to the classifier:
 
    >>> # Classify each Gutenberg document.
    >>> from nltk.corpus import gutenberg
    >>> for fileid in gutenberg.fileids(): # doctest: +SKIP
    ...     doc = gutenberg.words(fileid) # doctest: +SKIP
    ...     print fileid, classifier.classify(document_features(doc)) # doctest: +SKIP
 
The parameters that a feature detector expects will vary, depending on
the task and the needs of the feature detector.  For example, a
feature detector for word sense disambiguation (WSD) might take as its
input a sentence, and the index of a word that should be classified,
and return a featureset for that word.  The following feature detector
for WSD includes features describing the left and right contexts of
the target word:
 
    >>> def wsd_features(sentence, index):
    ...     featureset = {}
    ...     for i in range(max(0, index-3), index):
    ...         featureset['left-context(%s)' % sentence[i]] = True
    ...     for i in range(index, max(index+3, len(sentence))):
    ...         featureset['right-context(%s)' % sentence[i]] = True
    ...     return featureset
 
Training Classifiers
====================
Most classifiers are built by training them on a list of hand-labeled
examples, known as the "training set".  Training sets are represented
as lists of ``(featuredict, label)`` tuples.
"""
 
from nltk.classify.api import ClassifierI, MultiClassifierI
from nltk.classify.megam import config_megam, call_megam
from nltk.classify.weka import WekaClassifier, config_weka
from nltk.classify.naivebayes import NaiveBayesClassifier
from nltk.classify.positivenaivebayes import PositiveNaiveBayesClassifier
from nltk.classify.decisiontree import DecisionTreeClassifier
from nltk.classify.rte_classify import rte_classifier, rte_features, RTEFeatureExtractor
from nltk.classify.util import accuracy, apply_features, log_likelihood
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.maxent import (MaxentClassifier, BinaryMaxentFeatureEncoding,
                                  TypedMaxentFeatureEncoding,
                                  ConditionalExponentialClassifier)
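
The feature detectors in the docstring above are written for Python 2. As a rough sketch, here is how they might look in Python 3. Note one deliberate change that is our assumption, not NLTK's code: the right-context loop is clipped with min() instead of max(), so that a target word near the end of a short sentence does not run past the end of the list. The window parameter is also our own addition.

```python
def document_features(document):
    """Map a document (a list of words) to word-presence features."""
    return {'contains-word(%s)' % w: True for w in document}


def wsd_features(sentence, index, window=3):
    """Describe the words around sentence[index] for word sense disambiguation."""
    featureset = {}
    # Up to `window` words to the left of the target word.
    for i in range(max(0, index - window), index):
        featureset['left-context(%s)' % sentence[i]] = True
    # Like the docstring, the right context starts at the target word itself.
    # Assumption: min() clips the range at the end of the sentence.
    for i in range(index, min(index + window, len(sentence))):
        featureset['right-context(%s)' % sentence[i]] = True
    return featureset


doc_fs = document_features(['the', 'library', 'opens', 'the', 'door'])
wsd_fs = wsd_features(['I', 'sat', 'on', 'the', 'bank', 'today'], 4)
print(sorted(doc_fs))
print(sorted(wsd_fs))
```

Because a featureset is a dictionary, the repeated word 'the' in the document produces only one feature.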

The most basic requirement for a supervised text classifier is labeled data, which can be used as training data. As an example, we use the NLTK Names corpus to train a gender identification classifier:

In [1]: from nltk.corpus import names
 
In [2]: import random
 
In [3]: names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])
 
In [4]: random.shuffle(names)
 
In [5]: len(names)
Out[5]: 7944
 
In [6]: names[0:10]
Out[6]: 
[(u'Marthe', 'female'),
 (u'Elana', 'female'),
 (u'Ernie', 'male'),
 (u'Colleen', 'female'),
 (u'Lynde', 'female'),
 (u'Barclay', 'male'),
 (u'Skippy', 'male'),
 (u'Marcelia', 'female'),
 (u'Charlena', 'female'),
 (u'Ronnica', 'female')]
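
Note that In [3] rebinds the name `names` from the corpus reader to the labeled list, so the corpus reader is no longer reachable under that name afterwards. A small sketch of the same labeling and shuffling step with a separate variable name might look like this. The two short lists are hypothetical stand-ins for names.words('male.txt') and names.words('female.txt'), and the seeded random.Random is our assumption, used only to make the shuffle repeatable.

```python
import random

# Hypothetical stand-ins for the two word lists of the Names corpus.
male_names = ['Ernie', 'Barclay', 'Skippy']
female_names = ['Marthe', 'Elana', 'Colleen', 'Lynde']

# Pair every name with its label, keeping the source lists untouched.
labeled_names = ([(name, 'male') for name in male_names] +
                 [(name, 'female') for name in female_names])

# A dedicated, seeded generator makes the shuffle reproducible between runs.
rng = random.Random(42)
rng.shuffle(labeled_names)

print(len(labeled_names))
```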

The most important thing for a text classifier is its features, which can be very flexible and are defined by the engineer. Here, we use just the final letter of a given name as the feature, and build a dictionary containing the relevant information about that name:

In [7]: def gender_features(word):
   ...:     return {'last_letter': word[-1]}
   ...: 
 
In [8]: gender_features('Gary')
Out[8]: {'last_letter': 'y'}

The dictionary returned by this function is called a feature set, and it maps feature names to their values. Feature sets are the core input of an NLTK classifier. We can use the feature extractor to build feature sets for the classifier and split them into a training set and a test set:

In [9]: featuresets = [(gender_features(n), g) for (n, g) in names]
 
In [10]: len(featuresets)
Out[10]: 7944
 
In [11]: featuresets[0:10]
Out[11]: 
[({'last_letter': u'e'}, 'female'),
 ({'last_letter': u'a'}, 'female'),
 ({'last_letter': u'e'}, 'male'),
 ({'last_letter': u'n'}, 'female'),
 ({'last_letter': u'e'}, 'female'),
 ({'last_letter': u'y'}, 'male'),
 ({'last_letter': u'y'}, 'male'),
 ({'last_letter': u'a'}, 'female'),
 ({'last_letter': u'a'}, 'female'),
 ({'last_letter': u'a'}, 'female')]
 
In [12]: train_set, test_set = featuresets[500:], featuresets[:500]
 
In [13]: len(train_set)
Out[13]: 7444
 
In [14]: len(test_set)
Out[14]: 500
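
The slicing in In [12] holds out the first 500 shuffled items for testing and trains on the rest. As a minimal sketch, the same split can be wrapped in a helper; split_featuresets is a hypothetical name of our own, not an NLTK function, and the five-item list is toy data.

```python
def split_featuresets(featuresets, test_size):
    """Hold out the first `test_size` items for testing; train on the rest."""
    if not 0 < test_size < len(featuresets):
        raise ValueError('test_size must be between 1 and len(featuresets) - 1')
    return featuresets[test_size:], featuresets[:test_size]


# Toy data in the same (featureset, label) shape as the session above.
featuresets = [({'last_letter': letter}, label)
               for letter, label in [('e', 'female'), ('a', 'female'),
                                     ('e', 'male'), ('n', 'female'),
                                     ('y', 'male')]]

train_set, test_set = split_featuresets(featuresets, test_size=2)
print(len(train_set), len(test_set))
```

The split is only meaningful if the list was shuffled beforehand, as it is in In [4]; otherwise the test set would contain only the names that happen to come first.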

A classifier also needs a learning algorithm. Here we will show how to train a Naive Bayes classifier and a Maximum Entropy (Maxent) classifier, where Naive Bayes is a generative model and Maxent is a discriminative model.

Here is how to train a Naive Bayes classifier for Gender Identification:

In [16]: from nltk import NaiveBayesClassifier
 
In [17]: nb_classifier = NaiveBayesClassifier.train(train_set)
 
In [18]: nb_classifier.classify(gender_features('Gary'))
Out[18]: 'female'
 
In [19]: nb_classifier.classify(gender_features('Grace'))
Out[19]: 'female'
 
In [20]: from nltk import classify
 
In [21]: classify.accuracy(nb_classifier, test_set)
Out[21]: 0.73
 
In [22]: nb_classifier.show_most_informative_features(5)
Most Informative Features
             last_letter = u'a'           female : male   =     38.4 : 1.0
             last_letter = u'k'             male : female =     33.4 : 1.0
             last_letter = u'f'             male : female =     16.7 : 1.0
             last_letter = u'p'             male : female =     11.9 : 1.0
             last_letter = u'v'             male : female =     10.6 : 1.0
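
To give a feel for what happens behind train() and show_most_informative_features(), here is a plain-Python sketch of a Naive Bayes classifier over (featureset, label) pairs. It is not NLTK's implementation: the function names are our own, the training data is a ten-item toy set, and the add-one smoothing is an assumption that may differ from the estimator NLTK uses. The ratio at the end is the same kind of likelihood ratio that the "Most Informative Features" table reports.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(train_set):
    """Count labels and (feature name, value) pairs per label."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # label -> Counter of (name, value)
    seen_values = defaultdict(set)       # feature name -> values seen in training
    for featureset, label in train_set:
        label_counts[label] += 1
        for name, value in featureset.items():
            value_counts[label][(name, value)] += 1
            seen_values[name].add(value)
    return label_counts, value_counts, seen_values


def feature_prob(model, label, name, value):
    """Estimate P(name=value | label) with add-one smoothing (an assumption)."""
    label_counts, value_counts, seen_values = model
    count = value_counts[label][(name, value)]
    return (count + 1) / (label_counts[label] + len(seen_values[name]))


def classify(model, featureset):
    """Pick the label that maximizes log P(label) + sum of log P(feature | label)."""
    label_counts = model[0]
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        for name, value in featureset.items():
            score += math.log(feature_prob(model, label, name, value))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, not taken from the Names corpus.
train_set = ([({'last_letter': 'a'}, 'female')] * 4 +
             [({'last_letter': 'e'}, 'female')] * 2 +
             [({'last_letter': 'k'}, 'male')] * 3 +
             [({'last_letter': 'e'}, 'male')] * 1)
model = train_naive_bayes(train_set)
print(classify(model, {'last_letter': 'a'}))

# Likelihood ratio for last_letter = 'a', female : male.
ratio = (feature_prob(model, 'female', 'last_letter', 'a') /
         feature_prob(model, 'male', 'last_letter', 'a'))
print(round(ratio, 1))
```

The model is called generative because it estimates P(label) and P(feature | label) from counts and combines them, rather than fitting P(label | features) directly.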

Here is how to train a Maximum Entropy Classifier for Gender Identification:

In [23]: from nltk import MaxentClassifier
 
In [24]: me_classifier = MaxentClassifier.train(train_set)
  ==> Training (100 iterations)
 
      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.369
             2          -0.37066        0.765
             3          -0.37029        0.765
             4          -0.37007        0.765
             5          -0.36992        0.765
             6          -0.36981        0.765
             7          -0.36973        0.765
             8          -0.36967        0.765
             9          -0.36962        0.765
            10          -0.36958        0.765
            11          -0.36955        0.765
            12          -0.36952        0.765
            13          -0.36949        0.765
            14          -0.36947        0.765
            15          -0.36945        0.765
            16          -0.36944        0.765
            17          -0.36942        0.765
            18          -0.36941        0.765
            ....
In [25]: me_classifier.classify(gender_features('Gary'))
Out[25]: 'female'
 
In [26]: me_classifier.classify(gender_features('Grace'))
Out[26]: 'female'
 
In [27]: classify.accuracy(me_classifier, test_set)
Out[27]: 0.728
 
In [28]: me_classifier.show_most_informative_features(5)
   6.644 last_letter==u'c' and label is 'male'
  -5.082 last_letter==u'a' and label is 'male'
  -3.565 last_letter==u'k' and label is 'female'
  -2.700 last_letter==u'f' and label is 'female'
  -2.248 last_letter==u'p' and label is 'female'
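
The numbers printed by show_most_informative_features are weights attached to (feature, label) pairs. As a rough illustration of how such weights turn into a decision, the sketch below scores each label by summing the weights of its active features, exponentiates, and normalizes. The five weights are copied from the output above; treating every other weight as zero is purely our assumption for the example, and we use the natural exponential as in the textbook formulation, which may not match the base NLTK uses internally.

```python
import math

# Weights copied from the show_most_informative_features output above.
# Assumption for this sketch: every (feature, label) pair not listed has weight 0.
weights = {
    (('last_letter', 'c'), 'male'): 6.644,
    (('last_letter', 'a'), 'male'): -5.082,
    (('last_letter', 'k'), 'female'): -3.565,
    (('last_letter', 'f'), 'female'): -2.700,
    (('last_letter', 'p'), 'female'): -2.248,
}
labels = ['female', 'male']


def label_probabilities(featureset):
    """Return P(label | featureset) under a conditional exponential model."""
    scores = {}
    for label in labels:
        total = sum(weights.get(((name, value), label), 0.0)
                    for name, value in featureset.items())
        scores[label] = math.exp(total)
    z = sum(scores.values())  # normalizing constant over all labels
    return {label: score / z for label, score in scores.items()}


probs = label_probabilities({'last_letter': 'a'})
print(max(probs, key=probs.get))
```

A large negative weight for (last_letter = 'a', 'male') pushes the probability of 'male' toward zero for names ending in 'a', which is why the model is called discriminative: it models P(label | features) directly.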

It may seem that Naive Bayes and the Maxent model give the same result on this gender identification task (accuracy 0.73 versus 0.728 with the single last-letter feature), but that does not hold in general. Choosing the right features and deciding how to encode them for the task has a big impact on performance. Here we define a more complex feature extractor function and train the models again:

In [29]: def gender_features2(name):
   ....:     features = {}
   ....:     features["firstletter"] = name[0].lower()
   ....:     features["lastletter"] = name[-1].lower()
   ....:     for letter in 'abcdefghijklmnopqrstuvwxyz':
   ....:         features["count(%s)" % letter] = name.lower().count(letter)
   ....:         features["has(%s)" % letter] = (letter in name.lower())
   ....:     return features
   ....: 
 
In [30]: gender_features2('Gary')
Out[30]: 
{'count(a)': 1,
 'count(b)': 0,
 'count(c)': 0,
 'count(d)': 0,
 'count(e)': 0,
 'count(f)': 0,
 'count(g)': 1,
 'count(h)': 0,
 'count(i)': 0,
 'count(j)': 0,
 'count(k)': 0,
 'count(l)': 0,
 'count(m)': 0,
 'count(n)': 0,
 'count(o)': 0,
 'count(p)': 0,
 'count(q)': 0,
 'count(r)': 1,
 'count(s)': 0,
 'count(t)': 0,
 'count(u)': 0,
 'count(v)': 0,
 'count(w)': 0,
 'count(x)': 0,
 'count(y)': 1,
 'count(z)': 0,
 'firstletter': 'g',
 'has(a)': True,
 'has(b)': False,
 'has(c)': False,
 'has(d)': False,
 'has(e)': False,
 'has(f)': False,
 'has(g)': True,
 'has(h)': False,
 'has(i)': False,
 'has(j)': False,
 'has(k)': False,
 'has(l)': False,
 'has(m)': False,
 'has(n)': False,
 'has(o)': False,
 'has(p)': False,
 'has(q)': False,
 'has(r)': True,
 'has(s)': False,
 'has(t)': False,
 'has(u)': False,
 'has(v)': False,
 'has(w)': False,
 'has(x)': False,
 'has(y)': True,
 'has(z)': False,
 'lastletter': 'y'}
 
In [32]: featuresets = [(gender_features2(n), g) for (n, g) in names]
 
In [34]: train_set, test_set = featuresets[500:], featuresets[:500]
 
In [35]: nb2_classifier = NaiveBayesClassifier.train(train_set)
 
In [36]: classify.accuracy(nb2_classifier, test_set)
Out[36]: 0.774
 
In [37]: me2_classifier = MaxentClassifier.train(train_set)
  ==> Training (100 iterations)
 
      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.369
             2          -0.61051        0.631
             3          -0.59637        0.631
             4          -0.58304        0.631
             5          -0.57050        0.637
             6          -0.55872        0.651
             7          -0.54766        0.672
             8          -0.53728        0.689
             ....
            93          -0.33632        0.805
            94          -0.33588        0.805
            95          -0.33545        0.805
            96          -0.33503        0.805
            97          -0.33462        0.805
            98          -0.33421        0.805
            99          -0.33382        0.805
         Final          -0.33343        0.805
 
In [38]: classify.accuracy(me2_classifier, test_set)
Out[38]: 0.78
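
The accuracy values in these sessions are the fraction of test items whose predicted label matches the gold label. A minimal sketch of that computation is shown here, assuming only that the classifier object has a classify(featureset) method. LastLetterRule is a hypothetical rule-based stand-in so that the example runs without a trained model, and the four gold items are toy data.

```python
def accuracy(classifier, gold):
    """Fraction of (featureset, label) pairs that the classifier labels correctly."""
    if not gold:
        raise ValueError('gold must contain at least one labeled featureset')
    correct = sum(1 for featureset, label in gold
                  if classifier.classify(featureset) == label)
    return correct / len(gold)


class LastLetterRule:
    """Hypothetical stand-in classifier: names ending in a vowel are 'female'."""

    def classify(self, featureset):
        return 'female' if featureset['lastletter'] in 'aeiou' else 'male'


gold = [
    ({'lastletter': 'a'}, 'female'),
    ({'lastletter': 'e'}, 'male'),
    ({'lastletter': 'y'}, 'male'),
    ({'lastletter': 'n'}, 'female'),
]
print(accuracy(LastLetterRule(), gold))
```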

It seems that more features make the Maximum Entropy model more accurate (0.78 on the test set, up from 0.728), but also slower to train. We can define a third feature extractor function and train the Naive Bayes and Maxent classifiers again:

In [49]: def gender_features3(name):
   ....:     features = {}
   ....:     features["fl"] = name[0].lower()
   ....:     features["ll"] = name[-1].lower()
   ....:     features["fw"] = name[:2].lower()
   ....:     features["lw"] = name[-2:].lower()
   ....:     return features
   ....: 
 
In [50]: gender_features3('Gary')
Out[50]: {'fl': 'g', 'fw': 'ga', 'll': 'y', 'lw': 'ry'}
 
In [51]: gender_features3('G')
Out[51]: {'fl': 'g', 'fw': 'g', 'll': 'g', 'lw': 'g'}
 
In [52]: gender_features3('Gary')
Out[52]: {'fl': 'g', 'fw': 'ga', 'll': 'y', 'lw': 'ry'}
 
In [53]: featuresets = [(gender_features3(n), g) for (n, g) in names]
 
In [54]: featuresets[0]
Out[54]: ({'fl': u'm', 'fw': u'ma', 'll': u'e', 'lw': u'he'}, 'female')
 
In [55]: len(featuresets)
Out[55]: 7944
 
In [56]: train_set, test_set = featuresets[500:], featuresets[:500]
 
In [57]: nb3_classifier = NaiveBayesClassifier.train(train_set)
 
In [59]: classify.accuracy(nb3_classifier, test_set)
Out[59]: 0.77
 
In [60]: me3_classifier = MaxentClassifier.train(train_set)
  ==> Training (100 iterations)
 
      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.369
             2          -0.40398        0.800
             3          -0.34739        0.821
             4          -0.32196        0.825
             5          -0.30766        0.827
             ......
            95          -0.25608        0.839
            96          -0.25605        0.839
            97          -0.25601        0.839
            98          -0.25598        0.839
            99          -0.25595        0.839
         Final          -0.25591        0.839
 
In [61]: classify.accuracy(me3_classifier, test_set)
Out[61]: 0.798

It seems that with proper feature extraction, the Maximum Entropy classifier can achieve better performance on the test set (0.798 versus 0.77 for Naive Bayes with the third extractor). In fact, selecting the right features is often the most important part of supervised text classification. You need to spend a lot of time choosing features, selecting a good learning algorithm, and tuning its parameters for your text classifier.
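
The sessions above repeat the same steps for each feature extractor: extract features, split, train, and measure accuracy. One way to make such comparisons less error-prone is a small harness like the sketch below. The names evaluate and MajorityClassifier are hypothetical and our own; the baseline classifier exists only so that the example runs without NLTK, and the six labeled names are toy data.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical baseline: always predicts the most common training label."""

    def __init__(self, label):
        self._label = label

    @classmethod
    def train(cls, train_set):
        counts = Counter(label for _, label in train_set)
        return cls(counts.most_common(1)[0][0])

    def classify(self, featureset):
        return self._label


def evaluate(extractor, train, labeled_items, test_size):
    """Extract features, hold out `test_size` items, train, return test accuracy."""
    featuresets = [(extractor(item), label) for item, label in labeled_items]
    train_set, test_set = featuresets[test_size:], featuresets[:test_size]
    classifier = train(train_set)
    correct = sum(1 for fs, label in test_set
                  if classifier.classify(fs) == label)
    return correct / len(test_set)


def last_letter(name):
    """Single-feature extractor, as in the first experiment of this article."""
    return {'last_letter': name[-1].lower()}


labeled = [('Ernie', 'male'), ('Marthe', 'female'), ('Elana', 'female'),
           ('Colleen', 'female'), ('Barclay', 'male'), ('Lynde', 'female')]
score = evaluate(last_letter, MajorityClassifier.train, labeled, test_size=2)
print(score)
```

With NLTK installed, the train argument could instead be NaiveBayesClassifier.train or MaxentClassifier.train, since both accept a list of (featureset, label) pairs. A majority-class baseline like the one here is also a useful reference point when judging whether a feature set helps at all.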

This concludes our preliminary study of text classification in NLTK. In the next article we will dive into text classification with a practical example.

Posted by TextMiner

