Dive Into NLTK, Part VII: A Preliminary Study on Text Classification
This is the seventh article in the series “Dive Into NLTK”. Here is an index of all the articles in the series that have been published to date:
Part I: Getting Started with NLTK
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
Part IX: From Text Classification to Sentiment Analysis
Part X: Play With Word2Vec Models based on NLTK Corpus
Text classification is a very useful technique in text analysis: it can be used for spam filtering, language identification, sentiment analysis, genre classification, and so on. According to Wikipedia, text classification is also referred to as document classification:
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done “manually” (or “intellectually”) or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is used mainly in information science and computer science. The problems are overlapping, however, and there is therefore also interdisciplinary research on document classification.
In this article we focus on automatic text (document) classification. If you are not familiar with this technique, we strongly recommend that you first take the Stanford NLP course on Coursera, Natural Language Processing by Dan Jurafsky and Christopher Manning, where the week 3 lectures introduce text classification and the Naive Bayes model, and the week 4 lectures cover discriminative models with Maximum Entropy classifiers.
Here we will dive directly into NLTK and cover everything related to text classification in NLTK. You can find the NLTK classifier code in the nltk/nltk/classify directory, and from its __init__.py file we can learn something about the NLTK classifier interfaces:
# Natural Language Toolkit: Classifiers
#
# Copyright (C) 2001-2014 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT

"""
Classes and interfaces for labeling tokens with category labels (or
"class labels").  Typically, labels are represented with strings
(such as ``'health'`` or ``'sports'``).

Classifiers can be used to perform a wide range of classification
tasks.  For example, classifiers can be used...

- to classify documents by topic
- to classify ambiguous words by which word sense is intended
- to classify acoustic signals by which phoneme they represent
- to classify sentences by their author

Features
========
In order to decide which category label is appropriate for a given
token, classifiers examine one or more 'features' of the token.  These
"features" are typically chosen by hand, and indicate which aspects
of the token are relevant to the classification decision.  For
example, a document classifier might use a separate feature for each
word, recording how often that word occurred in the document.

Featuresets
===========
The features describing a token are encoded using a "featureset",
which is a dictionary that maps from "feature names" to "feature
values".  Feature names are unique strings that indicate what aspect
of the token is encoded by the feature.  Examples include
``'prevword'``, for a feature whose value is the previous word; and
``'contains-word(library)'`` for a feature that is true when a
document contains the word ``'library'``.  Feature values are
typically booleans, numbers, or strings, depending on which feature
they describe.

Featuresets are typically constructed using a "feature detector"
(also known as a "feature extractor").  A feature detector is a
function that takes a token (and sometimes information about its
context) as its input, and returns a featureset describing that
token.  For example, the following feature detector converts a
document (stored as a list of words) to a featureset describing the
set of words included in the document:

    >>> # Define a feature detector function.
    >>> def document_features(document):
    ...     return dict([('contains-word(%s)' % w, True) for w in document])

Feature detectors are typically applied to each token before it is
fed to the classifier:

    >>> # Classify each Gutenberg document.
    >>> from nltk.corpus import gutenberg
    >>> for fileid in gutenberg.fileids(): # doctest: +SKIP
    ...     doc = gutenberg.words(fileid) # doctest: +SKIP
    ...     print fileid, classifier.classify(document_features(doc)) # doctest: +SKIP

The parameters that a feature detector expects will vary, depending
on the task and the needs of the feature detector.  For example, a
feature detector for word sense disambiguation (WSD) might take as its
input a sentence, and the index of a word that should be classified,
and return a featureset for that word.  The following feature
detector for WSD includes features describing the left and right
contexts of the target word:

    >>> def wsd_features(sentence, index):
    ...     featureset = {}
    ...     for i in range(max(0, index-3), index):
    ...         featureset['left-context(%s)' % sentence[i]] = True
    ...     for i in range(index, max(index+3, len(sentence))):
    ...         featureset['right-context(%s)' % sentence[i]] = True
    ...     return featureset

Training Classifiers
====================
Most classifiers are built by training them on a list of hand-labeled
examples, known as the "training set".  Training sets are represented
as lists of ``(featuredict, label)`` tuples.
"""

from nltk.classify.api import ClassifierI, MultiClassifierI
from nltk.classify.megam import config_megam, call_megam
from nltk.classify.weka import WekaClassifier, config_weka
from nltk.classify.naivebayes import NaiveBayesClassifier
from nltk.classify.positivenaivebayes import PositiveNaiveBayesClassifier
from nltk.classify.decisiontree import DecisionTreeClassifier
from nltk.classify.rte_classify import rte_classifier, rte_features, RTEFeatureExtractor
from nltk.classify.util import accuracy, apply_features, log_likelihood
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.maxent import (MaxentClassifier, BinaryMaxentFeatureEncoding,
                                  TypedMaxentFeatureEncoding,
                                  ConditionalExponentialClassifier)
The most basic requirement for a supervised text classifier is labeled category data, which serves as the training data. As an example, we use the NLTK Names corpus to train a gender identification classifier:
In [1]: from nltk.corpus import names

In [2]: import random

In [3]: names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

In [4]: random.shuffle(names)

In [5]: len(names)
Out[5]: 7944

In [6]: names[0:10]
Out[6]:
[(u'Marthe', 'female'),
 (u'Elana', 'female'),
 (u'Ernie', 'male'),
 (u'Colleen', 'female'),
 (u'Lynde', 'female'),
 (u'Barclay', 'male'),
 (u'Skippy', 'male'),
 (u'Marcelia', 'female'),
 (u'Charlena', 'female'),
 (u'Ronnica', 'female')]
The most important ingredient of a text classifier is its features, which can be very flexible and are defined by a human engineer. Here we simply use the final letter of a given name as the feature, and build a dictionary containing that information about the name:
In [7]: def gender_features(word):
   ...:     return {'last_letter': word[-1]}
   ...:

In [8]: gender_features('Gary')
Out[8]: {'last_letter': 'y'}
The dictionary returned by this function is called a feature set and maps feature names to their values. Feature sets are the core input to NLTK classifiers: we use the feature extractor to build feature sets and then split them into a training set and a test set:
In [9]: featuresets = [(gender_features(n), g) for (n, g) in names]

In [10]: len(featuresets)
Out[10]: 7944

In [11]: featuresets[0:10]
Out[11]:
[({'last_letter': u'e'}, 'female'),
 ({'last_letter': u'a'}, 'female'),
 ({'last_letter': u'e'}, 'male'),
 ({'last_letter': u'n'}, 'female'),
 ({'last_letter': u'e'}, 'female'),
 ({'last_letter': u'y'}, 'male'),
 ({'last_letter': u'y'}, 'male'),
 ({'last_letter': u'a'}, 'female'),
 ({'last_letter': u'a'}, 'female'),
 ({'last_letter': u'a'}, 'female')]

In [12]: train_set, test_set = featuresets[500:], featuresets[:500]

In [13]: len(train_set)
Out[13]: 7444

In [14]: len(test_set)
Out[14]: 500
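Note that building the full featuresets list keeps every feature set in memory at once. For this small corpus that is fine, but for a larger corpus NLTK also provides nltk.classify.apply_features, which builds the (featureset, label) pairs lazily. A small sketch, reusing the names list and gender_features from above:

from nltk.classify import apply_features

# apply_features returns a lazy, list-like object: feature sets are computed
# on demand instead of being stored in memory all at once.
lazy_train_set = apply_features(gender_features, names[500:])
lazy_test_set = apply_features(gender_features, names[:500])

These lazy sets can be passed to the classifier trainers exactly like the plain lists used below.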
A classifier also needs a learning algorithm. Here we show how to use Naive Bayes and the Maximum Entropy model to train a NaiveBayes and a Maxent classifier, where Naive Bayes is a generative model and Maxent is a discriminative model.
Here is how to train a Naive Bayes classifier for Gender Identification:
In [16]: from nltk import NaiveBayesClassifier

In [17]: nb_classifier = NaiveBayesClassifier.train(train_set)

In [18]: nb_classifier.classify(gender_features('Gary'))
Out[18]: 'female'

In [19]: nb_classifier.classify(gender_features('Grace'))
Out[19]: 'female'

In [20]: from nltk import classify

In [21]: classify.accuracy(nb_classifier, test_set)
Out[21]: 0.73

In [22]: nb_classifier.show_most_informative_features(5)
Most Informative Features
             last_letter = u'a'           female : male   =     38.4 : 1.0
             last_letter = u'k'             male : female =     33.4 : 1.0
             last_letter = u'f'             male : female =     16.7 : 1.0
             last_letter = u'p'             male : female =     11.9 : 1.0
             last_letter = u'v'             male : female =     10.6 : 1.0
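Besides a hard label from classify(), classifiers that implement prob_classify() (NaiveBayesClassifier does) can also return a probability distribution over the labels, which is handy for checking how confident the model is. A small sketch using the nb_classifier trained above (the exact probabilities depend on the random shuffle of the names):

# prob_classify returns a probability distribution over the labels.
dist = nb_classifier.prob_classify(gender_features('Gary'))
for label in dist.samples():
    print label, dist.prob(label)
print dist.max()   # the most likely label, i.e. what classify() would return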
Here is how to train a Maximum Entropy Classifier for Gender Identification:
In [23]: from nltk import MaxentClassifier

In [24]: me_classifier = MaxentClassifier.train(train_set)
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.369
             2          -0.37066        0.765
             3          -0.37029        0.765
             4          -0.37007        0.765
             5          -0.36992        0.765
             6          -0.36981        0.765
             7          -0.36973        0.765
             8          -0.36967        0.765
             9          -0.36962        0.765
            10          -0.36958        0.765
            11          -0.36955        0.765
            12          -0.36952        0.765
            13          -0.36949        0.765
            14          -0.36947        0.765
            15          -0.36945        0.765
            16          -0.36944        0.765
            17          -0.36942        0.765
            18          -0.36941        0.765
            ....

In [25]: me_classifier.classify(gender_features('Gary'))
Out[25]: 'female'

In [26]: me_classifier.classify(gender_features('Grace'))
Out[26]: 'female'

In [27]: classify.accuracy(me_classifier, test_set)
Out[27]: 0.728

In [28]: me_classifier.show_most_informative_features(5)
   6.644 last_letter==u'c' and label is 'male'
  -5.082 last_letter==u'a' and label is 'male'
  -3.565 last_letter==u'k' and label is 'female'
  -2.700 last_letter==u'f' and label is 'female'
  -2.248 last_letter==u'p' and label is 'female'
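By default MaxentClassifier.train runs up to 100 iterations of iterative scaling (IIS, or GIS if requested), which is why the training log above is so long. The trainer accepts cutoff keywords such as max_iter and a trace level, so training time can be capped when you are just experimenting. A small sketch (the accuracy you get with fewer iterations will of course differ):

# Train a Maxent model with the GIS algorithm, at most 10 iterations,
# and no per-iteration trace output.
quick_me_classifier = MaxentClassifier.train(train_set, algorithm='gis',
                                             max_iter=10, trace=0)
print classify.accuracy(quick_me_classifier, test_set)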
It may look as if Naive Bayes and the Maxent model give essentially the same results on this gender task, but that does not hold in general. Choosing the right features, and deciding how to encode them for the task, can have a big impact on performance. Here we define a more complex feature extractor function and train the models again:
In [29]: def gender_features2(name):
   ....:     features = {}
   ....:     features["firstletter"] = name[0].lower()
   ....:     features["lastletter"] = name[-1].lower()
   ....:     for letter in 'abcdefghijklmnopqrstuvwxyz':
   ....:         features["count(%s)" % letter] = name.lower().count(letter)
   ....:         features["has(%s)" % letter] = (letter in name.lower())
   ....:     return features
   ....:

In [30]: gender_features2('Gary')
Out[30]:
{'count(a)': 1, 'count(b)': 0, 'count(c)': 0, 'count(d)': 0, 'count(e)': 0,
 'count(f)': 0, 'count(g)': 1, 'count(h)': 0, 'count(i)': 0, 'count(j)': 0,
 'count(k)': 0, 'count(l)': 0, 'count(m)': 0, 'count(n)': 0, 'count(o)': 0,
 'count(p)': 0, 'count(q)': 0, 'count(r)': 1, 'count(s)': 0, 'count(t)': 0,
 'count(u)': 0, 'count(v)': 0, 'count(w)': 0, 'count(x)': 0, 'count(y)': 1,
 'count(z)': 0, 'firstletter': 'g',
 'has(a)': True, 'has(b)': False, 'has(c)': False, 'has(d)': False,
 'has(e)': False, 'has(f)': False, 'has(g)': True, 'has(h)': False,
 'has(i)': False, 'has(j)': False, 'has(k)': False, 'has(l)': False,
 'has(m)': False, 'has(n)': False, 'has(o)': False, 'has(p)': False,
 'has(q)': False, 'has(r)': True, 'has(s)': False, 'has(t)': False,
 'has(u)': False, 'has(v)': False, 'has(w)': False, 'has(x)': False,
 'has(y)': True, 'has(z)': False, 'lastletter': 'y'}

In [32]: featuresets = [(gender_features2(n), g) for (n, g) in names]

In [34]: train_set, test_set = featuresets[500:], featuresets[:500]

In [35]: nb2_classifier = NaiveBayesClassifier.train(train_set)

In [36]: classify.accuracy(nb2_classifier, test_set)
Out[36]: 0.774

In [37]: me2_classifier = MaxentClassifier.train(train_set)
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.369
             2          -0.61051        0.631
             3          -0.59637        0.631
             4          -0.58304        0.631
             5          -0.57050        0.637
             6          -0.55872        0.651
             7          -0.54766        0.672
             8          -0.53728        0.689
            ....
            93          -0.33632        0.805
            94          -0.33588        0.805
            95          -0.33545        0.805
            96          -0.33503        0.805
            97          -0.33462        0.805
            98          -0.33421        0.805
            99          -0.33382        0.805
         Final          -0.33343        0.805

In [38]: classify.accuracy(me2_classifier, test_set)
Out[38]: 0.78
It seems that more features make the Maximum Entropy model more accurate, but also much slower to train. We can define a third feature extractor function and train the Naive Bayes and Maxent classifiers again:
In [49]: def gender_features3(name):
   ....:     features = {}
   ....:     features["fl"] = name[0].lower()
   ....:     features["ll"] = name[-1].lower()
   ....:     features["fw"] = name[:2].lower()
   ....:     features["lw"] = name[-2:].lower()
   ....:     return features
   ....:

In [50]: gender_features3('Gary')
Out[50]: {'fl': 'g', 'fw': 'ga', 'll': 'y', 'lw': 'ry'}

In [51]: gender_features3('G')
Out[51]: {'fl': 'g', 'fw': 'g', 'll': 'g', 'lw': 'g'}

In [52]: gender_features3('Gary')
Out[52]: {'fl': 'g', 'fw': 'ga', 'll': 'y', 'lw': 'ry'}

In [53]: featuresets = [(gender_features3(n), g) for (n, g) in names]

In [54]: featuresets[0]
Out[54]: ({'fl': u'm', 'fw': u'ma', 'll': u'e', 'lw': u'he'}, 'female')

In [55]: len(featuresets)
Out[55]: 7944

In [56]: train_set, test_set = featuresets[500:], featuresets[:500]

In [57]: nb3_classifier = NaiveBayesClassifier.train(train_set)

In [59]: classify.accuracy(nb3_classifier, test_set)
Out[59]: 0.77

In [60]: me3_classifier = MaxentClassifier.train(train_set)
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.369
             2          -0.40398        0.800
             3          -0.34739        0.821
             4          -0.32196        0.825
             5          -0.30766        0.827
            ......
            95          -0.25608        0.839
            96          -0.25605        0.839
            97          -0.25601        0.839
            98          -0.25598        0.839
            99          -0.25595        0.839
         Final          -0.25591        0.839

In [61]: classify.accuracy(me3_classifier, test_set)
Out[61]: 0.798
It seems that with proper feature extraction, the Maximum Entropy classifier can get better performance on the test set. In fact, selecting the right features is often the most important part of supervised text classification: you need to spend a lot of time choosing features, selecting a good learning algorithm, and adjusting its parameters for your text classifier.
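One common way to spend that time productively, described in the NLTK book's chapter on classification, is to hold out a separate development test set, look at the errors the classifier makes on it, and refine the feature extractor accordingly. A rough sketch of that loop, reusing names and gender_features from above (the split sizes here are arbitrary):

# Split into train / dev-test / test: tune features on dev-test, report on test.
train_names = names[1500:]
devtest_names = names[500:1500]
test_names = names[:500]

dev_train_set = [(gender_features(n), g) for (n, g) in train_names]
classifier = NaiveBayesClassifier.train(dev_train_set)

# Collect the names the classifier gets wrong on the dev-test set.
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

for (tag, guess, name) in sorted(errors)[:20]:
    print 'correct=%-8s guess=%-8s name=%s' % (tag, guess, name)

Reading through these misclassified names often suggests new features (for example, the two-letter suffixes used in gender_features3), after which you retrain and check the dev-test accuracy again, keeping the final test set untouched until the end.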
This concludes our preliminary study on text classification in NLTK; in the next chapter we will dive into text classification with a practical example.
Posted by TextMiner