Dive Into NLTK, Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
This is the eighth article in the series “Dive Into NLTK”; here is an index of all the articles in the series that have been published to date:
Part I: Getting Started with NLTK
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
Part IX: From Text Classification to Sentiment Analysis
Part X: Play With Word2Vec Models based on NLTK Corpus
Maximum entropy modeling, also known as multinomial logistic regression, has been one of the most popular frameworks for text analysis tasks since Berger and Della Pietra introduced it to the NLP field in 1996. Many maximum entropy tools and libraries have been implemented in several programming languages since then; you can find a fairly complete list of maximum entropy modeling software, along with very useful background material, on the website maintained by Dr. Le Zhang: Maximum Entropy Modeling.
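As a quick reminder of why the two names describe the same model, the conditional maximum entropy classifier has the familiar log-linear (multinomial logistic) form, with joint features f_i and weights λ_i:

\[
P(y \mid x) = \frac{\exp\Big(\sum_i \lambda_i f_i(x, y)\Big)}{\sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big)}
\]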
NLTK provides several learning algorithms for text classification, such as Naive Bayes and decision trees, and it also includes maximum entropy models; you can find them all in the nltk/classify module. For maximum entropy modeling, the details are in maxent.py:
# Natural Language Toolkit: Maximum Entropy Classifiers
#
# Copyright (C) 2001-2014 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
#         Dmitry Chichkov <dchichkov@gmail.com> (TypedMaxentFeatureEncoding)
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT

"""
A classifier model based on maximum entropy modeling framework.  This
framework considers all of the probability distributions that are
empirically consistent with the training data; and chooses the
distribution with the highest entropy.  A probability distribution is
"empirically consistent" with a set of training data if its estimated
frequency with which a class and a feature vector value co-occur is
equal to the actual frequency in the data.

Terminology: 'feature'
======================
The term *feature* is usually used to refer to some property of an
unlabeled token.  For example, when performing word sense
disambiguation, we might define a ``'prevword'`` feature whose value is
the word preceding the target word.  However, in the context of
maxent modeling, the term *feature* is typically used to refer to a
property of a "labeled" token.  In order to prevent confusion, we
will introduce two distinct terms to disambiguate these two different
concepts:

  - An "input-feature" is a property of an unlabeled token.
  - A "joint-feature" is a property of a labeled token.

In the rest of the ``nltk.classify`` module, the term "features" is
used to refer to what we will call "input-features" in this module.

In literature that describes and discusses maximum entropy models,
input-features are typically called "contexts", and joint-features
are simply referred to as "features".

Converting Input-Features to Joint-Features
-------------------------------------------
In maximum entropy models, joint-features are required to have numeric
values.  Typically, each input-feature ``input_feat`` is mapped to a
set of joint-features of the form:

|   joint_feat(token, label) = { 1 if input_feat(token) == feat_val
|                              {      and label == some_label
|                              {
|                              { 0 otherwise

For all values of ``feat_val`` and ``some_label``.  This mapping is
performed by classes that implement the ``MaxentFeatureEncodingI``
interface.
"""
from __future__ import print_function, unicode_literals
__docformat__ = 'epytext en'

try:
    import numpy
except ImportError:
    pass

import time
import tempfile
import os
import gzip
from collections import defaultdict

from nltk import compat
from nltk.data import gzip_open_unicode
from nltk.util import OrderedDict
from nltk.probability import DictionaryProbDist

from nltk.classify.api import ClassifierI
from nltk.classify.util import CutoffChecker, accuracy, log_likelihood
from nltk.classify.megam import call_megam, write_megam_file, parse_megam_weights
from nltk.classify.tadm import call_tadm, write_tadm_file, parse_tadm_weights
...
Within NLTK, the Maxent trainers support GIS (Generalized Iterative Scaling), IIS (Improved Iterative Scaling), and LM-BFGS. The first two are implemented in pure Python and tend to be very slow and memory-hungry on the same training data. The LM-BFGS algorithm is provided through external libraries such as MEGAM (MEGA Model Optimization Package), which is very efficient.
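To see how a trainer is selected, note that MaxentClassifier.train takes an algorithm argument; here is a minimal sketch on a tiny made-up training set (the feature dicts, labels, and max_iter value are just for illustration):

from nltk import MaxentClassifier

# A tiny, made-up training set: (feature dict, label) pairs.
train_toks = [
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "k"}, "male"),
    ({"last_letter": "e"}, "female"),
    ({"last_letter": "o"}, "male"),
]

# Pure-Python trainers shipped with NLTK (slow on large data sets).
gis_classifier = MaxentClassifier.train(train_toks, algorithm="GIS", max_iter=10)
iis_classifier = MaxentClassifier.train(train_toks, algorithm="IIS", max_iter=10)

# The external LM-BFGS trainer; this needs a megam binary installed (see below).
# megam_classifier = MaxentClassifier.train(train_toks, algorithm="megam")

print(gis_classifier.classify({"last_letter": "a"}))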
MEGAM is written in OCaml, the main implementation of the Caml language. Caml is a general-purpose programming language designed with program safety and reliability in mind. In order to use MEGAM on your system, you need to install OCaml first. On my Ubuntu 12.04 VPS, it was very easy to install the latest OCaml version, 4.02:
wget http://caml.inria.fr/pub/distrib/ocaml-4.02/ocaml-4.02.1.tar.gz
tar -zxvf ocaml-4.02.1.tar.gz
cd ocaml-4.02.1
./configure
make world.opt
sudo make install
After installing OCaml, it's time to install MEGAM:
wget http://hal3.name/megam/megam_src.tgz
tar -zxvf megam_src.tgz
cd megam_0.92
According to the README, installing megam is very easy:
To build a safe but slow version, just execute:
make
which will produce an executable megam, unless something goes wrong.
To build a fast but not so safe version, execute
make opt
which will produce an executable megam.opt that will be much much
faster. If you encounter any bugs, please let me know (if something
crashes, it’s probably easiest to switch to the safe but slow version,
run it, and let me know what the error message is).
But when we first executed “make”, we hit an error like this:
…
ocamlc -g -custom -o megam str.cma -cclib -lstr bigarray.cma -cclib -lbigarray unix.cma -cclib -lunix -I /usr/lib/ocaml/caml fastdot_c.c fastdot.cmo intHashtbl.cmo arry.cmo util.cmo data.cmo bitvec.cmo cg.cmo wsemlm.cmo bfgs.cmo pa.cmo perceptron.cmo radapt.cmo kernelmap.cmo abffs.cmo main.cmo
fastdot_c.c:4:19: fatal error: alloc.h: No such file or directory
…
Here you should run “ocamlc -where” to find the right OCaml library path (/usr/local/lib/ocaml on my machine), and then edit line 74 of the Makefile (note that this edit is for my Ubuntu 12.04 VPS):
#WITHCLIBS =-I /usr/lib/ocaml/caml
WITHCLIBS =-I /usr/local/lib/ocaml/caml
Then execute “make” again; we hit another problem:
…
ocamlc -g -custom -o megam str.cma -cclib -lstr bigarray.cma -cclib -lbigarray unix.cma -cclib -lunix -I /usr/local/lib/ocaml/caml fastdot_c.c fastdot.cmo intHashtbl.cmo arry.cmo util.cmo data.cmo bitvec.cmo cg.cmo wsemlm.cmo bfgs.cmo pa.cmo perceptron.cmo radapt.cmo kernelmap.cmo abffs.cmo main.cmo
/usr/bin/ld: cannot find -lstr
…
Here you should edit the Makefile again, changing -lstr on line 62 to -lcamlstr:
#WITHSTR =str.cma -cclib -lstr
WITHSTR =str.cma -cclib -lcamlstr
Then you can type “make” to build the slower but safe “megam” executable, and “make opt” to build the much faster “megam.opt” executable, both in the Makefile directory.
But that’s not the end: if you want to use it from NLTK, you need to tell NLTK where to find the “megam” or “megam.opt” binary. NLTK uses config_megam to look up the binary:
_megam_bin = None

def config_megam(bin=None):
    """
    Configure NLTK's interface to the ``megam`` maxent optimization
    package.

    :param bin: The full path to the ``megam`` binary.  If not specified,
        then nltk will search the system for a ``megam`` binary; and if
        one is not found, it will raise a ``LookupError`` exception.
    :type bin: str
    """
    global _megam_bin
    _megam_bin = find_binary(
        'megam', bin,
        env_vars=['MEGAM'],
        binary_names=['megam.opt', 'megam', 'megam_686', 'megam_i686.opt'],
        url='http://www.umiacs.umd.edu/~hal/megam/index.html')
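If the binary is not on your PATH, you can call config_megam yourself before training; a minimal sketch, where the path below is only a hypothetical build location:

from nltk.classify import megam

# Point NLTK at the compiled binary explicitly (hypothetical path).
megam.config_megam("/home/user/megam_0.92/megam.opt")

# With no argument, NLTK searches the PATH and the MEGAM environment variable.
# megam.config_megam()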
The best way to do this is to copy the megam binaries into a system binary path like /usr/bin or /usr/local/bin; then you never need to call config_megam each time you use maximum entropy modeling in NLTK.
sudo cp megam /usr/local/bin/
sudo cp megam.opt /usr/local/bin/
Just use it like this:
In [1]: import random

In [2]: from nltk.corpus import names

In [3]: from nltk import MaxentClassifier

In [5]: from nltk import classify

In [7]: names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

In [8]: random.shuffle(names)

In [10]: def gender_features3(name):
   ....:     features = {}
   ....:     features["fl"] = name[0].lower()
   ....:     features["ll"] = name[-1].lower()
   ....:     features["fw"] = name[:2].lower()
   ....:     features["lw"] = name[-2:].lower()
   ....:     return features
   ....:

In [11]: featuresets = [(gender_features3(n), g) for (n, g) in names]

In [12]: train_set, test_set = featuresets[500:], featuresets[:500]

In [17]: me3_megam_classifier = MaxentClassifier.train(train_set, "megam")
[Found megam: megam]
Scanning file...7444 train, 0 dev, 0 test, reading...done
Warning: there only appear to be two classes, but we're optimizing with BFGS...using binary optimization with CG would be much faster
optimizing with lambda = 0
it 1   dw 4.640e-01 pp 6.38216e-01 er 0.37413
it 2   dw 2.065e-01 pp 5.74892e-01 er 0.37413
it 3   dw 3.503e-01 pp 5.43226e-01 er 0.24328
it 4   dw 1.209e-01 pp 5.29406e-01 er 0.22394
it 5   dw 4.864e-01 pp 5.27097e-01 er 0.26115
it 6   dw 5.765e-01 pp 4.92409e-01 er 0.23415
it 7   dw 0.000e+00 pp 4.92409e-01 er 0.23415
-------------------------
it 1   dw 1.802e-01 pp 4.74930e-01 er 0.21588
it 2   dw 3.478e-02 pp 4.70876e-01 er 0.21548
it 3   dw 1.963e-01 pp 4.61761e-01 er 0.21709
it 4   dw 9.624e-02 pp 4.56257e-01 er 0.21574
it 5   dw 3.442e-01 pp 4.54401e-01
......
it 10  dw 2.399e-03 pp 3.71020e-01 er 0.16967
it 11  dw 2.202e-02 pp 3.71017e-01 er 0.16980
it 12  dw 0.000e+00 pp 3.71017e-01 er 0.16980
-------------------------
it 1   dw 2.620e-02 pp 3.70816e-01 er 0.17020
it 2   dw 2.285e-02 pp 3.70721e-01 er 0.16953
it 3   dw 1.074e-02 pp 3.70631e-01 er 0.16980
it 4   dw 3.152e-02 pp 3.70580e-01 er 0.16994
it 5   dw 2.263e-02 pp 3.70504e-01 er 0.16940
it 6   dw 1.115e-01 pp 3.70370e-01 er 0.16886
it 7   dw 1.938e-01 pp 3.70318e-01 er 0.16913
it 8   dw 1.365e-01 pp 3.69815e-01 er 0.16940
it 9   dw 2.634e-01 pp 3.69366e-01 er 0.16873
it 10  dw 2.498e-01 pp 3.69290e-01 er 0.17007
it 11  dw 2.515e-01 pp 3.69240e-01 er 0.16994
it 12  dw 3.027e-01 pp 3.69234e-01 er 0.16994
it 13  dw 9.850e-03 pp 3.69233e-01 er 0.16994
it 14  dw 1.152e-01 pp 3.69214e-01 er 0.16994
it 15  dw 0.000e+00 pp 3.69214e-01 er 0.16994

In [18]: classify.accuracy(me3_megam_classifier, test_set)
Out[18]: 0.812
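Once trained, the model behaves like any other NLTK classifier; a short sketch of how you might use it (the example names below are arbitrary):

# Classify new, unseen names with the trained MEGAM model.
print(me3_megam_classifier.classify(gender_features3("Alice")))
print(me3_megam_classifier.classify(gender_features3("John")))

# Inspect the most informative joint-features learned by the model.
me3_megam_classifier.show_most_informative_features(10)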
The MEGAM integration in NLTK relies on Python's subprocess.Popen, and it is very fast and uses far fewer resources than the original GIS or IIS Maxent trainers in NLTK. I have also compiled it on my local Mac Pro and ran into the same problems as in the Ubuntu build above. You can also reference this article to compile MEGAM on Mac OS X: Compiling MegaM on MacOS X.
That’s the end; enjoy using MEGAM in your NLTK or Python projects.
Posted by TextMiner
Hi, can you explain the output of the megam classifier? Why are there so many iterations?
Those are training iterations, each of which lowers the cost.