Dive Into NLTK, Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification

This is the eighth article in the series "Dive Into NLTK"; here is an index of all the articles in the series that have been published to date:

Part I: Getting Started with NLTK
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
Part IX: From Text Classification to Sentiment Analysis
Part X: Play With Word2Vec Models based on NLTK Corpus

Maximum entropy modeling, also known as multinomial logistic regression, has been one of the most popular frameworks for text analysis tasks since it was introduced into the NLP field by Berger and Della Pietra in 1996. Many maximum entropy modeling tools and libraries have been implemented in several programming languages since then, and you can find a comprehensive list of maximum entropy related software on the Maximum Entropy Modeling website maintained by Dr. Le Zhang, which also contains very useful materials on maximum entropy models.

NLTK provides several learning algorithms for text classification, such as naive Bayes, decision trees, and maximum entropy models; you can find them all in the nltk/classify module. For maximum entropy modeling, the details are in maxent.py:

# Natural Language Toolkit: Maximum Entropy Classifiers
# Copyright (C) 2001-2014 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
#         Dmitry Chichkov <dchichkov@gmail.com> (TypedMaxentFeatureEncoding)
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
"""
A classifier model based on maximum entropy modeling framework.  This
framework considers all of the probability distributions that are
empirically consistent with the training data; and chooses the
distribution with the highest entropy.  A probability distribution is
"empirically consistent" with a set of training data if its estimated
frequency with which a class and a feature vector value co-occur is
equal to the actual frequency in the data.
Terminology: 'feature'
The term *feature* is usually used to refer to some property of an
unlabeled token.  For example, when performing word sense
disambiguation, we might define a ``'prevword'`` feature whose value is
the word preceding the target word.  However, in the context of
maxent modeling, the term *feature* is typically used to refer to a
property of a "labeled" token.  In order to prevent confusion, we
will introduce two distinct terms to disambiguate these two different concepts:
  - An "input-feature" is a property of an unlabeled token.
  - A "joint-feature" is a property of a labeled token.
In the rest of the ``nltk.classify`` module, the term "features" is
used to refer to what we will call "input-features" in this module.
In literature that describes and discusses maximum entropy models,
input-features are typically called "contexts", and joint-features
are simply referred to as "features".
Converting Input-Features to Joint-Features
In maximum entropy models, joint-features are required to have numeric
values.  Typically, each input-feature ``input_feat`` is mapped to a
set of joint-features of the form:
|   joint_feat(token, label) = { 1 if input_feat(token) == feat_val
|                              {      and label == some_label
|                              {
|                              { 0 otherwise
For all values of ``feat_val`` and ``some_label``.  This mapping is
performed by classes that implement the ``MaxentFeatureEncodingI``
interface.
"""
from __future__ import print_function, unicode_literals

__docformat__ = 'epytext en'

try:
    import numpy
except ImportError:
    pass
import time
import tempfile
import os
import gzip
from collections import defaultdict
from nltk import compat
from nltk.data import gzip_open_unicode
from nltk.util import OrderedDict
from nltk.probability import DictionaryProbDist
from nltk.classify.api import ClassifierI
from nltk.classify.util import CutoffChecker, accuracy, log_likelihood
from nltk.classify.megam import call_megam, write_megam_file, parse_megam_weights
from nltk.classify.tadm import call_tadm, write_tadm_file, parse_tadm_weights

Within NLTK, the maxent training algorithms include GIS (Generalized Iterative Scaling), IIS (Improved Iterative Scaling), and LM-BFGS. The first two are implemented in pure Python within NLTK, but they are quite slow and use a lot of memory on the same training data; LM-BFGS optimization, on the other hand, is provided through external libraries such as MEGAM (MEGA Model Optimization Package), which works very well.
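
If you just want to compare the trainers from Python, the algorithm name is passed to MaxentClassifier.train. Below is a minimal sketch, not from the original post; the toy 'last_letter' featuresets are made up purely for illustration:

from nltk import MaxentClassifier

# Tiny, made-up training set: one feature per example.
train_set = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'e'}, 'female'),
    ({'last_letter': 'o'}, 'male'),
]

# Pure-Python trainers shipped with NLTK (slow and memory-hungry on large data):
gis_classifier = MaxentClassifier.train(train_set, algorithm='GIS', max_iter=10)
iis_classifier = MaxentClassifier.train(train_set, algorithm='IIS', max_iter=10)

# External trainer; requires the megam binary to be installed and configured first:
# megam_classifier = MaxentClassifier.train(train_set, algorithm='megam')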

MEGAM is built on the OCaml system, which is the main implementation of the Caml language. Caml is a general-purpose programming language, designed with program safety and reliability in mind. In order to use MEGAM on your system, you need to install OCaml first. On my Ubuntu 12.04 VPS, it was very easy to install the latest OCaml version, 4.02:

wget http://caml.inria.fr/pub/distrib/ocaml-4.02/ocaml-4.02.1.tar.gz
tar -zxvf ocaml-4.02.1.tar.gz
cd ocaml-4.02.1
./configure
make world.opt
sudo make install

After installing OCaml, it's time to install MEGAM:

wget http://hal3.name/megam/megam_src.tgz
tar -zxvf megam_src.tgz
cd megam_0.92

According to the README, installing MEGAM is very easy:

To build a safe but slow version, just execute:

make

which will produce an executable megam, unless something goes wrong.

To build a fast but not so safe version, execute

make opt

which will produce an executable megam.opt that will be much much
faster. If you encounter any bugs, please let me know (if something
crashes, it’s probably easiest to switch to the safe by slow version,
run it, and let me know what the error message is).

But when we execute "make" for the first time, we get an error like this:

ocamlc -g -custom -o megam str.cma -cclib -lstr bigarray.cma -cclib -lbigarray unix.cma -cclib -lunix -I /usr/lib/ocaml/caml fastdot_c.c fastdot.cmo intHashtbl.cmo arry.cmo util.cmo data.cmo bitvec.cmo cg.cmo wsemlm.cmo bfgs.cmo pa.cmo perceptron.cmo radapt.cmo kernelmap.cmo abffs.cmo main.cmo
fastdot_c.c:4:19: fatal error: alloc.h: No such file or directory

Here you should use "ocamlc -where" to find the right OCaml library path, which is /usr/local/lib/ocaml in my case, and then edit line 74 of the Makefile (note that this editing is on my Ubuntu 12.04 VPS):

#WITHCLIBS =-I /usr/lib/ocaml/caml
WITHCLIBS =-I /usr/local/lib/ocaml/caml

Then execute "make" again, but we hit another problem:

ocamlc -g -custom -o megam str.cma -cclib -lstr bigarray.cma -cclib -lbigarray unix.cma -cclib -lunix -I /usr/local/lib/ocaml/caml fastdot_c.c fastdot.cmo intHashtbl.cmo arry.cmo util.cmo data.cmo bitvec.cmo cg.cmo wsemlm.cmo bfgs.cmo pa.cmo perceptron.cmo radapt.cmo kernelmap.cmo abffs.cmo main.cmo
/usr/bin/ld: cannot find -lstr

Here you should edit the Makefile again, changing -lstr to -lcamlstr on line 62:

#WITHSTR =str.cma -cclib -lstr
WITHSTR =str.cma -cclib -lcamlstr

Then you can run "make" to build the slower executable "megam", or "make opt" to build the faster executable "megam.opt", in the Makefile directory.

But that's not the end. If you want to use it in NLTK, you should tell NLTK where to find the "megam" or "megam.opt" binary; NLTK uses config_megam to look up the binary:

_megam_bin = None
def config_megam(bin=None):
    """
    Configure NLTK's interface to the ``megam`` maxent optimization
    package.

    :param bin: The full path to the ``megam`` binary.  If not specified,
        then nltk will search the system for a ``megam`` binary; and if
        one is not found, it will raise a ``LookupError`` exception.
    :type bin: str
    """
    global _megam_bin
    _megam_bin = find_binary(
        'megam', bin,
        env_vars=['MEGAM'],
        binary_names=['megam.opt', 'megam', 'megam_686', 'megam_i686.opt'],
        url='http://www.umiacs.umd.edu/~hal/megam/index.html')
The best way to avoid this is to copy the binary into a system binary path such as /usr/bin or /usr/local/bin; then you never need to call config_megam every time you use maximum entropy modeling in NLTK:

sudo cp megam /usr/local/bin/
sudo cp megam.opt /usr/local/bin/

Just use it like this:

In [1]: import random
In [2]: from nltk.corpus import names
In [3]: from nltk import MaxentClassifier
In [5]: from nltk import classify
In [7]: names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])
In [8]: random.shuffle(names)
In [10]: def gender_features3(name):
        features = {}
        features["fl"] = name[0].lower()
        features["ll"] = name[-1].lower()
        features["fw"] = name[:2].lower()
        features["lw"] = name[-2:].lower()
        return features
In [11]: featuresets = [(gender_features3(n), g) for (n, g) in names]
In [12]: train_set, test_set = featuresets[500:], featuresets[:500]
In [17]: me3_megam_classifier = MaxentClassifier.train(train_set, "megam")
[Found megam: megam]
Scanning file...7444 train, 0 dev, 0 test, reading...done
Warning: there only appear to be two classes, but we're
         optimizing with BFGS...using binary optimization
         with CG would be much faster
optimizing with lambda = 0
it 1   dw 4.640e-01 pp 6.38216e-01 er 0.37413
it 2   dw 2.065e-01 pp 5.74892e-01 er 0.37413
it 3   dw 3.503e-01 pp 5.43226e-01 er 0.24328
it 4   dw 1.209e-01 pp 5.29406e-01 er 0.22394
it 5   dw 4.864e-01 pp 5.27097e-01 er 0.26115
it 6   dw 5.765e-01 pp 4.92409e-01 er 0.23415
it 7   dw 0.000e+00 pp 4.92409e-01 er 0.23415
it 1 dw 1.802e-01 pp 4.74930e-01 er 0.21588
it 2   dw 3.478e-02 pp 4.70876e-01 er 0.21548
it 3   dw 1.963e-01 pp 4.61761e-01 er 0.21709
it 4   dw 9.624e-02 pp 4.56257e-01 er 0.21574
it 5   dw 3.442e-01 pp 4.54401e-01
it 10  dw 2.399e-03 pp 3.71020e-01 er 0.16967
it 11  dw 2.202e-02 pp 3.71017e-01 er 0.16980
it 12  dw 0.000e+00 pp 3.71017e-01 er 0.16980
it 1 dw 2.620e-02 pp 3.70816e-01 er 0.17020
it 2   dw 2.285e-02 pp 3.70721e-01 er 0.16953
it 3   dw 1.074e-02 pp 3.70631e-01 er 0.16980
it 4   dw 3.152e-02 pp 3.70580e-01 er 0.16994
it 5   dw 2.263e-02 pp 3.70504e-01 er 0.16940
it 6   dw 1.115e-01 pp 3.70370e-01 er 0.16886
it 7   dw 1.938e-01 pp 3.70318e-01 er 0.16913
it 8   dw 1.365e-01 pp 3.69815e-01 er 0.16940
it 9   dw 2.634e-01 pp 3.69366e-01 er 0.16873
it 10  dw 2.498e-01 pp 3.69290e-01 er 0.17007
it 11  dw 2.515e-01 pp 3.69240e-01 er 0.16994
it 12  dw 3.027e-01 pp 3.69234e-01 er 0.16994
it 13  dw 9.850e-03 pp 3.69233e-01 er 0.16994
it 14  dw 1.152e-01 pp 3.69214e-01 er 0.16994
it 15  dw 0.000e+00 pp 3.69214e-01 er 0.16994
In [18]: classify.accuracy(me3_megam_classifier, test_set)
Out[18]: 0.812
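
Once trained, the megam-based classifier behaves like any other NLTK classifier. Here is a minimal sketch, not part of the original session; the name "Trinity" is just an example input:

# Classify a new name with the trained classifier.
features = gender_features3('Trinity')
print(me3_megam_classifier.classify(features))       # predicted label

# Per-label probabilities.
dist = me3_megam_classifier.prob_classify(features)
for label in dist.samples():
    print(label, dist.prob(label))

# Inspect the most informative joint-features learned from the training data.
me3_megam_classifier.show_most_informative_features(5)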

MEGAM support in NLTK relies on Python's subprocess.Popen, and it is very fast and consumes far fewer resources than the original GIS or IIS maxent trainers in NLTK. I have also compiled it on my local Mac Pro and ran into the same problems as in the Ubuntu Linux compile process. You can also consult this article for compiling MEGAM on Mac OS X: Compiling MegaM on MacOS X.

That's the end; just enjoy using MEGAM in your NLTK or Python project.

Posted by TextMiner
