Dive Into NLTK, Part IX: From Text Classification to Sentiment Analysis

This is the ninth article in the series “Dive Into NLTK”; here is an index of all the articles in the series that have been published to date:

Part I: Getting Started with NLTK
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
Part IX: From Text Classification to Sentiment Analysis
Part X: Play With Word2Vec Models based on NLTK Corpus

According to Wikipedia, sentiment analysis is defined as follows:

Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. Sentiment analysis is widely applied to reviews and social media for a variety of applications, ranging from marketing to customer service.

Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader).

Generally speaking, sentiment analysis can be treated as a text classification task. Based on the movie review data that ships with NLTK, we can train a basic text classification model for sentiment analysis:
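
Note that the movie_reviews corpus is part of NLTK's downloadable data and may not be installed by default; if the import below fails, you can fetch it first (a one-line sketch):

import nltk
# One-off download of the movie review corpus used throughout this article.
nltk.download('movie_reviews')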

Python 2.7.6 (default, Jun  3 2014, 07:43:23) 
Type "copyright", "credits" or "license" for more information.
 
IPython 3.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: import nltk
 
In [2]: from nltk.corpus import movie_reviews
 
In [3]: from random import shuffle
 
In [4]: documents = [(list(movie_reviews.words(fileid)), category) 
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
 
In [5]: shuffle(documents)
 
In [6]: print documents[0]
([u'the', u'general', u"'", u's', u'daughter', u'will', u'probably', u'be', u'the', u'cleverest', u'stupid', u'film', u'we', u"'", u'll', u'see', u'this', u'year', u'--', u'or', u'perhaps', u'the', u'stupidest', u'clever', u'film', u'.', u'it', u"'", u's', u'confusing', u'to', u'a', u'critic', u'when', u'so', u'much', u'knuckleheaded', u'plotting', u'and', u'ostentatious', u'direction', u'shares', u'the', u'screen', u'with', u'so', u'much', u'snappy', u'dialogue', u'and', u'crisp', u'character', u'interaction', u'.', u'that', u',', u'however', u',', u'is', u'what', u'happens', u'when', u'legendary', u'screenwriter', u'william', u'goldman', u'takes', u'a', u'pass', u'at', u'an', u'otherwise', u'brutally', u'predictable', u'conspiracy', u'thriller', u'.', u'the', u'punched', u'-', u'up', u'punch', u'lines', u'are', u'ever', u'on', u'the', u'verge', u'of', u'convincing', u'you', u'the', u'general', u"'", u's', u'daughter', u'has', u'a', u'brain', u'in', u'its', u'head', u',', u'even', u'as', u'the', u'remaining', u'75', u'%', u'of', u'the', u'narrative', u'punches', u'you', u'in', u'the', u'face', u'with', u'its', u'lack', u'of', u'common', u'sense', u'.', u'our', u'hero', u'is', u'warrant', u'officer', u'paul', u'brenner', u',', u'a', u'brash', u'investigator', u'for', u'the', u'u', u'.', u's', u'.', u'army', u"'", u's', u'criminal', u'investigation', u'division', u'.', u'his', u'latest', u'case', u'is', u'the', u'murder', u'of', u'captain', u'elisabeth', u'campbell', u'(', u'leslie', u'stefanson', u')', u'at', u'a', u'georgia', u'base', u',', u'the', u'victim', u'found', u'tied', u'to', u'the', u'ground', u'after', u'an', u'apparent', u'sexual', u'assault', u'and', u'strangulation', u'.', u'complicating', u'the', u'case', u'is', u'the', u'fact', u'that', u'capt', u'.', u'campbell', u'is', u'the', u'daughter', u'of', u'general', u'joe', u'campbell', u'(', u'james', u'cromwell', u')', u',', u'a', u'war', u'hero', u'and', u'potential', u'vice', u'-', u'presidential', u'nominee', u'.', 
......
u'general', u'campbell', u'wants', u'to', u'keep', u'the', u'case', u'out', u'of', u'the', u'press', u',', u'which', u'gives', u'brenner', u'only', u'the', u'36', u'hours', u'before', u'the', u'fbi', u'steps', u'in', u'.', u'teamed', u'with', u'rape', u'investigator', u'sarah', u'sunhill', u'(', u'madeleine', u'stowe', u')', u'--', u'who', u',', u'coincidentally', u'enough', u',', u'once', u'had', u'a', u'romantic', u'relationship', u'with', u'brenner', u'--', u'brenner', u'begins', u'uncovering', \ u'evidence', u'out', u'of', u'the', u'corner', u'of', u'his', u'eye', u')', u'.', u'by', u'the', u'time', u'the', u'general', u"'", u's', u'daughter', u'wanders', u'towards', u'its', u'over', u'-', u'wrought', u',', u'psycho', u'-', u'in', u'-', u'the', u'-', u'rain', u'finale', u',', u'west', u"'", u's', u'heavy', u'hand', u'has', u'obliterated', u'most', u'of', u'what', u'made', u'the', u'film', u'occasionally', u'fun', u'.', u'it', u"'", u's', u'silly', u'and', u'pretentious', u'film', u'-', u'making', u',', u'but', u'at', u'least', u'it', u'provides', u'a', u'giggle', u'or', u'five', u'.', u'goldman', u'should', u'tear', u'the', u'15', u'decent', u'pages', u'out', u'of', u'this', u'script', u'and', u'turn', u'them', u'into', u'a', u'stand', u'-', u'up', u'routine', u'.'], u'neg')
 
 
# The total number of movie review documents in NLTK is 2000
In [7]: len(documents)
Out[7]: 2000
 
 
# Construct a list of the 2,000 most frequent words in the overall corpus 
In [8]: all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
 
In [9]: word_features = all_words.keys()[:2000]
 
# Define a feature extractor that simply checks whether each of these words is present in a given document.
In [10]: def document_features(document):
   ....:     document_words = set(document)
   ....:     features = {}
   ....:     for word in word_features:
   ....:         features['contains(%s)' % word] = (word in document_words)
   ....:     return features
   ....: 
 
 
In [11]: print document_features(movie_reviews.words('pos/cv957_8737.txt'))
{u'contains(waste)': False, u'contains(lot)': False, u'contains(*)': True, u'contains(black)': False, u'contains(rated)': False, u'contains(potential)': False, u'contains(m)': False, u'contains(understand)': False, u'contains(drug)': True, u'contains(case)': False, u'contains(created)': False, u'contains(kiss)': False, u'contains(needed)': False, u'contains(c)': False, u'contains(about)': True, u'contains(toy)': False, u'contains(longer)': False, u'contains(ready)': False, u'contains(certainly)': False, 
......
u'contains(good)': False, u'contains(live)': False, u'contains(appropriate)': False, u'contains(towards)': False, u'contains(smile)': False, u'contains(cross)': False}
 
# Generate the feature sets for the movie review documents one by one
In [12]: featuresets = [(document_features(d), c) for (d, c) in documents]
 
# Define the train set (1900 documents) and test set (100 documents)
In [13]: train_set, test_set = featuresets[100:], featuresets[:100]
 
# Train a Naive Bayes classifier on the train set with NLTK
In [14]: classifier = nltk.NaiveBayesClassifier.train(train_set)
 
# Get the accuracy of the Naive Bayes classifier on the test set
In [15]: print nltk.classify.accuracy(classifier, test_set)
0.81
 
# Debug info: show top n most informative features
In [16]: classifier.show_most_informative_features(10)
Most Informative Features
   contains(outstanding) = True              pos : neg    =     13.3 : 1.0
         contains(mulan) = True              pos : neg    =      8.8 : 1.0
        contains(seagal) = True              neg : pos    =      8.0 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.5 : 1.0
         contains(damon) = True              pos : neg    =      6.2 : 1.0
         contains(awful) = True              neg : pos    =      6.0 : 1.0
        contains(wasted) = True              neg : pos    =      5.9 : 1.0
          contains(lame) = True              neg : pos    =      5.8 : 1.0
         contains(flynt) = True              pos : neg    =      5.5 : 1.0
        contains(poorly) = True              neg : pos    =      5.1 : 1.0
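
One caveat: the session above was run under Python 2, and depending on your NLTK version, FreqDist.keys() is not guaranteed to return words in decreasing frequency order, so most_common() is the safer way to pick the top 2,000 words. A minimal standalone sketch of the same Naive Bayes pipeline, assuming Python 3 and a recent NLTK, might look like this:

import nltk
from random import shuffle
from nltk.corpus import movie_reviews

# Collect (word list, label) pairs for every review and shuffle them.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
shuffle(documents)

# Use the 2,000 most frequent corpus words as binary "contains(...)" features.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    return {'contains(%s)' % w: (w in document_words) for w in word_features}

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))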

Based on the top-2000 word features, we can also train a maximum entropy classifier with NLTK and MEGAM.
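
Training with the "megam" algorithm calls the external MEGAM binary; if NLTK does not find it on your PATH automatically, you can point NLTK at the executable first (a minimal sketch; the path below is an assumption and should match your installation, as set up in Part VIII):

import nltk
# Tell NLTK where the MEGAM binary lives (adjust the path to your system).
nltk.config_megam('/usr/local/bin/megam')

With MEGAM available, we can train the maxent model: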

In [17]: maxent_classifier = nltk.MaxentClassifier.train(train_set, "megam")
[Found megam: /usr/local/bin/megam]
Scanning file...1900 train, 0 dev, 0 test, reading...done
Warning: there only appear to be two classes, but we're
         optimizing with BFGS...using binary optimization
         with CG would be much faster
optimizing with lambda = 0
it 1   dw 2.415e-03 pp 6.85543e-01 er 0.49895
it 2   dw 1.905e-03 pp 6.72937e-01 er 0.48895
it 3   dw 7.755e-03 pp 6.53779e-01 er 0.19526
it 4   dw 1.583e-02 pp 6.30863e-01 er 0.33526
it 5   dw 4.763e-02 pp 5.89126e-01 er 0.33895
it 6   dw 8.723e-02 pp 5.09921e-01 er 0.21211
it 7   dw 2.223e-01 pp 4.13823e-01 er 0.17000
it 8   dw 2.183e-01 pp 3.81889e-01 er 0.16526
it 9   dw 3.448e-01 pp 3.79054e-01 er 0.17421
it 10  dw 7.749e-02 pp 3.73549e-01 er 0.17105
it 11  dw 1.413e-01 pp 3.61806e-01 er 0.15842
it 12  dw 1.380e-01 pp 3.61716e-01 er 0.16000
it 13  dw 5.230e-02 pp 3.59953e-01 er 0.16053
it 14  dw 1.092e-01 pp 3.58713e-01 er 0.16211
it 15  dw 1.252e-01 pp 3.58669e-01 er 0.16000
it 16  dw 1.370e-01 pp 3.57027e-01 er 0.16105
it 17  dw 2.213e-01 pp 3.56230e-01 er 0.15684
it 18  dw 1.397e-01 pp 3.51368e-01 er 0.15579
it 19  dw 7.718e-01 pp 3.38156e-01 er 0.14947
it 20  dw 6.426e-02 pp 3.36342e-01 er 0.14947
it 21  dw 1.531e-01 pp 3.33402e-01 er 0.15053
it 22  dw 1.047e-01 pp 3.33287e-01 er 0.14895
it 23  dw 1.379e-01 pp 3.30814e-01 er 0.14895
it 24  dw 1.480e+00 pp 3.02938e-01 er 0.12842
it 25  dw 0.000e+00 pp 3.02938e-01 er 0.12842
-------------------------
......
......
-------------------------
it 1 dw 1.981e-05 pp 8.59536e-02 er 0.00684
it 2   dw 4.179e-05 pp 8.58979e-02 er 0.00684
it 3   dw 3.792e-04 pp 8.56536e-02 er 0.00684
it 4   dw 1.076e-03 pp 8.52961e-02 er 0.00737
it 5   dw 2.007e-03 pp 8.49459e-02 er 0.00737
it 6   dw 4.055e-03 pp 8.42942e-02 er 0.00737
it 7   dw 2.664e-02 pp 8.16976e-02 er 0.00526
it 8   dw 1.888e-02 pp 8.12042e-02 er 0.00316
it 9   dw 5.093e-02 pp 8.08672e-02 er 0.00316
it 10  dw 3.968e-03 pp 8.08624e-02 er 0.00316
it 11  dw 0.000e+00 pp 8.08624e-02 er 0.00316
 
In [18]: print nltk.classify.accuracy(maxent_classifier, test_set)
0.89
 
In [19]: maxent_classifier.show_most_informative_features(10)
  -1.843 contains(waste)==False and label is u'neg'
  -1.006 contains(boring)==False and label is u'neg'
  -0.979 contains(worst)==False and label is u'neg'
  -0.973 contains(bad)==False and label is u'neg'
  -0.953 contains(unfortunately)==False and label is u'neg'
  -0.864 contains(lame)==False and label is u'neg'
  -0.850 contains(attempt)==False and label is u'neg'
  -0.833 contains(supposed)==False and label is u'neg'
  -0.815 contains(seen)==True and label is u'neg'
  -0.783 contains(laughable)==False and label is u'neg'

It seems the maxent classifier achieves a better result on the test set. Let’s classify a test text with both the Naive Bayes classifier and the maxent classifier:

In [22]:  test_text = "I love this movie, very interesting"
 
In [23]: test_set = document_features(test_text.split())
 
In [24]: test_set
Out[24]: 
{u'contains(waste)': False,
 u'contains(lot)': False,
 u'contains(*)': False,
 u'contains(black)': False,
 u'contains(rated)': False,
 u'contains(potential)': False,
 u'contains(m)': False,
 u'contains(understand)': False,
 u'contains(drug)': False,
 u'contains(case)': False,
 u'contains(created)': False,
 u'contains(kiss)': False,
 u'contains(needed)': False,
 ......
 u'contains(happens)': False,
 u'contains(suddenly)': False,
 u'contains(almost)': False,
 u'contains(evil)': False,
 u'contains(building)': False,
 u'contains(michael)': False,
 ...}
 
# The Naive Bayes classifier gets the wrong result
In [25]: print classifier.classify(test_set)
neg
 
# The Maxent classifier gets it right
In [26]: print maxent_classifier.classify(test_set)
pos
 
# Let's see the probability result
In [27]: prob_result = classifier.prob_classify(test_set)
 
In [28]: prob_result
Out[28]: <ProbDist with 2 samples>
 
In [29]: prob_result.max()
Out[29]: u'neg'
 
In [30]: prob_result.prob("neg")
Out[30]: 0.99999917093621
 
In [31]: prob_result.prob("pos")
Out[31]: 8.29063793272753e-07
 
# Maxent classifier probability result
In [32]: print maxent_classifier.classify(test_set)
pos
 
In [33]: prob_result = maxent_classifier.prob_classify(test_set)
 
In [33]: prob_result.prob("pos")
Out[33]: 0.67570114045832497
 
In [34]: prob_result.prob("neg")
Out[34]: 0.32429885954167498
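
Note that test_text.split() is a very crude tokenizer: it leaves punctuation glued to words (e.g. “movie,”), so some contains(...) features may never fire. A small helper built on NLTK's own tokenizer might look like the sketch below (classify_text is a hypothetical name; it assumes the document_features function and classifiers from the session above, plus the punkt tokenizer models):

from nltk.tokenize import word_tokenize

def classify_text(text, clf):
    # Tokenize and lowercase the raw text before feature extraction,
    # so a token like "movie," does not miss the "movie" feature.
    tokens = [w.lower() for w in word_tokenize(text)]
    return clf.classify(document_features(tokens))

# e.g. classify_text("I love this movie, very interesting", maxent_classifier)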

So far we have only used the top-n word features; for this sentiment analysis problem, adding more features may yield a better result. So we redesign the word features as a simple bag of words:

In [40]: def bag_of_words(words):
   ....:     return dict([(word, True) for word in words])
   ....: 
 
In [43]: data_sets = [(bag_of_words(d), c) for (d, c) in documents]
 
In [44]: len(data_sets)
Out[44]: 2000
 
In [45]: train_set, test_set = data_sets[100:], data_sets[:100]
 
In [46]: bayes_classifier = nltk.NaiveBayesClassifier.train(train_set)
 
In [47]: print nltk.classify.accuracy(bayes_classifier, test_set)
0.8
 
In [48]: bayes_classifier.show_most_informative_features(10)
Most Informative Features
             outstanding = True              pos : neg    =     13.9 : 1.0
                  avoids = True              pos : neg    =     13.1 : 1.0
              astounding = True              pos : neg    =     11.7 : 1.0
                 insipid = True              neg : pos    =     11.0 : 1.0
                    3000 = True              neg : pos    =     11.0 : 1.0
               insulting = True              neg : pos    =     10.6 : 1.0
            manipulation = True              pos : neg    =     10.4 : 1.0
             fascination = True              pos : neg    =     10.4 : 1.0
                    slip = True              pos : neg    =     10.4 : 1.0
               ludicrous = True              neg : pos    =     10.1 : 1.0
 
In [49]: maxent_bg_classifier = nltk.MaxentClassifier.train(train_set, "megam")
Scanning file...1900 train, 0 dev, 0 test, reading...done
Warning: there only appear to be two classes, but we're
         optimizing with BFGS...using binary optimization
         with CG would be much faster
optimizing with lambda = 0
it 1   dw 1.255e-01 pp 3.91521e-01 er 0.15368
it 2   dw 1.866e-02 pp 3.82995e-01 er 0.14684
it 3   dw 3.912e-02 pp 3.46794e-01 er 0.13368
it 4   dw 5.916e-02 pp 3.26135e-01 er 0.13684
it 5   dw 2.929e-02 pp 3.23077e-01 er 0.13474
it 6   dw 2.552e-02 pp 3.15917e-01 er 0.13526
it 7   dw 2.765e-02 pp 3.14291e-01 er 0.13526
it 8   dw 8.298e-02 pp 2.35472e-01 er 0.07263
it 9   dw 1.357e-01 pp 2.20265e-01 er 0.08684
it 10  dw 6.186e-02 pp 2.03567e-01 er 0.07158
it 11  dw 2.057e-01 pp 1.69049e-01 er 0.05316
it 12  dw 1.319e-01 pp 1.61575e-01 er 0.05263
it 13  dw 8.872e-02 pp 1.59902e-01 er 0.05526
it 14  dw 5.907e-02 pp 1.59254e-01 er 0.05632
it 15  dw 4.443e-02 pp 1.54540e-01 er 0.05368
it 16  dw 3.677e-01 pp 1.48646e-01 er 0.03842
it 17  dw 2.500e-01 pp 1.47460e-01 er 0.03947
it 18  dw 9.548e-01 pp 1.44516e-01 er 0.03842
it 19  dw 3.466e-01 pp 1.42935e-01 er 0.04211
it 20  dw 1.872e-02 pp 1.42847e-01 er 0.04263
it 21  dw 1.452e-01 pp 1.28344e-01 er 0.02737
it 22  dw 1.248e-01 pp 1.24428e-01 er 0.02526
it 23  dw 4.071e-01 pp 1.18201e-01 er 0.02211
it 24  dw 3.979e-01 pp 1.08352e-01 er 0.01526
it 25  dw 1.871e-01 pp 1.08345e-01 er 0.01632
it 26  dw 8.477e-02 pp 1.07972e-01 er 0.01579
it 27  dw 0.000e+00 pp 1.07972e-01 er 0.01579
-------------------------
.......
-------------------------
it 12  dw 4.018e-02 pp 1.73432e-05 er 0.00000
it 13  dw 3.898e-02 pp 1.62334e-05 er 0.00000
it 14  dw 9.937e-02 pp 1.52647e-05 er 0.00000
it 15  dw 5.558e-02 pp 1.31892e-05 er 0.00000
it 16  dw 5.646e-02 pp 1.30511e-05 er 0.00000
it 17  dw 1.100e-01 pp 1.23914e-05 er 0.00000
it 18  dw 4.541e-02 pp 1.17382e-05 er 0.00000
it 19  dw 1.316e-01 pp 1.04446e-05 er 0.00000
it 20  dw 1.919e-01 pp 9.04729e-06 er 0.00000
it 21  dw 1.039e-02 pp 9.02896e-06 er 0.00000
it 22  dw 2.843e-01 pp 8.92068e-06 er 0.00000
it 23  dw 1.100e-01 pp 8.54637e-06 er 0.00000
it 24  dw 2.199e-01 pp 8.36371e-06 er 0.00000
it 25  dw 2.428e-02 pp 8.24041e-06 er 0.00000
it 26  dw 0.000e+00 pp 8.24041e-06 er 0.00000
 
In [50]: print nltk.classify.accuracy(maxent_bg_classifier, test_set)
0.89
 
In [51]: maxent_bg_classifier.show_most_informative_features(10)
  -4.151 get==True and label is u'neg'
  -2.961 get==True and label is u'pos'
  -2.596 all==True and label is u'neg'
  -2.523 out==True and label is u'pos'
  -2.400 years==True and label is u'neg'
  -2.397 its==True and label is u'pos'
  -2.340 them==True and label is u'neg'
  -2.327 out==True and label is u'neg'
  -2.324 ,==True and label is u'neg'
  -2.259 (==True and label is u'neg'

Now we can test bigram features in the classifier model:

In [52]: from nltk import ngrams
 
In [53]: def bag_of_ngrams(words, n=2):
   ....:     ngs = [ng for ng in iter(ngrams(words, n))]
   ....:     return bag_of_words(ngs)
   ....: 
 
In [54]: data_sets = [(bag_of_ngrams(d), c) for (d, c) in documents]
 
In [55]: train_set, test_set = data_sets[100:], data_sets[:100]
 
In [56]: nb_bi_classifier = nltk.NaiveBayesClassifier.train(train_set)
 
In [57]: print nltk.classify.accuracy(nb_bi_classifier, test_set)
0.83
 
In [59]: nb_bi_classifier.show_most_informative_features(10)
Most Informative Features
    (u'is', u'terrific') = True              pos : neg    =     17.1 : 1.0
      (u'not', u'funny') = True              neg : pos    =     16.9 : 1.0
     (u'boring', u'and') = True              neg : pos    =     13.6 : 1.0
     (u'and', u'boring') = True              neg : pos    =     13.6 : 1.0
        (u'our', u'own') = True              pos : neg    =     13.1 : 1.0
        (u'why', u'did') = True              neg : pos    =     12.9 : 1.0
    (u'enjoyable', u',') = True              pos : neg    =     12.4 : 1.0
     (u'works', u'well') = True              pos : neg    =     12.4 : 1.0
      (u'.', u'cameron') = True              pos : neg    =     12.4 : 1.0
     (u'well', u'worth') = True              pos : neg    =     12.4 : 1.0
 
In [60]: maxent_bi_classifier = nltk.MaxentClassifier.train(train_set, "megam")
Scanning file...1900 train, 0 dev, 0 test, reading...done
Warning: there only appear to be two classes, but we're
         optimizing with BFGS...using binary optimization
         with CG would be much faster
optimizing with lambda = 0
it 1   dw 6.728e-02 pp 4.68710e-01 er 0.25895
it 2   dw 6.127e-02 pp 3.37578e-01 er 0.13789
it 3   dw 1.712e-02 pp 2.94106e-01 er 0.11737
it 4   dw 2.538e-02 pp 2.68465e-01 er 0.11526
it 5   dw 3.965e-02 pp 2.46789e-01 er 0.10684
it 6   dw 1.240e-01 pp 1.98149e-01 er 0.07947
it 7   dw 1.640e-02 pp 1.62956e-01 er 0.05895
it 8   dw 1.320e-01 pp 1.07163e-01 er 0.02789
it 9   dw 1.233e-01 pp 8.79358e-02 er 0.01368
it 10  dw 2.815e-01 pp 5.51191e-02 er 0.00737
it 11  dw 1.127e-01 pp 3.91500e-02 er 0.00421
it 12  dw 3.463e-01 pp 2.95846e-02 er 0.00211
it 13  dw 1.114e-01 pp 2.90701e-02 er 0.00053
it 14  dw 1.453e-01 pp 1.95422e-02 er 0.00053
it 15  dw 1.976e-01 pp 1.54022e-02 er 0.00105
......
it 44  dw 2.544e-01 pp 9.05755e-15 er 0.00000
it 45  dw 4.974e-02 pp 9.02763e-15 er 0.00000
it 46  dw 9.311e-07 pp 9.02483e-15 er 0.00000
it 47  dw 0.000e+00 pp 9.02483e-15 er 0.00000
 
In [61]: print nltk.classify.accuracy(maxent_bi_classifier, test_set)
0.9
 
In [62]: maxent_bi_classifier.show_most_informative_features(10)
 -14.152 (u'a', u'man')==True and label is u'neg'
  12.821 (u'"', u'the')==True and label is u'neg'
 -12.399 (u'of', u'the')==True and label is u'neg'
 -11.881 (u'a', u'man')==True and label is u'pos'
  10.020 (u',', u'which')==True and label is u'neg'
   8.418 (u'and', u'that')==True and label is u'neg'
  -8.022 (u'and', u'the')==True and label is u'neg'
  -7.191 (u'on', u'a')==True and label is u'neg'
  -7.185 (u'on', u'a')==True and label is u'pos'
   7.107 (u',', u'which')==True and label is u'pos'

And again, we can use the word features and the ngram (bigram) features together:

In [63]: def bag_of_all(words, n=2):
   ....:     all_features = bag_of_words(words)
   ....:     ngram_features = bag_of_ngrams(words, n=n)
   ....:     all_features.update(ngram_features)   
   ....:     return all_features
   ....: 
 
In [64]: data_sets = [(bag_of_all(d), c) for (d, c) in documents]
 
In [65]: train_set, test_set = data_sets[100:], data_sets[:100]
 
In [66]: nb_all_classifier = nltk.NaiveBayesClassifier.train(train_set)
 
In [67]: print nltk.classify.accuracy(nb_all_classifier, test_set)
0.83
 
In [68]: nb_all_classifier.show_most_informative_features(10)
Most Informative Features
    (u'is', u'terrific') = True              pos : neg    =     17.1 : 1.0
      (u'not', u'funny') = True              neg : pos    =     16.9 : 1.0
             outstanding = True              pos : neg    =     13.9 : 1.0
     (u'boring', u'and') = True              neg : pos    =     13.6 : 1.0
     (u'and', u'boring') = True              neg : pos    =     13.6 : 1.0
                  avoids = True              pos : neg    =     13.1 : 1.0
        (u'our', u'own') = True              pos : neg    =     13.1 : 1.0
        (u'why', u'did') = True              neg : pos    =     12.9 : 1.0
    (u'enjoyable', u',') = True              pos : neg    =     12.4 : 1.0
     (u'works', u'well') = True              pos : neg    =     12.4 : 1.0
 
 
In [71]: maxent_all_classifier = nltk.MaxentClassifier.train(train_set, "megam") 
Scanning file...1900 train, 0 dev, 0 test, reading...done
Warning: there only appear to be two classes, but we're
         optimizing with BFGS...using binary optimization
         with CG would be much faster
optimizing with lambda = 0
it 1   dw 8.715e-02 pp 3.82841e-01 er 0.17684
it 2   dw 2.846e-02 pp 2.97371e-01 er 0.11632
it 3   dw 1.299e-02 pp 2.79797e-01 er 0.11421
it 4   dw 2.456e-02 pp 2.64735e-01 er 0.11053
it 5   dw 4.200e-02 pp 2.47440e-01 er 0.10789
it 6   dw 1.417e-01 pp 2.04814e-01 er 0.08737
it 7   dw 1.330e-02 pp 2.03060e-01 er 0.08737
it 8   dw 3.177e-02 pp 1.92654e-01 er 0.08421
it 9   dw 5.613e-02 pp 1.38725e-01 er 0.05789
it 10  dw 1.339e-01 pp 7.92844e-02 er 0.02368
it 11  dw 1.734e-01 pp 6.71341e-02 er 0.01316
it 12  dw 1.313e-01 pp 6.55828e-02 er 0.01263
it 13  dw 2.036e-01 pp 6.38482e-02 er 0.01421
it 14  dw 1.230e-02 pp 5.96907e-02 er 0.01368
it 15  dw 9.719e-02 pp 4.03190e-02 er 0.00842
it 16  dw 4.004e-02 pp 3.98276e-02 er 0.00737
it 17  dw 1.598e-01 pp 2.68187e-02 er 0.00316
it 18  dw 1.900e-01 pp 2.57116e-02 er 0.00211
it 19  dw 4.355e-01 pp 2.14572e-02 er 0.00263
it 20  dw 1.029e-01 pp 1.91407e-02 er 0.00211
it 21  dw 1.347e-01 pp 1.46859e-02 er 0.00105
it 22  dw 2.231e-01 pp 1.26997e-02 er 0.00053
it 23  dw 2.942e-01 pp 1.20663e-02 er 0.00000
it 24  dw 3.836e-01 pp 1.14817e-02 er 0.00000
it 25  dw 4.213e-01 pp 9.89037e-03 er 0.00000
it 26  dw 1.875e-01 pp 7.06744e-03 er 0.00000
it 27  dw 2.865e-01 pp 5.61255e-03 er 0.00000
it 28  dw 5.903e-01 pp 4.94776e-03 er 0.00000
it 29  dw 0.000e+00 pp 4.94776e-03 er 0.00000
-------------------------
.......
-------------------------
.......
it 8   dw 2.024e-01 pp 8.14623e-10 er 0.00000
it 9   dw 9.264e-02 pp 7.87683e-10 er 0.00000
it 10  dw 5.845e-02 pp 7.38397e-10 er 0.00000
it 11  dw 2.418e-01 pp 6.34000e-10 er 0.00000
it 12  dw 5.081e-01 pp 6.19061e-10 er 0.00000
it 13  dw 0.000e+00 pp 6.19061e-10 er 0.00000
 
In [72]: print nltk.classify.accuracy(maxent_all_classifier, test_set)
0.91
 
In [73]: maxent_all_classifier.show_most_informative_features(10)
  11.220 to==True and label is u'neg'
   3.415 ,==True and label is u'neg'
   3.360 '==True and label is u'neg'
   3.310 this==True and label is u'neg'
   3.243 a==True and label is u'neg'
  -3.218 (u'does', u'a')==True and label is u'neg'
   3.164 have==True and label is u'neg'
  -3.024 what==True and label is u'neg'
   2.966 more==True and label is u'neg'
  -2.891 (u',', u'which')==True and label is u'neg'

This gives the best sentiment analysis performance in this case study, although some problems remain: for example, punctuation and stop words were not discarded (a possible fix is sketched below). As this is just a case study, we encourage you to experiment with more data, more features, or better machine learning models such as deep learning methods.
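
For example, a bag-of-words feature extractor that drops English stop words and punctuation could look like the sketch below (bag_of_words_clean is a hypothetical name; it assumes the NLTK stopwords corpus has been downloaded):

import string
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
PUNCT = set(string.punctuation)

def bag_of_words_clean(words):
    # Keep only tokens that are neither stop words nor made entirely of punctuation.
    return dict((word, True) for word in words
                if word.lower() not in STOPWORDS
                and not all(ch in PUNCT for ch in word))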

Posted by TextMiner
