Dive Into NLTK, Part IX: From Text Classification to Sentiment Analysis
This is the ninth article in the series “Dive Into NLTK”. Here is an index of all the articles in the series that have been published to date:
Part I: Getting Started with NLTK
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
Part IX: From Text Classification to Sentiment Analysis
Part X: Play With Word2Vec Models based on NLTK Corpus
According to Wikipedia, sentiment analysis is defined as follows:
Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. Sentiment analysis is widely applied to reviews and social media for a variety of applications, ranging from marketing to customer service.
Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader).
Sentiment analysis can be seen as one kind of text classification task. Based on the movie review corpus included with NLTK, we can train a basic text classification model for sentiment analysis:
Python 2.7.6 (default, Jun 3 2014, 07:43:23) Type "copyright", "credits" or "license" for more information. IPython 3.1.0 -- An enhanced Interactive Python. ? -> Introduction and overview of IPython's features. %quickref -> Quick reference. help -> Python's own help system. object? -> Details about 'object', use 'object??' for extra details. In [1]: import nltk In [2]: from nltk.corpus import movie_reviews In [3]: from random import shuffle In [4]: documents = [(list(movie_reviews.words(fileid)), category) for category in movie_reviews.categories() for fileid in movie_reviews.fileids(category)] In [5]: shuffle(documents) In [6]: print documents[0] ([u'the', u'general', u"'", u's', u'daughter', u'will', u'probably', u'be', u'the', u'cleverest', u'stupid', u'film', u'we', u"'", u'll', u'see', u'this', u'year', u'--', u'or', u'perhaps', u'the', u'stupidest', u'clever', u'film', u'.', u'it', u"'", u's', u'confusing', u'to', u'a', u'critic', u'when', u'so', u'much', u'knuckleheaded', u'plotting', u'and', u'ostentatious', u'direction', u'shares', u'the', u'screen', u'with', u'so', u'much', u'snappy', u'dialogue', u'and', u'crisp', u'character', u'interaction', u'.', u'that', u',', u'however', u',', u'is', u'what', u'happens', u'when', u'legendary', u'screenwriter', u'william', u'goldman', u'takes', u'a', u'pass', u'at', u'an', u'otherwise', u'brutally', u'predictable', u'conspiracy', u'thriller', u'.', u'the', u'punched', u'-', u'up', u'punch', u'lines', u'are', u'ever', u'on', u'the', u'verge', u'of', u'convincing', u'you', u'the', u'general', u"'", u's', u'daughter', u'has', u'a', u'brain', u'in', u'its', u'head', u',', u'even', u'as', u'the', u'remaining', u'75', u'%', u'of', u'the', u'narrative', u'punches', u'you', u'in', u'the', u'face', u'with', u'its', u'lack', u'of', u'common', u'sense', u'.', u'our', u'hero', u'is', u'warrant', u'officer', u'paul', u'brenner', u',', u'a', u'brash', u'investigator', u'for', u'the', u'u', u'.', u's', u'.', u'army', u"'", u's', u'criminal', u'investigation', u'division', u'.', u'his', u'latest', u'case', u'is', u'the', u'murder', u'of', u'captain', u'elisabeth', u'campbell', u'(', u'leslie', u'stefanson', u')', u'at', u'a', u'georgia', u'base', u',', u'the', u'victim', u'found', u'tied', u'to', u'the', u'ground', u'after', u'an', u'apparent', u'sexual', u'assault', u'and', u'strangulation', u'.', u'complicating', u'the', u'case', u'is', u'the', u'fact', u'that', u'capt', u'.', u'campbell', u'is', u'the', u'daughter', u'of', u'general', u'joe', u'campbell', u'(', u'james', u'cromwell', u')', u',', u'a', u'war', u'hero', u'and', u'potential', u'vice', u'-', u'presidential', u'nominee', u'.', ...... 
u'general', u'campbell', u'wants', u'to', u'keep', u'the', u'case', u'out', u'of', u'the', u'press', u',', u'which', u'gives', u'brenner', u'only', u'the', u'36', u'hours', u'before', u'the', u'fbi', u'steps', u'in', u'.', u'teamed', u'with', u'rape', u'investigator', u'sarah', u'sunhill', u'(', u'madeleine', u'stowe', u')', u'--', u'who', u',', u'coincidentally', u'enough', u',', u'once', u'had', u'a', u'romantic', u'relationship', u'with', u'brenner', u'--', u'brenner', u'begins', u'uncovering', \ u'evidence', u'out', u'of', u'the', u'corner', u'of', u'his', u'eye', u')', u'.', u'by', u'the', u'time', u'the', u'general', u"'", u's', u'daughter', u'wanders', u'towards', u'its', u'over', u'-', u'wrought', u',', u'psycho', u'-', u'in', u'-', u'the', u'-', u'rain', u'finale', u',', u'west', u"'", u's', u'heavy', u'hand', u'has', u'obliterated', u'most', u'of', u'what', u'made', u'the', u'film', u'occasionally', u'fun', u'.', u'it', u"'", u's', u'silly', u'and', u'pretentious', u'film', u'-', u'making', u',', u'but', u'at', u'least', u'it', u'provides', u'a', u'giggle', u'or', u'five', u'.', u'goldman', u'should', u'tear', u'the', u'15', u'decent', u'pages', u'out', u'of', u'this', u'script', u'and', u'turn', u'them', u'into', u'a', u'stand', u'-', u'up', u'routine', u'.'], u'neg') # The total number of movie reviews documents in nltk is 2000 In [7]: len(documents) Out[7]: 2000 # Construct a list of the 2,000 most frequent words in the overall corpus In [8]: all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words()) In [9]: word_features = all_words.keys()[:2000] # Define a feature extractor that simply checks whether each of these words is present in a given document. In [10]: def document_features(document): ....: document_words = set(document) ....: features = {} ....: for word in word_features: ....: features['contains(%s)' % word] = (word in document_words) ....: return features ....: In [11]: print document_features(movie_reviews.words('pos/cv957_8737.txt')) {u'contains(waste)': False, u'contains(lot)': False, u'contains(*)': True, u'contains(black)': False, u'contains(rated)': False, u'contains(potential)': False, u'contains(m)': False, u'contains(understand)': False, u'contains(drug)': True, u'contains(case)': False, u'contains(created)': False, u'contains(kiss)': False, u'contains(needed)': False, u'contains(c)': False, u'contains(about)': True, u'contains(toy)': False, u'contains(longer)': False, u'contains(ready)': False, u'contains(certainly)': False, ...... 
u'contains(good)': False, u'contains(live)': False, u'contains(appropriate)': False, u'contains(towards)': False, u'contains(smile)': False, u'contains(cross)': False} # Generate the feature sets for the movie review documents one by one In [12]: featuresets = [(document_features(d), c) for (d, c) in documents] # Define the train set (1900 documents) and test set (100 documents) In [13]: train_set, test_set = featuresets[100:], featuresets[:100] # Train a naive bayes classifier with train set by nltk In [14]: classifier = nltk.NaiveBayesClassifier.train(train_set) # Get the accuracy of the naive bayes classifier with test set In [15]: print nltk.classify.accuracy(classifier, test_set) 0.81 # Debug info: show top n most informative features In [16]: classifier.show_most_informative_features(10) Most Informative Features contains(outstanding) = True pos : neg = 13.3 : 1.0 contains(mulan) = True pos : neg = 8.8 : 1.0 contains(seagal) = True neg : pos = 8.0 : 1.0 contains(wonderfully) = True pos : neg = 6.5 : 1.0 contains(damon) = True pos : neg = 6.2 : 1.0 contains(awful) = True neg : pos = 6.0 : 1.0 contains(wasted) = True neg : pos = 5.9 : 1.0 contains(lame) = True neg : pos = 5.8 : 1.0 contains(flynt) = True pos : neg = 5.5 : 1.0 contains(poorly) = True neg : pos = 5.1 : 1.0 |
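For readability, here is a condensed, self-contained version of the session above. It follows the same recipe (a fixed vocabulary of the 2,000 most frequent words, a 1,900/100 train/test split, and an NLTK naive Bayes classifier). Note that `most_common(2000)` is a modernization of the original `all_words.keys()[:2000]` (older NLTK FreqDist objects returned keys sorted by frequency), and the accuracy will vary slightly from run to run because the documents are shuffled randomly:

```python
# Condensed sketch of the word-presence pipeline above.
# Requires: nltk.download('movie_reviews')
import nltk
from nltk.corpus import movie_reviews
from random import shuffle

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
shuffle(documents)

# The 2,000 most frequent words in the corpus as a fixed feature vocabulary
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(document):
    """Binary 'contains(word)' features over the fixed vocabulary."""
    document_words = set(document)
    return dict(('contains(%s)' % word, word in document_words)
                for word in word_features)

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)
```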
Based on the same top-2000 word features, we can also train a maximum entropy classifier with NLTK and MEGAM:
In [17]: maxent_classifier = nltk.MaxentClassifier.train(train_set, "megam") [Found megam: /usr/local/bin/megam] Scanning file...1900 train, 0 dev, 0 test, reading...done Warning: there only appear to be two classes, but we're optimizing with BFGS...using binary optimization with CG would be much faster optimizing with lambda = 0 it 1 dw 2.415e-03 pp 6.85543e-01 er 0.49895 it 2 dw 1.905e-03 pp 6.72937e-01 er 0.48895 it 3 dw 7.755e-03 pp 6.53779e-01 er 0.19526 it 4 dw 1.583e-02 pp 6.30863e-01 er 0.33526 it 5 dw 4.763e-02 pp 5.89126e-01 er 0.33895 it 6 dw 8.723e-02 pp 5.09921e-01 er 0.21211 it 7 dw 2.223e-01 pp 4.13823e-01 er 0.17000 it 8 dw 2.183e-01 pp 3.81889e-01 er 0.16526 it 9 dw 3.448e-01 pp 3.79054e-01 er 0.17421 it 10 dw 7.749e-02 pp 3.73549e-01 er 0.17105 it 11 dw 1.413e-01 pp 3.61806e-01 er 0.15842 it 12 dw 1.380e-01 pp 3.61716e-01 er 0.16000 it 13 dw 5.230e-02 pp 3.59953e-01 er 0.16053 it 14 dw 1.092e-01 pp 3.58713e-01 er 0.16211 it 15 dw 1.252e-01 pp 3.58669e-01 er 0.16000 it 16 dw 1.370e-01 pp 3.57027e-01 er 0.16105 it 17 dw 2.213e-01 pp 3.56230e-01 er 0.15684 it 18 dw 1.397e-01 pp 3.51368e-01 er 0.15579 it 19 dw 7.718e-01 pp 3.38156e-01 er 0.14947 it 20 dw 6.426e-02 pp 3.36342e-01 er 0.14947 it 21 dw 1.531e-01 pp 3.33402e-01 er 0.15053 it 22 dw 1.047e-01 pp 3.33287e-01 er 0.14895 it 23 dw 1.379e-01 pp 3.30814e-01 er 0.14895 it 24 dw 1.480e+00 pp 3.02938e-01 er 0.12842 it 25 dw 0.000e+00 pp 3.02938e-01 er 0.12842 ------------------------- ...... ...... ------------------------- it 1 dw 1.981e-05 pp 8.59536e-02 er 0.00684 it 2 dw 4.179e-05 pp 8.58979e-02 er 0.00684 it 3 dw 3.792e-04 pp 8.56536e-02 er 0.00684 it 4 dw 1.076e-03 pp 8.52961e-02 er 0.00737 it 5 dw 2.007e-03 pp 8.49459e-02 er 0.00737 it 6 dw 4.055e-03 pp 8.42942e-02 er 0.00737 it 7 dw 2.664e-02 pp 8.16976e-02 er 0.00526 it 8 dw 1.888e-02 pp 8.12042e-02 er 0.00316 it 9 dw 5.093e-02 pp 8.08672e-02 er 0.00316 it 10 dw 3.968e-03 pp 8.08624e-02 er 0.00316 it 11 dw 0.000e+00 pp 8.08624e-02 er 0.00316 In [18]: print nltk.classify.accuracy(maxent_classifier, test_set) 0.89 In [19]: maxent_classifier.show_most_informative_features(10) -1.843 contains(waste)==False and label is u'neg' -1.006 contains(boring)==False and label is u'neg' -0.979 contains(worst)==False and label is u'neg' -0.973 contains(bad)==False and label is u'neg' -0.953 contains(unfortunately)==False and label is u'neg' -0.864 contains(lame)==False and label is u'neg' -0.850 contains(attempt)==False and label is u'neg' -0.833 contains(supposed)==False and label is u'neg' -0.815 contains(seen)==True and label is u'neg' -0.783 contains(laughable)==False and label is u'neg' |
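The “megam” algorithm requires the external MEGAM binary (see Part VIII). If MEGAM is not installed, a reasonable fallback, which is my own suggestion rather than what the session above used, is to train with one of NLTK’s built-in optimizers instead:

```python
# Hypothetical fallback: use megam when available, otherwise NLTK's built-in IIS.
try:
    nltk.config_megam()  # raises LookupError if the megam binary cannot be found
    maxent_classifier = nltk.MaxentClassifier.train(train_set, "megam")
except LookupError:
    # IIS is much slower than megam; cap the iterations for a quick experiment.
    maxent_classifier = nltk.MaxentClassifier.train(train_set, "iis", max_iter=10)

print(nltk.classify.accuracy(maxent_classifier, test_set))
maxent_classifier.show_most_informative_features(10)
```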
It seems that the maxent classifier gives the better result on the test set (0.89 vs. 0.81 for naive Bayes). Let’s classify a test sentence with both the naive Bayes classifier and the maxent classifier:
In [22]: test_text = "I love this movie, very interesting" In [23]: test_set = document_features(test_text.split()) In [24]: test_set Out[24]: {u'contains(waste)': False, u'contains(lot)': False, u'contains(*)': False, u'contains(black)': False, u'contains(rated)': False, u'contains(potential)': False, u'contains(m)': False, u'contains(understand)': False, u'contains(drug)': False, u'contains(case)': False, u'contains(created)': False, u'contains(kiss)': False, u'contains(needed)': False, ...... u'contains(happens)': False, u'contains(suddenly)': False, u'contains(almost)': False, u'contains(evil)': False, u'contains(building)': False, u'contains(michael)': False, ...} # Naivebayes classifier get the wrong result In [25]: print classifier.classify(test_set) neg # Maxent Classifier done right In [26]: print maxent_classifier.classify(test_set) pos # Let's see the probability result In [27]: prob_result = classifier.prob_classify(test_set) In [28]: prob_result Out[28]: <ProbDist with 2 samples> In [29]: prob_result.max() Out[29]: u'neg' In [30]: prob_result.prob("neg") Out[30]: 0.99999917093621 In [31]: prob_result.prob("pos") Out[31]: 8.29063793272753e-07 # Maxent classifier probability result In [32]: print maxent_classifier.classify(test_set) pos In [33]: prob_result = maxent_classifier.prob_classify(test_set) In [33]: prob_result.prob("pos") Out[33]: 0.67570114045832497 In [34]: prob_result.prob("neg") Out[34]: 0.32429885954167498 |
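The same steps can be wrapped in a small helper so that any new text can be scored with both models. This helper is not part of the original session; it simply repeats the feature extraction and `prob_classify` calls shown above:

```python
# Hypothetical convenience wrapper around the two trained classifiers.
def analyze_sentiment(text, classifiers):
    """Return the predicted label and class probabilities for each classifier."""
    features = document_features(text.split())
    results = {}
    for name, clf in classifiers.items():
        dist = clf.prob_classify(features)
        results[name] = (dist.max(), dist.prob('pos'), dist.prob('neg'))
    return results

print(analyze_sentiment("I love this movie, very interesting",
                        {'naive_bayes': classifier, 'maxent': maxent_classifier}))
```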
Until now, we have only used the top-n word features. For this sentiment analysis problem, adding more features may give a better result, so we redesign the word features:
In [40]: def bag_of_words(words): ....: return dict([(word, True) for word in words]) ....: In [43]: data_sets = [(bag_of_words(d), c) for (d, c) in documents] In [44]: len(data_sets) Out[44]: 2000 In [45]: train_set, test_set = data_sets[100:], data_sets[:100] In [46]: bayes_classifier = nltk.NaiveBayesClassifier.train(train_set) In [47]: print nltk.classify.accuracy(bayes_classifier, test_set) 0.8 In [48]: bayes_classifier.show_most_informative_features(10) Most Informative Features outstanding = True pos : neg = 13.9 : 1.0 avoids = True pos : neg = 13.1 : 1.0 astounding = True pos : neg = 11.7 : 1.0 insipid = True neg : pos = 11.0 : 1.0 3000 = True neg : pos = 11.0 : 1.0 insulting = True neg : pos = 10.6 : 1.0 manipulation = True pos : neg = 10.4 : 1.0 fascination = True pos : neg = 10.4 : 1.0 slip = True pos : neg = 10.4 : 1.0 ludicrous = True neg : pos = 10.1 : 1.0 In [49]: maxent_bg_classifier = nltk.MaxentClassifier.train(train_set, "megam") Scanning file...1900 train, 0 dev, 0 test, reading...done Warning: there only appear to be two classes, but we're optimizing with BFGS...using binary optimization with CG would be much faster optimizing with lambda = 0 it 1 dw 1.255e-01 pp 3.91521e-01 er 0.15368 it 2 dw 1.866e-02 pp 3.82995e-01 er 0.14684 it 3 dw 3.912e-02 pp 3.46794e-01 er 0.13368 it 4 dw 5.916e-02 pp 3.26135e-01 er 0.13684 it 5 dw 2.929e-02 pp 3.23077e-01 er 0.13474 it 6 dw 2.552e-02 pp 3.15917e-01 er 0.13526 it 7 dw 2.765e-02 pp 3.14291e-01 er 0.13526 it 8 dw 8.298e-02 pp 2.35472e-01 er 0.07263 it 9 dw 1.357e-01 pp 2.20265e-01 er 0.08684 it 10 dw 6.186e-02 pp 2.03567e-01 er 0.07158 it 11 dw 2.057e-01 pp 1.69049e-01 er 0.05316 it 12 dw 1.319e-01 pp 1.61575e-01 er 0.05263 it 13 dw 8.872e-02 pp 1.59902e-01 er 0.05526 it 14 dw 5.907e-02 pp 1.59254e-01 er 0.05632 it 15 dw 4.443e-02 pp 1.54540e-01 er 0.05368 it 16 dw 3.677e-01 pp 1.48646e-01 er 0.03842 it 17 dw 2.500e-01 pp 1.47460e-01 er 0.03947 it 18 dw 9.548e-01 pp 1.44516e-01 er 0.03842 it 19 dw 3.466e-01 pp 1.42935e-01 er 0.04211 it 20 dw 1.872e-02 pp 1.42847e-01 er 0.04263 it 21 dw 1.452e-01 pp 1.28344e-01 er 0.02737 it 22 dw 1.248e-01 pp 1.24428e-01 er 0.02526 it 23 dw 4.071e-01 pp 1.18201e-01 er 0.02211 it 24 dw 3.979e-01 pp 1.08352e-01 er 0.01526 it 25 dw 1.871e-01 pp 1.08345e-01 er 0.01632 it 26 dw 8.477e-02 pp 1.07972e-01 er 0.01579 it 27 dw 0.000e+00 pp 1.07972e-01 er 0.01579 ------------------------- ....... 
------------------------- it 12 dw 4.018e-02 pp 1.73432e-05 er 0.00000 it 13 dw 3.898e-02 pp 1.62334e-05 er 0.00000 it 14 dw 9.937e-02 pp 1.52647e-05 er 0.00000 it 15 dw 5.558e-02 pp 1.31892e-05 er 0.00000 it 16 dw 5.646e-02 pp 1.30511e-05 er 0.00000 it 17 dw 1.100e-01 pp 1.23914e-05 er 0.00000 it 18 dw 4.541e-02 pp 1.17382e-05 er 0.00000 it 19 dw 1.316e-01 pp 1.04446e-05 er 0.00000 it 20 dw 1.919e-01 pp 9.04729e-06 er 0.00000 it 21 dw 1.039e-02 pp 9.02896e-06 er 0.00000 it 22 dw 2.843e-01 pp 8.92068e-06 er 0.00000 it 23 dw 1.100e-01 pp 8.54637e-06 er 0.00000 it 24 dw 2.199e-01 pp 8.36371e-06 er 0.00000 it 25 dw 2.428e-02 pp 8.24041e-06 er 0.00000 it 26 dw 0.000e+00 pp 8.24041e-06 er 0.00000 In [50]: print nltk.classify.accuracy(maxent_bg_classifier, test_set) 0.89 In [51]: maxent_bg_classifier.show_most_informative_features(10) -4.151 get==True and label is u'neg' -2.961 get==True and label is u'pos' -2.596 all==True and label is u'neg' -2.523 out==True and label is u'pos' -2.400 years==True and label is u'neg' -2.397 its==True and label is u'pos' -2.340 them==True and label is u'neg' -2.327 out==True and label is u'neg' -2.324 ,==True and label is u'neg' -2.259 (==True and label is u'neg' |
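To make the new feature representation concrete, here is the bag-of-words extractor from the session as a standalone function together with a tiny usage example (the example sentence is mine, not from the corpus):

```python
def bag_of_words(words):
    """Every token in the document becomes a binary 'present' feature."""
    return dict((word, True) for word in words)

print(bag_of_words("a very funny movie".split()))
# -> {'a': True, 'very': True, 'funny': True, 'movie': True}  (order may vary)
```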
Now we can test bigram features in the classifier models:
In [52]: from nltk import ngrams In [53]: def bag_of_ngrams(words, n=2): ....: ngs = [ng for ng in iter(ngrams(words, n))] ....: return bag_of_words(ngs) ....: In [54]: data_sets = [(bag_of_ngrams(d), c) for (d, c) in documents] In [55]: train_set, test_set = data_sets[100:], data_sets[:100] In [56]: nb_bi_classifier = nltk.NaiveBayesClassifier.train(train_set) In [57]: print nltk.classify.accuracy(nb_bi_classifier, test_set) 0.83 In [59]: nb_bi_classifier.show_most_informative_features(10) Most Informative Features (u'is', u'terrific') = True pos : neg = 17.1 : 1.0 (u'not', u'funny') = True neg : pos = 16.9 : 1.0 (u'boring', u'and') = True neg : pos = 13.6 : 1.0 (u'and', u'boring') = True neg : pos = 13.6 : 1.0 (u'our', u'own') = True pos : neg = 13.1 : 1.0 (u'why', u'did') = True neg : pos = 12.9 : 1.0 (u'enjoyable', u',') = True pos : neg = 12.4 : 1.0 (u'works', u'well') = True pos : neg = 12.4 : 1.0 (u'.', u'cameron') = True pos : neg = 12.4 : 1.0 (u'well', u'worth') = True pos : neg = 12.4 : 1.0 In [60]: maxent_bi_classifier = nltk.MaxentClassifier.train(train_set, "megam") Scanning file...1900 train, 0 dev, 0 test, reading...done Warning: there only appear to be two classes, but we're optimizing with BFGS...using binary optimization with CG would be much faster optimizing with lambda = 0 it 1 dw 6.728e-02 pp 4.68710e-01 er 0.25895 it 2 dw 6.127e-02 pp 3.37578e-01 er 0.13789 it 3 dw 1.712e-02 pp 2.94106e-01 er 0.11737 it 4 dw 2.538e-02 pp 2.68465e-01 er 0.11526 it 5 dw 3.965e-02 pp 2.46789e-01 er 0.10684 it 6 dw 1.240e-01 pp 1.98149e-01 er 0.07947 it 7 dw 1.640e-02 pp 1.62956e-01 er 0.05895 it 8 dw 1.320e-01 pp 1.07163e-01 er 0.02789 it 9 dw 1.233e-01 pp 8.79358e-02 er 0.01368 it 10 dw 2.815e-01 pp 5.51191e-02 er 0.00737 it 11 dw 1.127e-01 pp 3.91500e-02 er 0.00421 it 12 dw 3.463e-01 pp 2.95846e-02 er 0.00211 it 13 dw 1.114e-01 pp 2.90701e-02 er 0.00053 it 14 dw 1.453e-01 pp 1.95422e-02 er 0.00053 it 15 dw 1.976e-01 pp 1.54022e-02 er 0.00105 ...... it 44 dw 2.544e-01 pp 9.05755e-15 er 0.00000 it 45 dw 4.974e-02 pp 9.02763e-15 er 0.00000 it 46 dw 9.311e-07 pp 9.02483e-15 er 0.00000 it 47 dw 0.000e+00 pp 9.02483e-15 er 0.00000 In [61]: print nltk.classify.accuracy(maxent_bi_classifier, test_set) 0.9 In [62]: maxent_bi_classifier.show_most_informative_features(10) -14.152 (u'a', u'man')==True and label is u'neg' 12.821 (u'"', u'the')==True and label is u'neg' -12.399 (u'of', u'the')==True and label is u'neg' -11.881 (u'a', u'man')==True and label is u'pos' 10.020 (u',', u'which')==True and label is u'neg' 8.418 (u'and', u'that')==True and label is u'neg' -8.022 (u'and', u'the')==True and label is u'neg' -7.191 (u'on', u'a')==True and label is u'neg' -7.185 (u'on', u'a')==True and label is u'pos' 7.107 (u',', u'which')==True and label is u'pos' |
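For clarity, the bigram extractor can be written without the explicit `iter()` call, since `nltk.ngrams` already returns an iterator of tuples. A minimal sketch with a toy sentence of my own:

```python
from nltk import ngrams

def bag_of_ngrams(words, n=2):
    """Each n-gram tuple in the document becomes a binary 'present' feature."""
    return dict((ng, True) for ng in ngrams(words, n))

print(bag_of_ngrams("not funny at all".split()))
# -> {('not', 'funny'): True, ('funny', 'at'): True, ('at', 'all'): True}
```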
And again, we can use the word features and the n-gram (bigram) features together:
In [63]: def bag_of_all(words, n=2): ....: all_features = bag_of_words(words) ....: ngram_features = bag_of_ngrams(words, n=n) ....: all_features.update(ngram_features) ....: return all_features ....: In [64]: data_sets = [(bag_of_all(d), c) for (d, c) in documents] In [65]: train_set, test_set = data_sets[100:], data_sets[:100] In [66]: nb_all_classifier = nltk.NaiveBayesClassifier.train(train_set) In [67]: print nltk.classify.accuracy(nb_all_classifier, test_set) 0.83 In [68]: nb_all_classifier.show_most_informative_features(10) Most Informative Features (u'is', u'terrific') = True pos : neg = 17.1 : 1.0 (u'not', u'funny') = True neg : pos = 16.9 : 1.0 outstanding = True pos : neg = 13.9 : 1.0 (u'boring', u'and') = True neg : pos = 13.6 : 1.0 (u'and', u'boring') = True neg : pos = 13.6 : 1.0 avoids = True pos : neg = 13.1 : 1.0 (u'our', u'own') = True pos : neg = 13.1 : 1.0 (u'why', u'did') = True neg : pos = 12.9 : 1.0 (u'enjoyable', u',') = True pos : neg = 12.4 : 1.0 (u'works', u'well') = True pos : neg = 12.4 : 1.0 In [71]: maxent_all_classifier = nltk.MaxentClassifier.train(train_set, "megam") Scanning file...1900 train, 0 dev, 0 test, reading...done Warning: there only appear to be two classes, but we're optimizing with BFGS...using binary optimization with CG would be much faster optimizing with lambda = 0 it 1 dw 8.715e-02 pp 3.82841e-01 er 0.17684 it 2 dw 2.846e-02 pp 2.97371e-01 er 0.11632 it 3 dw 1.299e-02 pp 2.79797e-01 er 0.11421 it 4 dw 2.456e-02 pp 2.64735e-01 er 0.11053 it 5 dw 4.200e-02 pp 2.47440e-01 er 0.10789 it 6 dw 1.417e-01 pp 2.04814e-01 er 0.08737 it 7 dw 1.330e-02 pp 2.03060e-01 er 0.08737 it 8 dw 3.177e-02 pp 1.92654e-01 er 0.08421 it 9 dw 5.613e-02 pp 1.38725e-01 er 0.05789 it 10 dw 1.339e-01 pp 7.92844e-02 er 0.02368 it 11 dw 1.734e-01 pp 6.71341e-02 er 0.01316 it 12 dw 1.313e-01 pp 6.55828e-02 er 0.01263 it 13 dw 2.036e-01 pp 6.38482e-02 er 0.01421 it 14 dw 1.230e-02 pp 5.96907e-02 er 0.01368 it 15 dw 9.719e-02 pp 4.03190e-02 er 0.00842 it 16 dw 4.004e-02 pp 3.98276e-02 er 0.00737 it 17 dw 1.598e-01 pp 2.68187e-02 er 0.00316 it 18 dw 1.900e-01 pp 2.57116e-02 er 0.00211 it 19 dw 4.355e-01 pp 2.14572e-02 er 0.00263 it 20 dw 1.029e-01 pp 1.91407e-02 er 0.00211 it 21 dw 1.347e-01 pp 1.46859e-02 er 0.00105 it 22 dw 2.231e-01 pp 1.26997e-02 er 0.00053 it 23 dw 2.942e-01 pp 1.20663e-02 er 0.00000 it 24 dw 3.836e-01 pp 1.14817e-02 er 0.00000 it 25 dw 4.213e-01 pp 9.89037e-03 er 0.00000 it 26 dw 1.875e-01 pp 7.06744e-03 er 0.00000 it 27 dw 2.865e-01 pp 5.61255e-03 er 0.00000 it 28 dw 5.903e-01 pp 4.94776e-03 er 0.00000 it 29 dw 0.000e+00 pp 4.94776e-03 er 0.00000 ------------------------- ....... ------------------------- ....... it 8 dw 2.024e-01 pp 8.14623e-10 er 0.00000 it 9 dw 9.264e-02 pp 7.87683e-10 er 0.00000 it 10 dw 5.845e-02 pp 7.38397e-10 er 0.00000 it 11 dw 2.418e-01 pp 6.34000e-10 er 0.00000 it 12 dw 5.081e-01 pp 6.19061e-10 er 0.00000 it 13 dw 0.000e+00 pp 6.19061e-10 er 0.00000 In [72]: print nltk.classify.accuracy(maxent_all_classifier, test_set) 0.91 In [73]: maxent_all_classifier.show_most_informative_features(10) 11.220 to==True and label is u'neg' 3.415 ,==True and label is u'neg' 3.360 '==True and label is u'neg' 3.310 this==True and label is u'neg' 3.243 a==True and label is u'neg' -3.218 (u'does', u'a')==True and label is u'neg' 3.164 have==True and label is u'neg' -3.024 what==True and label is u'neg' 2.966 more==True and label is u'neg' -2.891 (u',', u'which')==True and label is u'neg' |
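Putting the three feature extractors side by side, the whole comparison can be run in one loop. This is only a reorganization of the naive Bayes experiments above, not new functionality, and the accuracies will differ a little from run to run because of the random shuffle:

```python
# Compare the three feature representations with a naive Bayes classifier.
extractors = {
    'words': bag_of_words,
    'bigrams': bag_of_ngrams,
    'words+bigrams': bag_of_all,
}
for name, extract in extractors.items():
    sets = [(extract(d), c) for (d, c) in documents]
    train, test = sets[100:], sets[:100]
    clf = nltk.NaiveBayesClassifier.train(train)
    print("%s: %.2f" % (name, nltk.classify.accuracy(clf, test)))
```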
We get the best sentiment analysis performance in this case (0.91), although there are still some problems: for example, punctuation and stop words were not discarded. This is just a case study, and we encourage you to test with more data, more features, or stronger machine learning models such as deep learning methods.
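As a starting point for those improvements, here is a hedged sketch of a feature extractor that drops English stop words and punctuation before building the bag of words; whether it actually improves accuracy on this corpus is left for you to verify:

```python
# Sketch only: filter stop words and punctuation before extracting features.
# Requires: nltk.download('stopwords')
import string
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
PUNCTUATION = set(string.punctuation)

def bag_of_filtered_words(words):
    """Bag-of-words features with stop words and punctuation removed."""
    return dict((w, True) for w in words
                if w.lower() not in STOPWORDS and w not in PUNCTUATION)
```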
Posted by TextMiner