fastText for Fast Sentiment Analysis

fastText is a library for fast text representation and classification, recently released by the Facebook Research team. The related papers are “Enriching Word Vectors with Subword Information” and “Bag of Tricks for Efficient Text Classification”. I’m very interested in the text classification method in fastText, so I tested it on the Large Movie Review Dataset for fast sentiment analysis.

First get the fastText code and compile it:

git clone https://github.com/facebookresearch/fastText.git
cd fastText
make

c++ -pthread -O3 -funroll-loops -std=c++0x -c src/args.cc
c++ -pthread -O3 -funroll-loops -std=c++0x -c src/dictionary.cc
c++ -pthread -O3 -funroll-loops -std=c++0x -c src/matrix.cc
c++ -pthread -O3 -funroll-loops -std=c++0x -c src/vector.cc
c++ -pthread -O3 -funroll-loops -std=c++0x -c src/model.cc
c++ -pthread -O3 -funroll-loops -std=c++0x -c src/utils.cc
c++ -pthread -O3 -funroll-loops -std=c++0x args.o dictionary.o matrix.o vector.o model.o utils.o src/fasttext.cc -o fasttext

Then run classification-example.sh:

sh -x classification-example.sh

+ RESULTDIR=result
+ DATADIR=data
+ mkdir -p result
+ mkdir -p data
+ [ ! -f data/dbpedia.train ]
+ wget -c https://googledrive.com/host/0Bz8a_Dbh9QhbQ2Vic1kxMmZZQ1k -O data/dbpedia_csv.tar.gz
--2016-08-07 16:50:33-- https://googledrive.com/host/0Bz8a_Dbh9QhbQ2Vic1kxMmZZQ1k
+ tar -xzvf data/dbpedia_csv.tar.gz -C data
dbpedia_csv/
dbpedia_csv/classes.txt
dbpedia_csv/test.csv
dbpedia_csv/train.csv
dbpedia_csv/readme.txt
+ cat data/dbpedia_csv/train.csv
+ normalize_text
+ tr [:upper:] [:lower:]
+ sed -e s/^/__label__/g
+ sed -e s/'/ ' /g -e s/"//g -e s/\./ \. /g -e s/<br \/>/ /g -e s/,/ , /g -e s/(/ ( /g -e s/)/ ) /g -e s/\!/ \! /g -e s/\?/ \? /g -e s/\;/ /g -e s/\:/ /g
+ tr -s
+ myshuf
+ perl -MList::Util=shuffle -e print shuffle(<>);
+ cat data/dbpedia_csv/test.csv
+ normalize_text
+ sed -e s/^/__label__/g
+ sed -e s/'/ ' /g -e s/"//g+ -e s/\./ \. /gtr -e s/<br \/>/ /g [:upper:] -e [:lower:] s/,/ , /g
-e s/(/ ( /g -e s/)/ ) /g -e s/\!/ \! /g -e s/\?/ \? /g -e s/\;/ /g -e s/\:/ /g
+ tr -s
+ myshuf
+ perl -MList::Util=shuffle -e print shuffle(<>);
+ make
make: Nothing to be done for 'all'.
+ ./fasttext supervised -input data/dbpedia.train -output result/dbpedia -dim 10 -lr 0.1 -wordNgrams 2 -minCount 1 -bucket 10000000 -epoch 5 -thread 4
Read 32M words
Progress: 0.0% words/sec/thread: 338554 lr: 0.099994 loss: 2.654805 eta: 0h2
Progress: 0.0% words/sec/thread: 497461 lr: 0.099982 loss: 2.654805 eta: 0h1
Progress: 0.0% words/sec/thread: 713565 lr: 0.099957 loss: 2.654805 eta: 0h0
Progress: 0.1% words/sec/thread: 824608 lr: 0.099933 loss: 2.654805 eta: 0h0
Progress: 0.1% words/sec/thread: 956788 lr: 0.099902 loss: 2.654805 eta: 0h0
Progress: 0.1% words/sec/thread: 1002078 lr: 0.099878 loss: 2.654805 eta: 0h
Progress: 0.1% words/sec/thread: 1042413 lr: 0.099853 loss: 2.654727 eta: 0h
Progress: 0.2% words/sec/thread: 1113521 lr: 0.099823 loss: 2.654189 eta: 0h
......
Train time: 15.000000 sec
+ ./fasttext test result/dbpedia.bin data/dbpedia.test
P@1: 0.984
Number of examples: 70000
+ ./fasttext predict result/dbpedia.bin data/dbpedia.test

The dbpedia_csv.tar.gz archive is the DBpedia ontology classification dataset:

The DBpedia ontology classification dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. They are listed in classes.txt. From each of these 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples. Therefore, the total size of the training dataset is 560,000 and of the testing dataset 70,000.

The files train.csv and test.csv contain the training and testing samples, respectively, as comma-separated values. There are 3 columns in them, corresponding to class index (1 to 14), title and content. The title and content are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). There are no new lines in title or content.
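
As a small illustration of that quoting convention, Python's standard csv module reads it directly, since a doubled quote is its default escape. This is only a sketch: read_dbpedia_rows is a hypothetical helper name, and the sample row is invented for the example rather than taken from the dataset.

```python
import csv
import io


def read_dbpedia_rows(csv_file):
    """Yield (class_index, title, content) tuples from a DBpedia-style CSV stream."""
    # csv.reader treats "" inside a quoted field as a literal double quote by default.
    for class_index, title, content in csv.reader(csv_file):
        yield int(class_index), title, content


if __name__ == "__main__":
    # Invented sample row in the three-column format described above.
    sample = io.StringIO('3,"Example Title","She said ""hello"" to everyone."\n')
    for row in read_dbpedia_rows(sample):
        print(row)
```

For the real files, the same function could be given an open handle on data/dbpedia_csv/train.csv (opened with newline="" as the csv documentation recommends).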

Back to the Large Movie Review Dataset (aclImdb_v1.tar.gz): we can follow the same steps to get a fast sentiment analysis result. The dataset is described as follows:

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg).

First we write a script that converts the positive and negative movie reviews into two files, output_train and output_test, with one "label,review" line per review:

head -n 10 output_train

pos,Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as “Teachers”. My 35 years in the teaching profession lead me to believe that Bromwell High’s satire is much closer to reality than is “Teachers”. The scramble to survive financially, the insightful students who can see right through their pathetic teachers’ pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ……… at ………. High. A classic line: INSPECTOR: I’m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn’t!
……

head -n 10 output_test

pos,I went and saw this movie last night after being coaxed to by a few friends of mine. I’ll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.
……
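
The preprocessing script itself is not shown here. As a rough sketch of what it could look like in Python, assuming the archive unpacks to aclImdb/train/pos, aclImdb/train/neg, aclImdb/test/pos and aclImdb/test/neg with one review per .txt file (build_labeled_file is a hypothetical name, and files are simply taken in sorted filename order, which may differ from the order used for the samples above):

```python
from pathlib import Path


def build_labeled_file(split_dir, out_path, labels=("pos", "neg")):
    """Write one 'label,review text' line per review found in split_dir/<label>/*.txt.

    Returns the number of reviews written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for label in labels:
            for review in sorted((Path(split_dir) / label).glob("*.txt")):
                text = review.read_text(encoding="utf-8")
                # Collapse any line breaks or repeated whitespace so that
                # each review occupies exactly one output line.
                text = " ".join(text.split())
                out.write(f"{label},{text}\n")
                count += 1
    return count


if __name__ == "__main__":
    root = Path("aclImdb")  # assumed location of the unpacked archive
    if root.is_dir():
        for split in ("train", "test"):
            n = build_labeled_file(root / split, f"output_{split}")
            print(f"output_{split}: {n} reviews")
```

The label is written without the __label__ prefix on purpose, because normalize_text() in the shell script adds it later.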

Then we copy and modify classification-example.sh to classification-lmdb-example.sh:

#!/bin/sh
#
# Copyright (c) 2016-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree. An additional grant
# of patent rights can be found in the PATENTS file in the same directory.
#

myshuf() {
  perl -MList::Util=shuffle -e 'print shuffle(<>);' "$@";
}

normalize_text() {
  tr '[:upper:]' '[:lower:]' | sed -e 's/^/__label__/g' | \
    sed -e "s/'/ ' /g" -e 's/"//g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' \
        -e 's/,/ , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
        -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' | tr -s " " | myshuf
}

RESULTDIR=lmdbresult
DATADIR=lmdbdata

mkdir -p "${RESULTDIR}"
mkdir -p "${DATADIR}"

if [ ! -f "${DATADIR}/lmdb.train" ]
then
  cat "${DATADIR}/output_train" | normalize_text > "${DATADIR}/lmdb.train"
  cat "${DATADIR}/output_test" | normalize_text > "${DATADIR}/lmdb.test"
fi

./fasttext supervised -input "${DATADIR}/lmdb.train" -output "${RESULTDIR}/lmdb" -dim 10 -lr 0.1 -wordNgrams 2 -minCount 1 -bucket 10000000 -epoch 5 -thread 4

./fasttext test "${RESULTDIR}/lmdb.bin" "${DATADIR}/lmdb.test"

./fasttext predict "${RESULTDIR}/lmdb.bin" "${DATADIR}/lmdb.test" > "${RESULTDIR}/lmdb.test.predict"

Here we use the same normalize_text() for the lmdb train and test data; it also prepends __label__ to the pos or neg tag:

textminer@textminer:~/text/fastText/lmdbdata$ wc -l lmdb.train
25000 lmdb.train
textminer@textminer:~/text/fastText/lmdbdata$ head -n 1 lmdb.train
__label__neg , a female vampire kills young women and paints with their blood . she has an assistant who doesn ' t want to be a vampire , so he has to do what she orders or be turned into a blood sucker . after a few kills , the assistant gets remorse and falls in love with a homeless girl . what can i say about this movie ? that its pacing is over-slow , that it has some strange sound effects ( never a bite sounded so strange ) and ambiance ( new jazz here i come ) and that lights don ' t seem to be included on the set . it looks like an auteur horror movie with all the self-sufficiency inside . the plot is completely stupid and as you can guess , it ' s the female vampire who explains how to kill her even if she doesn ' t have to do it of course , crosses , light , garlic and sticks don ' t work . it ' s not even a funny lousy movie . perhaps with some friends and a lot of beers , it can ' t have its funny sides ( to be honest , it ' s funny during 10 – 15 minutes near the end of the movie ) . don ' t be fooled by the troma sticker , it ' s one the bad movie they present .
textminer@textminer:~/text/fastText/lmdbdata$ wc -l lmdb.test
25000 lmdb.test
textminer@textminer:~/text/fastText/lmdbdata$ head -n 1 lmdb.test
__label__pos , this is the true story of the great pianist and jazz singer/legend ray charles ( oscar , bafta and golden globe winning jamie foxx ) . he was born in a poor african american-town , and he went blind at 7 years old , but with his skills of touch and hearing , this is what would later in life would lead him to stardom . by the 1960 ' s he had accomplished his dream , and selling records in millions , and leading the charts with songs and albums . but the story also showed his downfalls , including the separation from his wife and child , because of his affair with a band member , his drug and alcohol use , and going to prison because of this . also starring regina king as margie hendricks , kerry washington as della bea robinson , clifton powell as jeff brown , harry j . lennix as joe adams , bokeem woodbine as fathead newman , aunjanue ellis as mary ann fisher , sharon warren as aretha robinson , c . j . sanders as young ray robinson , curtis armstrong as ahmet ertegun and richard schiff as jerry wexler . it is a great story with a great singer impression , the songs , including hit the road jack , are the highlights . it won the oscar for best sound mixing , and it was nominated for best costume design , best director for taylor hackford , best editing and best motion picture of the year , it won the bafta for best sound , and it was nominated for the anthony asquith award for film music for craig armstrong and best original screenplay , and it was nominated the golden globe for best motion picture – musical or comedy . it was number 99 on 100 years , 100 cheers . very good !
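
To make the effect of normalize_text() on a single line easier to follow, here is a rough Python equivalent of the tr/sed pipeline, applied in the same order and without the final shuffle. It is a sketch for illustration only, not part of fastText; normalize_line is a hypothetical name.

```python
import re


def normalize_line(line):
    """Approximate the shell normalize_text() pipeline for one 'label,text' line."""
    line = line.lower()                    # tr '[:upper:]' '[:lower:]'
    line = "__label__" + line              # sed 's/^/__label__/g'
    line = line.replace("'", " ' ")        # pad apostrophes with spaces
    line = line.replace('"', "")           # drop double quotes
    line = line.replace(".", " . ")        # pad periods
    line = line.replace("<br />", " ")     # strip HTML line breaks
    for ch in ",()!?":                     # pad remaining punctuation
        line = line.replace(ch, f" {ch} ")
    line = line.replace(";", " ").replace(":", " ")
    return re.sub(r" +", " ", line)        # tr -s " " squeezes repeated spaces


if __name__ == "__main__":
    print(normalize_line("pos,It's good.<br />Really (yes)!"))
```

Because the comma after the label is padded like any other comma, the result starts with "__label__pos , ", which matches the lines shown above.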

The data has been shuffled. Running the script, we get:

sh -x classification-lmdb-example.sh
+ RESULTDIR=lmdbresult
+ DATADIR=lmdbdata
+ mkdir -p lmdbresult
+ mkdir -p lmdbdata
+ [ ! -f lmdbdata/lmdb.train ]
+ cat lmdbdata/output_train
+ normalize_text
+ tr -s
+ tr [:upper:] [:lower:]+
sed -e+ s/^/__label__/g
myshuf
+ perl -MList::Util=shuffle -e print shuffle(<>);
+ sed -e s/'/ ' /g -e s/"//g -e s/\./ \. /g -e s/<br \/>/ /g -e s/,/ , /g -e s/(/ ( /g -e s/)/ ) /g -e s/\!/ \! /g -e s/\?/ \? /g -e s/\;/ /g -e s/\:/ /g
+ cat lmdbdata/output_test
+ normalize_text
+ tr [:upper:] [:lower:]
+ sed -e s/^/__label__/g
+ sed+ -e s/'/ ' /g -e s/"//g -e s/\./ \. /g -e s/<br \/>/ /gtr -e s/,/ , /g -s -e s/(/ ( /g
-e s/)/ ) /g -e s/\!/ \! /g -e s/\?/ \? /g -e s/\;/ /g -e s/\:/ /g
+ myshuf
+ perl -MList::Util=shuffle -e print shuffle(<>);
+ ./fasttext supervised -input lmdbdata/lmdb.train -output lmdbresult/lmdb -dim 10 -lr 0.1 -wordNgrams 2 -minCount 1 -bucket 10000000 -epoch 5 -thread 4
Read 6M words
Progress: 0.0% words/sec/thread: 567700 lr: 0.099969 loss: 0.695051 eta: 0h0
Progress: 0.1% words/sec/thread: 664640 lr: 0.099909 loss: 0.694964 eta: 0h0
Progress: 0.2% words/sec/thread: 749614 lr: 0.099820 loss: 0.694986 eta: 0h0
Progress: 0.3%
......
Progress: 98.1% words/sec/thread: 2893504 lr: 0.001874 loss: 0.459213 eta: 0
Progress: 98.2% words/sec/thread: 2892876 lr: 0.001786 loss: 0.459080 eta: 0
Progress: 98.3% words/sec/thread: 2893128 lr: 0.001666 loss: 0.458800 eta: 0
Progress: 98.5% words/sec/thread: 2894126 lr: 0.001517 loss: 0.458699 eta: 0
Progress: 98.6% words/sec/thread: 2894408 lr: 0.001398 loss: 0.458508 eta: 0
Progress: 98.7% words/sec/thread: 2894636 lr: 0.001279 loss: 0.458280 eta: 0
Progress: 98.8% words/sec/thread: 2894771 lr: 0.001158 loss: 0.458114 eta: 0
Progress: 99.0% words/sec/thread: 2895030 lr: 0.001040 loss: 0.457879 eta: 0
Progress: 99.1% words/sec/thread: 2895397 lr: 0.000919 loss: 0.457723 eta: 0
Progress: 99.2% words/sec/thread: 2895666 lr: 0.000799 loss: 0.457522 eta: 0
Progress: 99.3% words/sec/thread: 2895922 lr: 0.000680 loss: 0.457340 eta: 0
Progress: 99.4% words/sec/thread: 2896141 lr: 0.000561 loss: 0.457192 eta: 0
Progress: 99.6% words/sec/thread: 2896448 lr: 0.000441 loss: 0.457020 eta: 0
Progress: 99.7% words/sec/thread: 2896746 lr: 0.000322 loss: 0.456832 eta: 0
Progress: 99.8% words/sec/thread: 2896189 lr: 0.000233 loss: 0.456715 eta: 0
Progress: 99.9% words/sec/thread: 2897281 lr: 0.000084 loss: 0.456515 eta: 0
Progress: 100.0% words/sec/thread: 2896822 lr: 0.000001 loss: 0.456412 eta:
Progress: 100.0% words/sec/thread: 2896817 lr: 0.000001 loss: 0.456412 eta: 0h0m
Train time: 3.000000 sec
+ ./fasttext test lmdbresult/lmdb.bin lmdbdata/lmdb.test
P@1: 0.86
Number of examples: 25000
+ ./fasttext predict lmdbresult/lmdb.bin lmdbdata/lmdb.test

Training is really fast (about 3 seconds here), and the accuracy on this sentiment analysis task (P@1 of 0.86) is roughly comparable with other methods.
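
Since every review carries exactly one label, P@1 here is simply the fraction of reviews whose predicted label matches the true one. A minimal sketch for cross-checking that figure from the script's output files follows. It assumes that each line of lmdb.test.predict starts with the predicted label (for example __label__pos) and that the lines are in the same order as lmdb.test; precision_at_1 is a hypothetical helper, not a fastText function.

```python
def precision_at_1(test_lines, predicted_lines):
    """Fraction of examples whose first predicted label equals the gold label.

    test_lines: iterable of lines such as '__label__pos , some review text'
    predicted_lines: iterable of lines whose first token is the predicted label
    """
    correct = 0
    total = 0
    for gold_line, pred_line in zip(test_lines, predicted_lines):
        gold_tokens = gold_line.split()
        pred_tokens = pred_line.split()
        if not gold_tokens or not pred_tokens:
            continue  # skip a pair if either line is empty
        total += 1
        if gold_tokens[0] == pred_tokens[0]:
            correct += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Tiny invented example: one of the two predictions is right.
    gold = ["__label__pos , good movie", "__label__neg , bad movie"]
    pred = ["__label__pos", "__label__pos"]
    print(precision_at_1(gold, pred))
```

To use it on the real output, pass open file handles for lmdbdata/lmdb.test and lmdbresult/lmdb.test.predict as the two arguments.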

Posted by TextMiner

