Deep Learning Practice for NLP: Large Movie Review Data Sentiment Analysis from Scratch
Deep Learning with Python is a very good book I have read recently:
Deep Learning with Python introduces the field of deep learning using the Python language and the powerful Keras library. Written by Keras creator and Google AI researcher François Chollet, this book builds your understanding through intuitive explanations and practical examples. You’ll explore challenging concepts and practice with applications in computer vision, natural-language processing, and generative models. By the time you finish, you’ll have the knowledge and hands-on skills to apply deep learning in your own projects.
Everything is fine, and in particular the example data are prepared very well, so you can play with sentiment analysis within 20 lines of code. But most of the time in practical work you have to do more with the original data yourself, such as cleaning it and preprocessing it with tokenization, lowercasing, and so on. So I have launched this series, Deep Learning Practice for NLP, which focuses on learning by doing NLP with deep learning from scratch. The first example comes from chapter 3.4, Classifying movie reviews: a binary classification example, which can be seen as a simple sentiment analysis task.
Let me first introduce the original movie review dataset, aclImdb: the Large Movie Review Dataset, released in 2011 by the Stanford Artificial Intelligence Laboratory. It contains 25,000 training examples and 25,000 test examples, plus another 50,000 unlabeled examples. The training set and the test set each contain 12,500 positive and 12,500 negative examples. For a more detailed description of the data, you can refer to the dataset's official website: http://ai.stanford.edu/~amaas/data/sentiment/, the paper Learning Word Vectors for Sentiment Analysis, and the README included in the dataset.
Then download and extract the data, Large Movie Review Dataset v1.0, from the link below:
http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -zxvf aclImdb_v1.tar.gz
and view the directory structure with the tree command:
tree aclImdb -L 2
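If the extraction succeeded, the output should look roughly like this (a few auxiliary files may differ between dataset versions):

aclImdb
├── README
├── imdb.vocab
├── imdbEr.txt
├── test
│   ├── labeledBow.feat
│   ├── neg
│   ├── pos
│   ├── urls_neg.txt
│   └── urls_pos.txt
└── train
    ├── labeledBow.feat
    ├── neg
    ├── pos
    ├── unsup
    ├── unsupBow.feat
    ├── urls_neg.txt
    ├── urls_pos.txt
    └── urls_unsup.txt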
cd aclImdb/train/pos/ and you will find lots of text files:
Let’s see one of the files: vim 4321_8.txt
I’ll keep this one quite short. I believe that this is an extraordinary movie. I see other reviewers who have commented to the effect that it’s badly written, poorly shot, has a terrible soundtrack and, worse, that it’s not real in its portrayal of life. OK, so it may not be quite believable for its whole length, but this movie carries a message of hope which some others seemed to have missed. Hope that it isn’t too late to save people from the terrible things that go on in so many lives. Gangland violence is real, right? Is it right, no! This movie carries an important social message which the cynics may dislike but which nonetheless is to be praised, rather than denigrated. I have watched this movie with great enjoyment at least eight times, each time with equal enjoyment and each time with the feeling that maybe the world could be made better and is not beyond saving (well not until 2008 anyway). 9 out of 10 from me for this one. It’s very nearly perfect in my view. JMV
Before processing the original data, you should think about the purpose of the processing. Here I mainly want to reuse the official Keras interface, which provides a Python script, imdb.py, to load the preprocessed imdb data, but (seemingly) does not provide a script to build that data. The script mainly reads two files: one is imdb_word_index.json and the other is imdb.npz. imdb_word_index.json is a word index file, sorted by word frequency; the first word in the index, with index 1, is "the". You can look at it here:
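If you want to check this yourself, a minimal sketch like the following should do; the local file name imdb_word_index.json is an assumption, so point it to wherever you have saved the official file:

import json

# Assumed local copy of the official Keras imdb_word_index.json
with open('imdb_word_index.json') as f:
    word_index = json.load(f)

# The index is sorted by frequency, so the most frequent word gets index 1
print(word_index['the'])   # expected: 1
print(len(word_index))     # vocabulary size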
imdb.npz is a NumPy NPZ file, which stores multiple numpy arrays. It mainly contains the imdb training set and test set, converted to id sequences based on the word index file above. Let's take a look:
In [1]: import numpy as np

In [2]: f = np.load('imdb.npz')

In [3]: f.keys()
Out[3]: ['x_test', 'x_train', 'y_train', 'y_test']

In [4]: x_train, y_train, x_test, y_test = f['x_train'], f['y_train'], f['x_test'], f['y_test']

In [5]: len(x_train), len(y_train), len(x_test), len(y_test)
Out[5]: (25000, 25000, 25000, 25000)

In [6]: x_train.shape
Out[6]: (25000,)

In [7]: y_train.shape
Out[7]: (25000,)

...

In [12]: x_train[0:2]
Out[12]:
array([ [23022, 309, 6, 3, 1069, 209, 9, 2175, 30, 1, 169, 55, 14, 46, 82, 5869, 41, 393, 110, 138, 14, 5359, 58, 4477, 150, 8, 1, 5032, 5948, 482, 69, 5, 261, 12, 23022, 73935, 2003, 6, 73, 2436, 5, 632, 71, 6, 5359, 1, 25279, 5, 2004, 10471, 1, 5941, 1534, 34, 67, 64, 205, 140, 65, 1232, 63526, 21145, 1, 49265, 4, 1, 223, 901, 29, 3024, 69, 4, 1, 5863, 10, 694, 2, 65, 1534, 51, 10, 216, 1, 387, 8, 60, 3, 1472, 3724, 802, 5, 3521, 177, 1, 393, 10, 1238, 14030, 30, 309, 3, 353, 344, 2989, 143, 130, 5, 7804, 28, 4, 126, 5359, 1472, 2375, 5, 23022, 309, 10, 532, 12, 108, 1470, 4, 58, 556, 101, 12, 23022, 309, 6, 227, 4187, 48, 3, 2237, 12, 9, 215],
       [23777, 39, 81226, 14, 739, 20387, 3428, 44, 74, 32, 1831, 15, 150, 18, 112, 3, 1344, 5, 336, 145, 20, 1, 887, 12, 68, 277, 1189, 403, 34, 119, 282, 36, 167, 5, 393, 154, 39, 2299, 15, 1, 548, 88, 81, 101, 4, 1, 3273, 14, 40, 3, 413, 1200, 134, 8208, 41, 180, 138, 14, 3086, 1, 322, 20, 4930, 28948, 359, 5, 3112, 2128, 1, 20045, 19339, 39, 8208, 45, 3661, 27, 372, 5, 127, 53, 20, 1, 1983, 7, 7, 18, 48, 45, 22, 68, 345, 3, 2131, 5, 409, 20, 1, 1983, 15, 3, 3238, 206, 1, 31645, 22, 277, 66, 36, 3, 341, 1, 719, 729, 3, 3865, 1265, 20, 1, 1510, 3, 1219, 2, 282, 22, 277, 2525, 5, 64, 48, 42, 37, 5, 27, 3273, 12, 6, 23030, 75120, 2034, 7, 7, 3771, 3225, 34, 4186, 34, 378, 14, 12583, 296, 3, 1023, 129, 34, 44, 282, 8, 1, 179, 363, 7067, 5, 94, 3, 2131, 16, 3, 5211, 3005, 15913, 21720, 5, 64, 45, 26, 67, 409, 8, 1, 1983, 15, 3261, 501, 206, 1, 31645, 45, 12583, 2877, 26, 67, 78, 48, 26, 491, 16, 3, 702, 1184, 4, 228, 50, 4505, 1, 43259, 20, 118, 12583, 6, 1373, 20, 1, 887, 16, 3, 20447, 20, 24, 3964, 5, 10455, 24, 172, 844, 118, 26, 188, 1488, 122, 1, 6616, 237, 345, 1, 13891, 32804, 31, 3, 39870, 100, 42, 395, 20, 24, 12130, 118, 12583, 889, 82, 102, 584, 3, 252, 31, 1, 400, 4, 4787, 16974, 1962, 3861, 32, 1230, 3186, 34, 185, 4310, 156, 2325, 38, 341, 2, 38, 9048, 7355, 2231, 4846, 2, 32880, 8938, 2610, 34, 23, 457, 340, 5, 1, 1983, 504, 4355, 12583, 215, 237, 21, 340, 5, 4468, 5996, 34689, 37, 26, 277, 119, 51, 109, 1023, 118, 42, 545, 39, 2814, 513, 39, 27, 553, 7, 7, 134, 1, 116, 2022, 197, 4787, 2, 12583, 283, 1667, 5, 111, 10, 255, 110, 4382, 5, 27, 28, 4, 3771, 12267, 16617, 105, 118, 2597, 5, 109, 3, 209, 9, 284, 3, 4325, 496, 1076, 5, 24, 2761, 154, 138, 14, 7673, 11900, 182, 5276, 39, 20422, 15, 1, 548, 5, 120, 48, 42, 37, 257, 139, 4530, 156, 2325, 9, 1, 372, 248, 39, 20, 1, 82, 505, 228, 3, 376, 2131, 37, 29, 1023, 81, 78, 51, 33, 89, 121, 48, 5, 78, 16, 65, 275, 276, 33, 141, 199, 9, 5, 1, 3273, 302, 4, 769, 9, 37, 17648, 275, 7, 7, 39, 276, 11, 19, 77, 6018, 22, 5, 336, 406]], dtype=object)

In [13]: y_train[0:2]
Out[13]: array([1, 1])

In [14]: x_test.shape
Out[14]: (25000,)

In [15]: y_test.shape
Out[15]: (25000,)

In [16]: x_test[0:2]
Out[16]:
array([ [10, 432, 2, 216, 11, 17, 233, 311, 100, 109, 27791, 5, 31, 3, 168, 366, 4, 1920, 634, 971, 12, 10, 13, 5523, 5, 64, 9, 85, 36, 48, 10, 694, 4, 13059, 15969, 26, 13, 61, 499, 5, 78, 209, 10, 13, 352, 15969, 253, 1, 106, 4, 3270, 14998, 52, 70, 2, 1839, 11762, 253, 1019, 7655, 16, 138, 12866, 1, 1910, 4, 3, 49, 17, 6, 12, 9, 67, 2885, 16, 260, 1435, 11, 28, 119, 615, 12, 1, 433, 747, 60, 13, 2959, 43, 13, 3080, 31, 2126, 312, 1, 83, 317, 4, 1, 17, 2, 68, 1678, 5, 1671, 312, 1, 330, 317, 134, 14200, 1, 747, 10, 21, 61, 216, 108, 369, 8, 1671, 18, 108, 365, 2068, 346, 14, 70, 266, 2721, 21, 5, 384, 256, 64, 95, 2575, 11, 17, 13, 84, 2, 10, 1464, 12, 22, 137, 64, 9, 156, 22, 1916],
       [281, 676, 164, 985, 5696, 1157, 53, 24, 2425, 2013, 1, 3357, 186, 11603, 16, 11, 220, 2572, 2252, 450, 41, 1, 21308, 1203, 587, 908, 118, 3, 182, 295, 47415, 5157, 36, 24, 4486, 975, 5, 294, 426, 24, 7117, 8, 48, 13, 2275, 14, 1, 830, 497, 123, 253, 143, 54, 334, 4, 8891, 2, 131, 10465, 9594, 2252, 1551, 23, 3, 9591, 3, 2517, 88, 1030, 221, 5, 1755, 959, 16, 4628, 2, 2376, 129, 18, 46, 86, 11, 19, 13, 8480, 29, 1, 169, 7, 7, 1, 19, 514, 16, 46, 1515, 633, 895, 835, 3, 51329, 307, 4, 1, 1122, 633, 895, 4, 27000, 49040, 2, 5544, 18, 35402, 364, 1361, 15, 91, 83, 31, 1, 1393, 531, 277, 1, 203, 1099, 5, 1, 1203, 587, 908, 180, 1258, 53, 52, 70, 5696, 124, 3, 324, 289, 2, 284, 3, 9408, 15, 1131, 3664, 15697, 10, 444, 1, 2514, 11836, 4223, 4, 1, 203, 20, 248, 104, 4, 1, 908, 12, 19323, 1, 111, 1034, 39, 760, 46, 2073, 1984, 1134, 5, 1, 3917, 222, 46, 1441, 106, 940, 51, 1, 695, 1332, 6, 2365, 31, 1215, 4, 1, 15171, 8, 325, 3672, 2, 347, 6085, 34, 2727, 24, 220, 17370, 14, 3, 503, 5, 94, 93, 15, 3, 8891, 262, 26, 79, 124, 3, 49, 289, 4, 2006, 5004, 48, 268, 20, 8, 1, 73329, 1825, 464, 5097, 8891, 3, 2146, 354, 4106, 6, 836, 6313, 1236, 130, 1106, 141, 79, 27, 345, 1, 267, 16132, 2, 2295, 2547, 15, 1852, 32, 1725, 807, 415, 838, 4, 1313, 2, 5788, 30, 1, 451, 4, 1, 10257, 1114, 7, 7, 22, 121, 86, 11, 6, 167, 5, 127, 21, 61, 85, 42, 445, 20, 3, 280, 62, 18, 79, 85, 105, 8, 11, 509, 791, 1, 169, 14212, 117, 2, 117, 18, 5696, 1454, 20, 3, 125, 71, 853, 120, 2, 379, 10442, 50, 673, 493, 1, 367, 71, 26, 123, 66, 8, 1008, 4, 9, 463, 1, 4374, 873, 11, 6, 3, 324, 2, 773, 19, 5, 3660, 15, 12, 1012, 5, 166, 32, 308]], dtype=object)

In [17]: y_test[0:2]
Out[17]: array([1, 1])
Now we can process the original aclImdb data according to the above ideas. I have built a GitHub project, AINLP, whose subproject aclimdb_sentiment_analysis_from_scratch provides several Python scripts for processing the imdb data. They are compatible with both Python 2 and Python 3, but before running these scripts you need to install the dependencies listed in requirement.txt:
numpy==1.15.2
sacremoses==0.0.5
six==1.11.0
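Assuming pip is available, these can be installed from the requirements file or directly:

pip install -r requirement.txt
# or, equivalently
pip install numpy==1.15.2 sacremoses==0.0.5 six==1.11.0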
The first step is to create the word index JSON file, which is done by the script build_word_index.py. Only the data in the training set and test set are processed; the unlabeled data (unsup) are ignored:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: TextMiner (textminer@foxmail.com)
# Copyright 2018 @ AINLP

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import json
import numpy as np
import re
import six

from collections import OrderedDict
from os import walk
from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer()


def build_word_index(input_dir, output_json):
    word_count = OrderedDict()
    for root, dirs, files in walk(input_dir):
        for filename in files:
            if re.match(".*\d+_\d+.txt", filename):
                filepath = root + '/' + filename
                print(filepath)
                if 'unsup' in filepath:
                    continue
                with open(filepath, 'r') as f:
                    for line in f:
                        if six.PY2:
                            tokenize_words = tokenizer.tokenize(
                                line.decode('utf-8').strip())
                        else:
                            tokenize_words = tokenizer.tokenize(line.strip())
                        lower_words = [word.lower() for word in tokenize_words]
                        for word in lower_words:
                            if word not in word_count:
                                word_count[word] = 0
                            word_count[word] += 1
    words = list(word_count.keys())
    counts = list(word_count.values())
    sorted_idx = np.argsort(counts)
    sorted_words = [words[ii] for ii in sorted_idx[::-1]]
    word_index = OrderedDict()
    for ii, ww in enumerate(sorted_words):
        word_index[ww] = ii + 1
    with open(output_json, 'w') as fp:
        json.dump(word_index, fp)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-id', '--input_dir', type=str, nargs='?',
                        default='./data/aclImdb/',
                        help='input data directory')
    parser.add_argument('-ot', '--output_json', type=str, nargs='?',
                        default='./data/aclimdb_word_index.json',
                        help='output word index dict json')
    args = parser.parse_args()
    input_dir = args.input_dir
    output_json = args.output_json
    build_word_index(input_dir, output_json)
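To get a feel for what the tokenization and lowercasing steps do, here is a tiny sketch; the sample sentence is mine, and the exact tokens may vary slightly across sacremoses versions:

from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer()
sample = "This movie is GREAT, I watched it twice!"
tokens = tokenizer.tokenize(sample.strip())
lower_words = [word.lower() for word in tokens]
print(lower_words)
# roughly: ['this', 'movie', 'is', 'great', ',', 'i', 'watched', 'it', 'twice', '!']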
Note that my file structure looks like this; if yours is different, please set the input and output file paths:
Run:
python build_word_index.py
then you will get a word index file generated in the data directory: aclimdb_word_index.json. Because the program uses an OrderedDict, the dumped JSON file also shows the word index in frequency order. Note that punctuation and HTML tags are not removed; you can try that yourself in the future:
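For a quick sanity check of the ordering, something like this minimal sketch should work (the path is the script's default; adjust it if yours differs):

import json
from collections import OrderedDict
from itertools import islice

# Path matches the script's default output location
with open('./data/aclimdb_word_index.json') as f:
    # object_pairs_hook keeps the dumped order even on older Python versions
    word_index = json.load(f, object_pairs_hook=OrderedDict)

# The first entries should be the most frequent tokens, starting at index 1
for word, idx in islice(word_index.items(), 10):
    print(idx, word)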
Next, the second script, build_data_index.py, processes the train and test sets based on the word index file aclimdb_word_index.json generated by the previous script. It converts the plain text of the train and test sets to numeric ids, generates four numpy arrays (x_train, y_train, x_test, y_test), and stores them in an npz file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: TextMiner (textminer@foxmail.com)
# Copyright 2018 @ AINLP

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import json
import numpy as np
import re
import six

from collections import OrderedDict
from os import walk
from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer()


def get_word_index(word_index_path):
    with open(word_index_path) as f:
        return json.load(f)


def build_data_index(input_dir, word_index):
    train_x = []
    train_y = []
    for root, dirs, files in walk(input_dir):
        for filename in files:
            if re.match(".*\d+_\d+.txt", filename):
                filepath = root + '/' + filename
                print(filepath)
                if 'pos' in filepath:
                    train_y.append(1)
                elif 'neg' in filepath:
                    train_y.append(0)
                else:
                    continue
                train_list = []
                with open(filepath, 'r') as f:
                    for line in f:
                        if six.PY2:
                            tokenize_words = tokenizer.tokenize(
                                line.decode('utf-8').strip())
                        else:
                            tokenize_words = tokenizer.tokenize(line.strip())
                        lower_words = [word.lower() for word in tokenize_words]
                        for word in lower_words:
                            train_list.append(word_index.get(word, 0))
                train_x.append(train_list)
    return train_x, train_y


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-trd', '--train_dir', type=str, nargs='?',
                        default='./data/aclImdb/train/',
                        help='train data directory')
    parser.add_argument('-ted', '--test_dir', type=str, nargs='?',
                        default='./data/aclImdb/test/',
                        help='test data directory')
    parser.add_argument('-wip', '--word_index_path', type=str, nargs='?',
                        default='./data/aclimdb_word_index.json',
                        help='aclimdb word index json')
    parser.add_argument('-onz', '--output_npz', type=str, nargs='?',
                        default='./data/aclimdb.npz',
                        help='output npz')
    args = parser.parse_args()
    train_dir = args.train_dir
    test_dir = args.test_dir
    word_index_path = args.word_index_path
    output_npz = args.output_npz
    word_index = get_word_index(word_index_path)
    train_x, train_y = build_data_index(train_dir, word_index)
    test_x, test_y = build_data_index(test_dir, word_index)
    np.savez(output_npz,
             x_train=np.asarray(train_x),
             y_train=np.asarray(train_y),
             x_test=np.asarray(test_x),
             y_test=np.asarray(test_y))
After running this Python script, an aclimdb.npz file is generated in the data directory, whose structure is consistent with the official imdb.npz.
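A quick way to sanity-check the generated file is to load it the same way we inspected the official imdb.npz above. A minimal sketch, assuming the default paths:

import numpy as np

# allow_pickle is needed on NumPy >= 1.16.3 because the reviews are stored as
# variable-length lists in object arrays; older versions accept it as well
f = np.load('./data/aclimdb.npz', allow_pickle=True)
print(sorted(f.keys()))   # ['x_test', 'x_train', 'y_test', 'y_train']

x_train, y_train = f['x_train'], f['y_train']
x_test, y_test = f['x_test'], f['y_test']
print(len(x_train), len(y_train), len(x_test), len(y_test))  # 25000 each
print(x_train[0][:10])    # first ten word ids of the first review
print(y_train[:10])       # corresponding 0/1 labels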
By now we have almost prepared the data we need, but the official imdb.py does not seem to support a local file path, so I wrote a simplified version, aclimdb.py, to load the above data:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: TextMiner (textminer@foxmail.com)
# Copyright 2018 @ AINLP

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import json
import numpy as np


def get_word_index(path='./data/aclimdb_word_index.json'):
    with open(path) as f:
        return json.load(f)


def load_data(path='./data/aclimdb.npz', num_words=None, skip_top=0,
              seed=113, start_char=1, oov_char=2, index_from=3):
    """A simplified version of the original imdb.py load_data function
    https://github.com/keras-team/keras/blob/master/keras/datasets/imdb.py
    """
    with np.load(path) as f:
        x_train, labels_train = f['x_train'], f['y_train']
        x_test, labels_test = f['x_test'], f['y_test']

    # Shuffle the train and test sets with a fixed seed for reproducibility
    np.random.seed(seed)
    indices = np.arange(len(x_train))
    np.random.shuffle(indices)
    x_train = x_train[indices]
    labels_train = labels_train[indices]

    indices = np.arange(len(x_test))
    np.random.shuffle(indices)
    x_test = x_test[indices]
    labels_test = labels_test[indices]

    xs = np.concatenate([x_train, x_test])
    labels = np.concatenate([labels_train, labels_test])

    # Shift word ids by index_from and prepend the start character
    if start_char is not None:
        xs = [[start_char] + [w + index_from for w in x] for x in xs]
    elif index_from:
        xs = [[w + index_from for w in x] for x in xs]

    if not num_words:
        num_words = max([max(x) for x in xs])

    # 0 (padding), 1 (start), 2 (OOV)
    if oov_char is not None:
        xs = [[w if (skip_top <= w < num_words) else oov_char for w in x]
              for x in xs]
    else:
        xs = [[w for w in x if skip_top <= w < num_words] for x in xs]

    idx = len(x_train)
    x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
    x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])
    return (x_train, y_train), (x_test, y_test)
Now you can quickly follow chapter 3.4 of Deep Learning with Python with our local data. The environment tested here is macOS, Python 2.7, Keras 2.1.4, TensorFlow 1.6.0, CPU only; model training is also very fast without a GPU. Please note the file structure from which the code is executed; if yours is different, please specify the data path:
In [1]: import aclimdb

# Note the default npz path is "data/aclimdb.npz"; if yours is different, specify it:
In [2]: (train_data, train_labels), (test_data, test_labels) = aclimdb.load_data(num_words=10000)

In [3]: train_data[0]
Out[3]: [1, 7799, 1459, ... 11, 13, 3320, 2]

In [4]: train_labels[0]
Out[4]: 0

In [5]: max([max(sequence) for sequence in train_data])
Out[5]: 9999

# Note the default index path is "data/aclimdb_word_index.json"
In [6]: word_index = aclimdb.get_word_index()

In [8]: reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

In [9]: decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])

In [10]: decoded_review
Out[10]: u'? hi folks < br / > < br / > forget about that movie . john c. should be ashamed that he appears as executive producer in the ? bon ? has never been and will never be an actor and the fx are a joke . < br / > < br / > the first vampires was good ... and it was the only vampires . this thing here just wears the same name . < br / > < br / > just a waste of time thinks ... < br / > < br / > jake ?'

In [11]: import numpy as np

In [13]: def vectorize_sequences(sequences, dimension=10000):
    ...:     results = np.zeros((len(sequences), dimension))
    ...:     for i, sequence in enumerate(sequences):
    ...:         results[i, sequence] = 1
    ...:     return results
    ...:

In [14]: x_train = vectorize_sequences(train_data)

In [15]: x_test = vectorize_sequences(test_data)

In [16]: x_train[0]
Out[16]: array([0., 1., 1., ..., 0., 0., 0.])

In [17]: y_train = np.asarray(train_labels).astype('float32')

In [18]: y_test = np.asarray(test_labels).astype('float32')

In [19]: from keras import models
Using TensorFlow backend.

In [20]: from keras import layers

In [21]: model = models.Sequential()

In [22]: model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))

In [23]: model.add(layers.Dense(16, activation='relu'))

In [24]: model.add(layers.Dense(1, activation='sigmoid'))

In [25]: model.compile(optimizer='rmsprop',
    ...:               loss='binary_crossentropy',
    ...:               metrics=['accuracy'])

In [26]: model.fit(x_train, y_train, epochs=4, batch_size=512)
Epoch 1/4
25000/25000 [==============================] - 3s 140us/step - loss: 0.4544 - acc: 0.8192
Epoch 2/4
25000/25000 [==============================] - 2s 93us/step - loss: 0.2632 - acc: 0.9077
Epoch 3/4
25000/25000 [==============================] - 2s 92us/step - loss: 0.2053 - acc: 0.9244
Epoch 4/4
25000/25000 [==============================] - 2s 92us/step - loss: 0.1708 - acc: 0.9388
Out[26]: <keras.callbacks.History at 0x206cfdc10>

In [27]: results = model.evaluate(x_test, y_test)
25000/25000 [==============================] - 4s 145us/step

In [28]: results
Out[28]: [0.2953770682477951, 0.88304]

In [29]: model.predict(x_test)
Out[29]:
array([[9.9612302e-01],
       [9.5416462e-01],
       [1.5807265e-05],
       ...,
       [9.9868757e-01],
       [8.4713501e-01],
       [5.7828808e-01]], dtype=float32)
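Finally, to see how all the pieces fit together, here is a hedged sketch of scoring a brand-new raw review with the model trained in the session above. The helper predict_sentiment and the sample texts are mine, not part of the original scripts; it simply replays the preprocessing chain described earlier (Moses tokenization, lowercasing, word-index lookup, the +3 offset with start/OOV characters from aclimdb.load_data, and multi-hot vectorization):

# Hypothetical helper, assuming the trained `model` from the session above is in scope
import numpy as np
from sacremoses import MosesTokenizer

import aclimdb

tokenizer = MosesTokenizer()
word_index = aclimdb.get_word_index()


def predict_sentiment(text, model, num_words=10000,
                      start_char=1, oov_char=2, index_from=3):
    # Same tokenization and lowercasing as build_word_index.py / build_data_index.py
    tokens = [w.lower() for w in tokenizer.tokenize(text.strip())]
    # Unknown words were mapped to 0 by build_data_index, so default to 0 here,
    # then apply the same index_from shift as aclimdb.load_data
    ids = [word_index.get(w, 0) + index_from for w in tokens]
    ids = [start_char] + [i if i < num_words else oov_char for i in ids]
    # Multi-hot vectorization, same idea as vectorize_sequences above
    x = np.zeros((1, num_words))
    x[0, ids] = 1.
    return float(model.predict(x)[0][0])


print(predict_sentiment("Just a waste of time, the acting is a joke.", model))
print(predict_sentiment("An extraordinary movie, very nearly perfect.", model))
# The first score should lean toward 0 (negative), the second toward 1 (positive).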
For more, you can read and learn from the book Deep Learning with Python; it is very good, and I recommend it again.
Posted by TextMiner