Dive Into NLTK, Part VI: Add Stanford Word Segmenter Interface for Python NLTK
This is the sixth article in the series "Dive Into NLTK". Here is an index of all the articles in the series that have been published to date:
Part I: Getting Started with NLTK
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
Part IX: From Text Classification to Sentiment Analysis
Part X: Play With Word2Vec Models based on NLTK Corpus
Stanford Word Segmenter is one of the open source Java text analysis tools provided by the Stanford NLP Group. We have already covered how to use the Stanford text analysis tools in NLTK, since NLTK provides interfaces for Stanford NLP tools such as the POS Tagger, Named Entity Recognizer and Parser. But for the Stanford Word Segmenter there is no interface in NLTK, and as far as I could find by googling, none in Python at all. So I decided to write a Stanford Segmenter interface for NLTK, like the ones for the tagger and parser.
But before you can use it in Python NLTK, you should first install the latest version of NLTK from source; here we recommend the development version of NLTK on GitHub: https://github.com/nltk/nltk. Second, you need to install a Java environment. The following are the steps on an Ubuntu 12.04 VPS:
sudo apt-get update
Then, check whether Java is already installed: java -version
If it returns "The program java can be found in the following packages", Java hasn't been installed yet, so execute the following command: sudo apt-get install default-jre
This will install the Java Runtime Environment (JRE). If you instead need the Java Development Kit (JDK), which is usually required to compile Java applications (for example Apache Ant, Apache Maven, Eclipse and IntelliJ IDEA), execute the following command: sudo apt-get install default-jdk
That is everything that is needed to install Java.
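As an optional sanity check from Python, you can confirm that a java binary is reachable on the PATH, since the NLTK java/config_java helpers used by the segmenter interface below depend on it. A minimal sketch:

# Optional sanity check: confirm a `java` binary is on the PATH,
# since NLTK's config_java/java helpers used by the segmenter rely on it.
import subprocess

try:
    # `java -version` prints to stderr, so fold stderr into stdout.
    out = subprocess.check_output(['java', '-version'], stderr=subprocess.STDOUT)
    print(out.decode('UTF-8', 'replace'))
except OSError:
    print('java not found on PATH -- install a JRE/JDK first')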
The last step is to download and unzip the latest Stanford Word Segmenter package: Download Stanford Word Segmenter version 2014-08-27.
In the NLTK code, the Stanford Tagger interface is in nltk/tag/stanford.py and the Stanford Parser interface is in nltk/parse/stanford.py. We wanted to add the Stanford Segmenter to the nltk/tokenize directory, but found an existing stanford.py there which supports the Stanford PTBTokenizer. So we added a stanford_segmenter.py to the nltk/tokenize directory, which serves as the Stanford Word Segmenter interface and is based on a Linux PIPE and the Python subprocess module:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Natural Language Toolkit: Interface to the Stanford Chinese Segmenter
#
# Copyright (C) 2001-2014 NLTK Project
# Author: 52nlp <52nlpcn@gmail.com>
#
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT

from __future__ import unicode_literals, print_function

import tempfile
import os
import json

from subprocess import PIPE

from nltk import compat
from nltk.internals import find_jar, config_java, java, _java_options

from nltk.tokenize.api import TokenizerI


class StanfordSegmenter(TokenizerI):
    r"""
    Interface to the Stanford Segmenter

    >>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
    >>> segmenter = StanfordSegmenter(path_to_jar="stanford-segmenter-3.4.1.jar", path_to_sihan_corpora_dict="./data", path_to_model="./data/pku.gz", path_to_dict="./data/dict-chris6.ser.gz")
    >>> sentence = u"这是斯坦福中文分词器测试"
    >>> segmenter.segment(sentence)
    >>> u'\u8fd9 \u662f \u65af\u5766\u798f \u4e2d\u6587 \u5206\u8bcd\u5668 \u6d4b\u8bd5\n'
    >>> segmenter.segment_file("test.simp.utf8")
    >>> u'\u9762\u5bf9 \u65b0 \u4e16\u7eaa \uff0c \u4e16\u754c \u5404\u56fd ...
    """

    _JAR = 'stanford-segmenter.jar'

    def __init__(self, path_to_jar=None,
                 path_to_sihan_corpora_dict=None,
                 path_to_model=None, path_to_dict=None,
                 encoding='UTF-8', options=None,
                 verbose=False, java_options='-mx2g'):
        self._stanford_jar = find_jar(
            self._JAR, path_to_jar,
            env_vars=('STANFORD_SEGMENTER',),
            searchpath=(), verbose=verbose
        )
        self._sihan_corpora_dict = path_to_sihan_corpora_dict
        self._model = path_to_model
        self._dict = path_to_dict

        self._encoding = encoding
        self.java_options = java_options
        options = {} if options is None else options
        self._options_cmd = ','.join('{0}={1}'.format(key, json.dumps(val)) for key, val in options.items())

    def segment_file(self, input_file_path):
        """
        """
        cmd = [
            'edu.stanford.nlp.ie.crf.CRFClassifier',
            '-sighanCorporaDict', self._sihan_corpora_dict,
            '-textFile', input_file_path,
            '-sighanPostProcessing', 'true',
            '-keepAllWhitespaces', 'false',
            '-loadClassifier', self._model,
            '-serDictionary', self._dict
        ]

        stdout = self._execute(cmd)

        return stdout

    def segment(self, tokens):
        return self.segment_sents([tokens])

    def segment_sents(self, sentences):
        """
        """
        encoding = self._encoding
        # Create a temporary input file
        _input_fh, self._input_file_path = tempfile.mkstemp(text=True)

        # Write the actual sentences to the temporary input file
        _input_fh = os.fdopen(_input_fh, 'wb')
        _input = '\n'.join((' '.join(x) for x in sentences))
        if isinstance(_input, compat.text_type) and encoding:
            _input = _input.encode(encoding)
        _input_fh.write(_input)
        _input_fh.close()

        cmd = [
            'edu.stanford.nlp.ie.crf.CRFClassifier',
            '-sighanCorporaDict', self._sihan_corpora_dict,
            '-textFile', self._input_file_path,
            '-sighanPostProcessing', 'true',
            '-keepAllWhitespaces', 'false',
            '-loadClassifier', self._model,
            '-serDictionary', self._dict
        ]

        stdout = self._execute(cmd)

        # Delete the temporary file
        os.unlink(self._input_file_path)

        return stdout

    def _execute(self, cmd, verbose=False):
        encoding = self._encoding
        cmd.extend(['-inputEncoding', encoding])
        _options_cmd = self._options_cmd
        if _options_cmd:
            cmd.extend(['-options', self._options_cmd])

        default_options = ' '.join(_java_options)

        # Configure java.
        config_java(options=self.java_options, verbose=verbose)

        stdout, _stderr = java(cmd, classpath=self._stanford_jar, stdout=PIPE, stderr=PIPE)
        stdout = stdout.decode(encoding)

        # Return java configurations to their default values.
        config_java(options=default_options, verbose=False)

        return stdout


def setup_module(module):
    from nose import SkipTest

    try:
        StanfordSegmenter()
    except LookupError:
        raise SkipTest('doctests from nltk.tokenize.stanford_segmenter are skipped because the stanford segmenter jar doesn\'t exist')
We have forked the latest NLTK project and added stanford_segmenter.py to it. You can get this version, or just add stanford_segmenter.py to the nltk/tokenize/ directory of your latest NLTK package and reinstall it. A usage example is shown in the code; to test it, you need to "cd stanford-segmenter-2014-08-27" first, then try it in the Python interpreter:
>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> segmenter = StanfordSegmenter(path_to_jar="stanford-segmenter-3.4.1.jar", path_to_sihan_corpora_dict="./data", path_to_model="./data/pku.gz", path_to_dict="./data/dict-chris6.ser.gz")
>>> sentence = u"这是斯坦福中文分词器测试"
>>> segmenter.segment(sentence)
u'\u8fd9 \u662f \u65af\u5766\u798f \u4e2d\u6587 \u5206\u8bcd\u5668 \u6d4b\u8bd5\n'
>>> segmenter.segment_file("test.simp.utf8")
u'\u9762\u5bf9 \u65b0 \u4e16\u7eaa \uff0c \u4e16\u754c \u5404\u56fd ...
>>> outfile = open('outfile', 'w')
>>> result = segmenter.segment(sentence)
>>> outfile.write(result.encode('UTF-8'))
>>> outfile.close()
Opening the outfile, we get: 这 是 斯坦福 中文 分词器 测试.
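As a side note, instead of passing path_to_jar explicitly you can let find_jar locate the jar through the STANFORD_SEGMENTER environment variable (see the env_vars argument in the code above). A minimal sketch, assuming the variable may point directly at the jar file inside the unzipped package:

# A minimal sketch using the STANFORD_SEGMENTER environment variable instead
# of path_to_jar; it assumes find_jar accepts a direct path to the jar here.
import os
from nltk.tokenize.stanford_segmenter import StanfordSegmenter

os.environ['STANFORD_SEGMENTER'] = '/path/to/stanford-segmenter-2014-08-27/stanford-segmenter-3.4.1.jar'

segmenter = StanfordSegmenter(
    path_to_sihan_corpora_dict='./data',
    path_to_model='./data/pku.gz',
    path_to_dict='./data/dict-chris6.ser.gz',
)
print(segmenter.segment(u'这是斯坦福中文分词器测试'))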
The problem we met here is that every time the "segment" or "segment_file" method is executed, the interface has to load the model and dictionary again. I tried the "readStdin" option and the communicate method of the subprocess module, but could not resolve this problem. After googling the PIPE and subprocess documentation for a long time, I still can't find a proper way to load the model and dictionary once and then run segmentations one by one without reloading the data. Can you suggest a method to resolve this problem?
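For reference, the rough shape of the persistent-process idea looks like the sketch below; it assumes the CRFClassifier's readStdin mode emits one segmented line per input line and flushes promptly, which is exactly the part I have not been able to get working reliably:

# A rough sketch of the persistent-process idea (not working reliably for me):
# start the CRFClassifier once in readStdin mode and keep feeding it lines,
# instead of spawning a new JVM (and reloading the model) on every call.
import subprocess

cmd = [
    'java', '-mx2g', '-cp', 'stanford-segmenter-3.4.1.jar',
    'edu.stanford.nlp.ie.crf.CRFClassifier',
    '-sighanCorporaDict', './data',
    '-loadClassifier', './data/pku.gz',
    '-serDictionary', './data/dict-chris6.ser.gz',
    '-inputEncoding', 'UTF-8',
    '-sighanPostProcessing', 'true',
    '-keepAllWhitespaces', 'false',
    '-readStdin',
]
proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def segment_line(line):
    # Write one raw line, read one segmented line back; this assumes the Java
    # side flushes a full line per input line, which may not hold in practice.
    proc.stdin.write(line.encode('UTF-8') + b'\n')
    proc.stdin.flush()
    return proc.stdout.readline().decode('UTF-8')

print(segment_line(u'这是斯坦福中文分词器测试'))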
Wonderful article!
To your problem:
I have met the same problem and resolved it with a queue-based approach. The basic idea is to create multiple Docker containers and give them different roles. One container is used as the input: it just receives the input strings and puts them into a RabbitMQ queue. On the other side of the queue there are multiple containers used as workers, each of them loading and running the analysis program (the code shown in the console of this article). These workers then put the results into another queue in RabbitMQ, and on the other side of that queue there is a container used as the output. This architecture is used to process multiple messages in parallel.
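A rough sketch of what one worker container might run, assuming a RabbitMQ broker on localhost, made-up queue names "to_segment" and "segmented", and the pika client library (the segmentation call itself is just the interface from this article):

# Worker sketch: pull raw sentences from one queue, segment them with the
# interface from this article, and publish results to a second queue.
import pika
from nltk.tokenize.stanford_segmenter import StanfordSegmenter

segmenter = StanfordSegmenter(
    path_to_jar='stanford-segmenter-3.4.1.jar',
    path_to_sihan_corpora_dict='./data',
    path_to_model='./data/pku.gz',
    path_to_dict='./data/dict-chris6.ser.gz',
)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='to_segment')
channel.queue_declare(queue='segmented')

for method_frame, properties, body in channel.consume('to_segment'):
    # Each message is one raw UTF-8 sentence; segment it and forward the result.
    result = segmenter.segment(body.decode('UTF-8'))
    channel.basic_publish(exchange='', routing_key='segmented',
                          body=result.encode('UTF-8'))
    channel.basic_ack(method_frame.delivery_tag)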
If you need more help, e.g. a diagram of the architecture, please feel free to send me an email.
Adrian
thanks a lot
Hi textminingonline, do you mind if we add this code to the official NLTK codebase?
No problem, thanks
>>> segmenter = StanfordSegmenter(path_to_jar="stanford-segmenter-3.6.0.jar", path_to_sihan_corpora_dict="./data", path_to_model="./data/pku.gz", path_to_dict="./data/dict-chris6.ser.gz")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Can’t instantiate abstract class StanfordSegmenter with abstract methods tokenize
—
Hi, I'm new to Python and have some Chinese data to process. I accidentally found this and your earlier post on using Stanford text analysis tools in Python. I downloaded the latest segmenter (3.6.0) and followed (most of) your instructions, but ended up with the error above. Do you know what could have caused it? Thanks!
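The error message says tokenize is an abstract method, so this looks like a newer NLTK in which TokenizerI declares tokenize as abstract while the stanford_segmenter.py shown above does not implement it. One possible workaround (an assumption on my part, untested with segmenter 3.6.0) is a thin subclass that supplies tokenize:

# Possible workaround (untested with segmenter 3.6.0): newer NLTK versions make
# TokenizerI.tokenize abstract, so provide it in a small subclass and delegate
# to segment(), whose output is a space-separated string.
from nltk.tokenize.stanford_segmenter import StanfordSegmenter

class SegmenterWrapper(StanfordSegmenter):
    def tokenize(self, s):
        return self.segment(s).split()

segmenter = SegmenterWrapper(path_to_jar="stanford-segmenter-3.6.0.jar",
                             path_to_sihan_corpora_dict="./data",
                             path_to_model="./data/pku.gz",
                             path_to_dict="./data/dict-chris6.ser.gz")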