Dive Into NLTK, Part IV: Stemming and Lemmatization
This is the fourth article in the series “Dive Into NLTK”. Here is an index of all the articles in the series that have been published to date:
Part I: Getting Started with NLTK
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
Part IX: From Text Classification to Sentiment Analysis
Part X: Play With Word2Vec Models based on NLTK Corpus
Stemming and lemmatization are basic text processing methods for English text. The goal of both is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Here are the definitions from Wikipedia for stemming and lemmatization:
In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
Stemming programs are commonly referred to as stemming algorithms or stemmers.
Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.
In computational linguistics, lemmatisation is the algorithmic process of determining the lemma for a given word. Since the process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence (requiring, for example, knowledge of the grammar of a language) it can be a hard task to implement a lemmatiser for a new language.
In many languages, words appear in several inflected forms. For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word. The combination of the base form with the part of speech is often called the lexeme of the word.
Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
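To make that difference concrete, here is a quick sketch (the word "meeting" is our own illustrative choice, and exact outputs may vary slightly across NLTK versions): a stemmer reduces "meeting" to "meet" no matter the context, while a lemmatizer can keep the noun reading intact once it is told the part of speech:
>>> from nltk.stem.porter import PorterStemmer
>>> from nltk.stem import WordNetLemmatizer
>>> PorterStemmer().stem('meeting')
u'meet'
>>> WordNetLemmatizer().lemmatize('meeting', pos='n')
'meeting'
>>> WordNetLemmatizer().lemmatize('meeting', pos='v')
u'meet'
Both tools are covered in detail below.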
How to use Stemmer in NLTK
NLTK provides interfaces to several well-known stemmers, such as the Porter Stemmer, the Lancaster Stemmer, and the Snowball Stemmer. Using these stemmers in NLTK is very simple.
The Porter Stemmer, which is based on the Porter Stemming Algorithm, can be used like this:
>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
u'maximum'
>>> porter_stemmer.stem('presumably')
u'presum'
>>> porter_stemmer.stem('multiply')
u'multipli'
>>> porter_stemmer.stem('provision')
u'provis'
>>> porter_stemmer.stem('owed')
u'owe'
>>> porter_stemmer.stem('ear')
u'ear'
>>> porter_stemmer.stem('saying')
u'say'
>>> porter_stemmer.stem('crying')
u'cri'
>>> porter_stemmer.stem('string')
u'string'
>>> porter_stemmer.stem('meant')
u'meant'
>>> porter_stemmer.stem('cement')
u'cement'
>>>
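A stemmer works on one token at a time, so for running text you would normally tokenize first and then stem each token. A minimal sketch (the sample sentence is made up for illustration, and exact outputs may differ by NLTK version):
>>> from nltk.tokenize import word_tokenize
>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> [porter_stemmer.stem(token) for token in word_tokenize('Provision of the data was presumably delayed')]
[u'provis', u'of', u'the', u'data', u'wa', u'presum', u'delay']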
The Lancaster Stemmer, which is based on the Lancaster Stemming Algorithm, can be used in NLTK like this:
>>> from nltk.stem.lancaster import LancasterStemmer
>>> lancaster_stemmer = LancasterStemmer()
>>> lancaster_stemmer.stem('maximum')
'maxim'
>>> lancaster_stemmer.stem('presumably')
'presum'
>>> lancaster_stemmer.stem('multiply')
'multiply'
>>> lancaster_stemmer.stem('provision')
u'provid'
>>> lancaster_stemmer.stem('owed')
'ow'
>>> lancaster_stemmer.stem('ear')
'ear'
>>> lancaster_stemmer.stem('saying')
'say'
>>> lancaster_stemmer.stem('crying')
'cry'
>>> lancaster_stemmer.stem('string')
'string'
>>> lancaster_stemmer.stem('meant')
'meant'
>>> lancaster_stemmer.stem('cement')
'cem'
>>>
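Comparing this transcript with the Porter one above shows how much more aggressive the Lancaster rules can be. Continuing the same interpreter session, a quick side-by-side (the triples below simply restate the outputs already shown above):
>>> words = ('maximum', 'multiply', 'crying', 'cement')
>>> [(w, porter_stemmer.stem(w), lancaster_stemmer.stem(w)) for w in words]
[('maximum', u'maximum', 'maxim'), ('multiply', u'multipli', 'multiply'), ('crying', u'cri', 'cry'), ('cement', u'cement', 'cem')]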
The Snowball Stemmer, which is based on the Snowball Stemming Algorithm, can be used in NLTK like this:
>>> from nltk.stem import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem('maximum')
u'maximum'
>>> snowball_stemmer.stem('presumably')
u'presum'
>>> snowball_stemmer.stem('multiply')
u'multipli'
>>> snowball_stemmer.stem('provision')
u'provis'
>>> snowball_stemmer.stem('owed')
u'owe'
>>> snowball_stemmer.stem('ear')
u'ear'
>>> snowball_stemmer.stem('saying')
u'say'
>>> snowball_stemmer.stem('crying')
u'cri'
>>> snowball_stemmer.stem('string')
u'string'
>>> snowball_stemmer.stem('meant')
u'meant'
>>> snowball_stemmer.stem('cement')
u'cement'
>>>
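Unlike the Porter and Lancaster stemmers, the Snowball stemmer is multilingual: the class carries a tuple of supported languages, and the English stemmer can optionally leave stopwords untouched. A small sketch (the exact language list depends on your NLTK version, hence the ellipsis):
>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'english', ..., 'spanish', 'swedish')
>>> snowball_stemmer2 = SnowballStemmer('english', ignore_stopwords=True)
>>> snowball_stemmer2.stem('having')
'having'
With ignore_stopwords=True, stopwords such as 'having' are passed through unchanged instead of being stemmed to 'have'.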
How to use Lemmatizer in NLTK
The NLTK lemmatization method is based on WordNet’s built-in morphy function. Here is the introduction from the official WordNet website:
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.
WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms—strings of letters—but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus does not follow any explicit pattern other than meaning similarity.
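In fact, you can call morphy directly through NLTK's WordNet corpus reader. A brief sketch (outputs may vary with the WordNet data version; morphy returns None, i.e. nothing in the interactive shell, for words WordNet does not know):
>>> from nltk.corpus import wordnet
>>> wordnet.morphy('churches')
u'church'
>>> wordnet.morphy('better', wordnet.ADJ)
u'good'
>>> wordnet.morphy('hardrock')
>>>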
The WordNetLemmatizer wraps this function. In NLTK, you can use it as follows:
>>> from nltk.stem import WordNetLemmatizer
>>> wordnet_lemmatizer = WordNetLemmatizer()
>>> wordnet_lemmatizer.lemmatize('dogs')
u'dog'
>>> wordnet_lemmatizer.lemmatize('churches')
u'church'
>>> wordnet_lemmatizer.lemmatize('aardwolves')
u'aardwolf'
>>> wordnet_lemmatizer.lemmatize('abaci')
u'abacus'
>>> wordnet_lemmatizer.lemmatize('hardrock')
'hardrock'
>>> wordnet_lemmatizer.lemmatize('are')
'are'
>>> wordnet_lemmatizer.lemmatize('is')
'is'
You may notice that the results for "are" and "is" are not "be". That is because the lemmatize method's pos argument defaults to "n" (noun):
lemmatize(word, pos='n')
So you need to specify the pos for the word, like this:
>>> wordnet_lemmatizer.lemmatize('is', pos='v')
u'be'
>>> wordnet_lemmatizer.lemmatize('are', pos='v')
u'be'
>>>
We apply POS tagging before word lemmatization and have implemented this in our Text Analysis API, so you can test and use lemmatization there without specifying the pos tag yourself; a sketch of the same idea follows.
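Here is a minimal sketch of that pipeline (the helper name penn_to_wordnet is our own, and the sample sentence is made up): run the NLTK POS tagger first, map the Penn Treebank tags onto WordNet's pos values, and only then lemmatize:
>>> from nltk import pos_tag, word_tokenize
>>> from nltk.corpus import wordnet
>>> from nltk.stem import WordNetLemmatizer
>>> def penn_to_wordnet(tag):
...     # Map a Penn Treebank tag onto a WordNet pos constant (noun by default)
...     if tag.startswith('J'):
...         return wordnet.ADJ
...     elif tag.startswith('V'):
...         return wordnet.VERB
...     elif tag.startswith('R'):
...         return wordnet.ADV
...     return wordnet.NOUN
...
>>> wordnet_lemmatizer = WordNetLemmatizer()
>>> [wordnet_lemmatizer.lemmatize(word, penn_to_wordnet(tag)) for word, tag in pos_tag(word_tokenize('The dogs are running quickly'))]
['The', u'dog', u'be', u'run', 'quickly']
Note how "are" now correctly becomes "be" because the tagger marked it as a verb.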
The Stem Module in NLTK
You can find the stem module in nltk-master/nltk/stem:
YangtekiMacBook-Pro:stem textminer$ ls -l
total 456
-rwxr-xr-x@ 1 textminer staff 1270 7 22 2013 __init__.py
-rwxr-xr-x@ 1 textminer staff 798 7 22 2013 api.py
-rwxr-xr-x@ 1 textminer staff 17068 7 22 2013 isri.py
-rwxr-xr-x@ 1 textminer staff 11337 7 22 2013 lancaster.py
-rwxr-xr-x@ 1 textminer staff 24735 7 22 2013 porter.py
-rwxr-xr-x@ 1 textminer staff 1701 7 22 2013 regexp.py
-rwxr-xr-x@ 1 textminer staff 5563 7 22 2013 rslp.py
-rwxr-xr-x@ 1 textminer staff 146857 7 22 2013 snowball.py
-rwxr-xr-x@ 1 textminer staff 1513 7 22 2013 wordnet.py
# Natural Language Toolkit: Stemmers
#
# Copyright (C) 2001-2013 NLTK Project
# Author: Trevor Cohn <tacohn@cs.mu.oz.au>
#         Edward Loper <edloper@gradient.cis.upenn.edu>
#         Steven Bird <stevenbird1@gmail.com>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT

"""
NLTK Stemmers

Interfaces used to remove morphological affixes from words, leaving
only the word stem. Stemming algorithms aim to remove those affixes
required for eg. grammatical role, tense, derivational morphology
leaving only the stem of the word. This is a difficult problem due to
irregular words (eg. common verbs in English), complicated
morphological rules, and part-of-speech and sense ambiguities
(eg. ``ceil-`` is not the stem of ``ceiling``).

StemmerI defines a standard interface for stemmers.
"""

from nltk.stem.api import StemmerI
from nltk.stem.regexp import RegexpStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.isri import ISRIStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.rslp import RSLPStemmer

if __name__ == "__main__":
    import doctest
    doctest.testmod(optionflags=doctest.NORMALIZE_WHITESPACE)
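Note the imports in __init__.py above: besides the Porter, Lancaster, and Snowball stemmers covered in this article, the module also exports a RegexpStemmer, plus an ISRIStemmer for Arabic and an RSLPStemmer for Portuguese. As a small sketch of the RegexpStemmer (the pattern and minimum length below are our own illustrative choices):
>>> from nltk.stem import RegexpStemmer
>>> regexp_stemmer = RegexpStemmer('ing$|s$|e$', min=4)
>>> regexp_stemmer.stem('cars')
'car'
>>> regexp_stemmer.stem('was')
'was'
Words shorter than min characters, such as 'was', are left unchanged.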
Read the code, and change the world! Now it’s your turn!
Posted by TextMiner
I have gone through the text. I have been working on text summarization for the Gujarati language, and I want to develop lemmatization for it. May I know whether there is any provision for lemmatizing Gujarati words?
Sorry, I don't know. There is no package to use for it; you would have to build your own using a dictionary, which is a huge task. Alternatively, you could translate each word of the Gujarati text into English, lemmatize it, and then translate it back into Gujarati. That may not be accurate, but it could serve the purpose.
Sir, I think translating each word from a native language into English and then lemmatizing is not a very good method. I think we have to work with Unicode for our language. Do you have any ideas about training for one's own language?
How can I build my own package? Which language should I use? Is Python itself enough for building such a package? I am trying to do this for the Tamil language.
Sorry, I can't help you more. It doesn't depend on the programming language; it depends on your target language, such as Tamil.