Getting Started with Pattern

We have already covered NLTK and TextBlob; now it is time to get started with Pattern.

About Pattern

According to the official Pattern website:

Pattern is a web mining module for the Python programming language.

It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and canvas visualization.

Pattern is not only a Python NLP tool for text processing; it is also a text and data mining toolkit that includes a crawler, a parser, and more. In this article, however, we focus on Pattern's text analysis modules.

Installing Pattern

The simplest way to install Pattern is using pip:

$ pip install pattern

This will automatically download and install Pattern from the PyPI repository.

Alternatively, install Pattern from source: download the Pattern 2.6 archive, unzip it, and then run:

$ cd pattern-2.6
$ python setup.py install

Note that Pattern requires Python 2.5+ and does not support Python 3 yet. The module has no external dependencies, except for LSA in the pattern.vector module, which requires NumPy.

Pattern Modules

Pattern includes many modules. Here we focus on just one of them, the English text analysis module pattern.en.

Text Analysis Module pattern.en

The pattern.en module handles English text processing: word tokenization, sentence tokenization (sentence segmentation), POS tagging (part-of-speech tagging), English grammar helpers (indefinite article, pluralization and singularization, comparative and superlative), sentiment analysis, parsing (identifying nouns, adjectives, and so on), a WordNet interface, and more. We will show how to use them in Python one by one.

1) Word Tokenization and Sentence Segmentation

The tokenize method in pattern.en supports both word tokenization and sentence segmentation for English text:

>>> from pattern.en import tokenize
>>> word_tokenize_test = "this's pattern word tokenize"
>>> tokenize(word_tokenize_test)
["this 's pattern word tokenize"]

>>> sent_tokenize_test = "Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation), and in computer science, where it forms part of lexical analysis."
>>> result = tokenize(sent_tokenize_test)
>>> result
['Tokenization is the process of breaking a stream of text up into words , phrases , symbols , or other meaningful elements called tokens .', 'The list of tokens becomes input for further processing such as parsing or text mining .', 'Tokenization is useful both in linguistics ( where it is a form of text segmentation ) , and in computer science , where it forms part of lexical analysis .']
>>> len(result)
3
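
Because tokenize returns one string per sentence, with the tokens separated by single spaces, a plain split is enough to get a word list for each sentence. The following is a minimal sketch that assumes this output shape and uses two strings copied from the results above as sample data, so it runs without Pattern installed. The helper name to_word_lists is made up for illustration and is not part of Pattern.

```python
# Sample data copied from the tokenize() results shown above:
# one string per sentence, tokens separated by single spaces.
tokenized = [
    "this 's pattern word tokenize",
    "The list of tokens becomes input for further processing such as parsing or text mining .",
]


def to_word_lists(tokenized_sentences):
    """Split each space-delimited sentence string into a list of tokens.

    Hypothetical helper for illustration, not a Pattern function.
    """
    return [sentence.split(" ") for sentence in tokenized_sentences]


word_lists = to_word_lists(tokenized)
for words in word_lists:
    print(words)
```

Note that punctuation marks and clitics such as 's come back as separate tokens, which is usually what downstream tagging or counting code expects.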

2) POS Tagging

The tag method in pattern.en performs POS tagging of English text. The part-of-speech tags follow the Penn Treebank II tag set; the details can be found here: Penn Treebank II tag set

>>> from pattern.en import tag
>>> pos_tagging_test = "In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc."
>>> tagged_result = tag(pos_tagging_test)
>>> tagged_result
[(u'In', u'IN'), (u'corpus', u'NN'), (u'linguistics', u'NNS'), (u',', u','), (u'part-of-speech', u'NN'), (u'tagging', u'VBG'), (u'(', u'('), (u'POS', u'JJ'), (u'tagging', u'VBG'), (u'or', u'CC'), (u'POST', u'NNP'), (u')', u')'), (u',', u','), (u'also', u'RB'), (u'called', u'VBD'), (u'grammatical', u'JJ'), (u'tagging', u'VBG'), (u'or', u'CC'), (u'word-category', u'VBD'), (u'disambiguation', u'NN'), (u',', u','), (u'is', u'VBZ'), (u'the', u'DT'), (u'process', u'NN'), (u'of', u'IN'), (u'marking', u'VBG'), (u'up', u'RP'), (u'a', u'DT'), (u'word', u'NN'), (u'in', u'IN'), (u'a', u'DT'), (u'text', u'NN'), (u'(', u'('), (u'corpus', u'NN'), (u')', u')'), (u'as', u'IN'), (u'corresponding', u'JJ'), (u'to', u'TO'), (u'a', u'DT'), (u'particular', u'JJ'), (u'part', u'NN'), (u'of', u'IN'), (u'speech', u'NN'), (u',', u','), (u'based', u'VBN'), (u'on', u'IN'), (u'both', u'DT'), (u'its', u'PRP$'), (u'definition', u'NN'), (u',', u','), (u'as', u'RB'), (u'well', u'RB'), (u'as', u'IN'), (u'its', u'PRP$'), (u'context\u2014i.e', u'NN'), (u'.', u'.'), (u'relationship', u'NN'), (u'with', u'IN'), (u'adjacent', u'JJ'), (u'and', u'CC'), (u'related', u'JJ'), (u'words', u'NNS'), (u'in', u'IN'), (u'a', u'DT'), (u'phrase', u'NN'), (u',', u','), (u'sentence', u'NN'), (u',', u','), (u'or', u'CC'), (u'paragraph', u'NN'), (u'.', u'.'), (u'A', u'DT'), (u'simplified', u'JJ'), (u'form', u'NN'), (u'of', u'IN'), (u'this', u'DT'), (u'is', u'VBZ'), (u'commonly', u'RB'), (u'taught', u'NN'), (u'to', u'TO'), (u'school-age', u'JJ'), (u'children', u'NNS'), (u',', u','), (u'in', u'IN'), (u'the', u'DT'), (u'identification', u'NN'), (u'of', u'IN'), (u'words', u'NNS'), (u'as', u'IN'), (u'nouns', u'NNS'), (u',', u','), (u'verbs', u'NNS'), (u',', u','), (u'adjectives', u'NNS'), (u',', u','), (u'adverbs', u'NNS'), (u',', u','), (u'etc.', u'NN')]
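
Since the result is an ordinary list of (word, tag) pairs, picking out one word class is a matter of filtering on the tag. The sketch below assumes that output shape and uses the first six pairs from the result above as sample data, so it does not need Pattern to run. The helper words_with_tag_prefix is a hypothetical name, not a Pattern function.

```python
# Sample data: the first six (word, tag) pairs from the tag() result above.
tagged = [
    ("In", "IN"), ("corpus", "NN"), ("linguistics", "NNS"), (",", ","),
    ("part-of-speech", "NN"), ("tagging", "VBG"),
]


def words_with_tag_prefix(tagged_words, prefix):
    """Return the words whose Penn Treebank tag starts with `prefix`.

    Hypothetical helper for illustration, not a Pattern function.
    """
    return [word for word, pos in tagged_words if pos.startswith(prefix)]


nouns = words_with_tag_prefix(tagged, "NN")  # matches NN, NNS, NNP, NNPS
verbs = words_with_tag_prefix(tagged, "VB")  # matches VB, VBD, VBG, VBN, VBP, VBZ
print(nouns)
print(verbs)
```

Matching on a prefix works here because the Penn Treebank noun tags all begin with NN and the verb tags all begin with VB.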

3) Indefinite Article

According to Wikipedia:

An indefinite article indicates that its noun is not a particular one (or ones) identifiable to the listener. It may be something that the speaker is mentioning for the first time, or its precise identity may be irrelevant or hypothetical, or the speaker may be making a general statement about any such thing. English uses a/an, from the Old English forms of the number ‘one’, as its primary indefinite article. The form an is used before words that begin with a vowel sound (even if spelled with an initial consonant, as in an hour), and a before words that begin with a consonant sound (even if spelled with a vowel, as in a European).

In the pattern.en module, you can use the referenced method to prepend the indefinite article to a word:

>>> from pattern.en import referenced
>>> referenced("book")
'a book'
>>> referenced("hour")
'an hour'
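
To see why the choice depends on sound rather than spelling, here is a deliberately crude sketch of the rule quoted above. It is not how Pattern implements referenced; the function name naive_referenced and the tiny exception lists are assumptions made up for this illustration, and they are far from complete.

```python
VOWELS = "aeiou"

# Tiny, incomplete exception lists chosen only for this illustration.
SILENT_H_PREFIXES = ("hour", "honest", "honor", "heir")  # vowel sound, consonant spelling
CONSONANT_SOUND_PREFIXES = ("eu", "one", "use")          # consonant sound, vowel spelling


def naive_referenced(word):
    """Prefix `word` with 'a' or 'an' using a simplified sound-based rule.

    Hypothetical helper for illustration, not Pattern's implementation.
    """
    lower = word.lower()
    if lower.startswith(SILENT_H_PREFIXES):
        article = "an"
    elif lower.startswith(CONSONANT_SOUND_PREFIXES):
        article = "a"
    elif lower and lower[0] in VOWELS:
        article = "an"
    else:
        article = "a"
    return article + " " + word


for sample in ("book", "hour", "European", "apple"):
    print(naive_referenced(sample))
```

A spelling-only rule would get "hour" and "European" wrong, which is why a library function with a larger set of rules is preferable to a hand-rolled check like this one.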

4) Word Singularize

>>> from pattern.en import singularize
>>> singularize("books")
>>> singularize("wolves")

5) Word Pluralize

>>> from pattern.en import pluralize
>>> pluralize("wolf")
>>> pluralize("book")

6) Word Comparative

>>> from pattern.en import comparative
>>> comparative('bad')
>>> comparative('good')

7) Word Superlative

>>> from pattern.en import superlative
>>> superlative('bad')
>>> superlative('good')

8) Sentiment Analysis

Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader).

The pattern.en sentiment() function returns a (polarity, subjectivity)-tuple for the given sentence, based on the adjectives it contains, where polarity is a value between -1.0 and +1.0 and subjectivity between 0.0 and 1.0.

>>> from pattern.en import sentiment
>>> sentiment("iPhone 5 is best smartphone in the world")
(1.0, 0.3)
>>> sentiment("the car is bad")
(-0.6999999999999998, 0.6666666666666666)
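
In practice the raw tuple is often reduced to a coarse label. The sketch below shows one way to do that using the two tuples above as input, so it runs without Pattern. The function name label_sentiment and both cutoffs (0.1 for polarity, 0.5 for subjectivity) are assumptions picked for illustration and should be tuned for your own data.

```python
def label_sentiment(polarity, subjectivity, threshold=0.1):
    """Map a (polarity, subjectivity) pair to coarse labels.

    Hypothetical helper for illustration; both cutoffs are assumptions.
    """
    if polarity > threshold:
        label = "positive"
    elif polarity < -threshold:
        label = "negative"
    else:
        label = "neutral"
    # Treat the midpoint of the 0.0 to 1.0 range as the subjective cutoff.
    kind = "subjective" if subjectivity >= 0.5 else "objective"
    return label, kind


# Tuples copied from the sentiment() results shown above.
iphone = label_sentiment(1.0, 0.3)
car = label_sentiment(-0.6999999999999998, 0.6666666666666666)
print(iphone)
print(car)
```

Keeping a small neutral band around zero avoids labelling near-zero scores as positive or negative on the strength of rounding noise.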

9) Modality

>>> from pattern.en import parse, Sentence
>>> from pattern.en import modality
>>> s = "Some amino acids tend to be acidic while others may be basic." # weaseling
>>> s = parse(s, lemmata=True)
>>> s = Sentence(s)
>>> modality(s)

10) Parse

>>> from pattern.en import parse
>>> parse('Parsing or syntactic analysis is the process of analysing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).')
u'Parsing/VBG/B-VP/O or/CC/O/O syntactic/JJ/B-NP/O analysis/NN/I-NP/O is/VBZ/B-VP/O the/DT/B-NP/O process/NN/I-NP/O of/IN/B-PP/B-PNP analysing/VBG/B-VP/I-PNP a/DT/B-NP/I-PNP string/NN/I-NP/I-PNP of/IN/B-PP/B-PNP symbols/NNS/B-NP/I-PNP ,/,/O/O either/DT/O/O in/IN/B-PP/B-PNP natural/JJ/B-NP/I-PNP language/NN/I-NP/I-PNP or/CC/O/O in/IN/B-PP/B-PNP computer/NN/B-NP/I-PNP languages/NNS/I-NP/I-PNP ,/,/O/O according/VBG/B-VP/O to/TO/O/O the/DT/B-NP/O rules/NNS/I-NP/O of/IN/B-PP/B-PNP a/DT/B-NP/I-PNP formal/JJ/I-NP/I-PNP grammar/NN/I-NP/I-PNP ././O/O\nThe/DT/B-NP/O term/NN/I-NP/O parsing/VBG/B-VP/O comes/VBZ/I-VP/O from/IN/B-PP/B-PNP Latin/JJ/B-NP/I-PNP pars/NNS/I-NP/I-PNP (/(/O/O orationis/JJ/B-ADJP/O )/)/O/O ,/,/O/O meaning/VBG/B-VP/O part/NN/B-NP/O (/(/O/O of/IN/B-PP/B-PNP speech/NN/B-NP/I-PNP )/)/O/O ././O/O'
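
In the output above, each token carries four slash-separated fields (word, part-of-speech tag, chunk tag, and prepositional noun phrase tag), tokens are separated by spaces, and sentences by a newline. The sketch below unpacks one sentence of that format by hand, assuming exactly this four-field layout, and uses tokens copied from the output as sample data so that it runs without Pattern. The helper split_tagged is a hypothetical name, not a Pattern function.

```python
# Sample data: the first six tokens of the second sentence in the output above.
tagged_sentence = (
    "The/DT/B-NP/O term/NN/I-NP/O parsing/VBG/B-VP/O "
    "comes/VBZ/I-VP/O from/IN/B-PP/B-PNP Latin/JJ/B-NP/I-PNP"
)


def split_tagged(sentence):
    """Turn 'word/POS/chunk/PNP' tokens into (word, pos, chunk, pnp) tuples.

    Hypothetical helper for illustration; assumes four fields per token.
    """
    tokens = []
    for token in sentence.split(" "):
        # Split from the right so the last three slashes delimit the tags.
        word, pos, chunk, pnp = token.rsplit("/", 3)
        tokens.append((word, pos, chunk, pnp))
    return tokens


tokens = split_tagged(tagged_sentence)
# Keep the words that sit inside a noun phrase chunk.
noun_phrase_words = [word for word, pos, chunk, pnp in tokens
                     if chunk in ("B-NP", "I-NP")]
print(tokens)
print(noun_phrase_words)
```

For multi-sentence output, split the string on the newline character first and apply the helper to each sentence.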

We have covered only ten text processing methods here; you can find the others in the official Pattern guide. If you want to use Pattern from other programming languages such as Java, PHP, or Ruby, we strongly recommend trying our Text Analysis API on Mashape.

Posted by TextMiner


Getting Started with Pattern — 1 Comment

  1. I tried the Sentiment, just wanted to note to those who (like me) run into errors: the quotes around the sentence should be ‘…’ and not “…”. This took me the better part of an hour to find out, hope it helps someone!
