Recently I surveyed some keyword extraction tools, papers, and documents, and I am recording them here as a starting point for keyword extraction. According to Wikipedia, keyword extraction is defined like this:
Keyword extraction is tasked with the automatic identification of terms that best describe the subject of a document.
Key phrases, key terms, key segments or just keywords are the terminology used for the terms that represent the most relevant information contained in a document. Although the terminology differs, the function is the same: characterizing the topic discussed in a document. Keyword extraction is an important problem in Text Mining, Information Retrieval and Natural Language Processing.
1. RAKE (A python implementation of the Rapid Automatic Keyword Extraction)
I started with RAKE, a Python implementation of Rapid Automatic Keyword Extraction, following the document “NLP keyword extraction tutorial with RAKE and Maui”. As the document says:
A typical keyword extraction algorithm has three main components:
- Candidate selection: Here, we extract all possible words, phrases, terms or concepts (depending on the task) that can potentially be keywords.
- Properties calculation: For each candidate, we need to calculate properties that indicate that it may be a keyword. For example, a candidate appearing in the title of a book is a likely keyword.
- Scoring and selecting keywords: All candidates can be scored by either combining the properties into a formula, or using a machine learning technique to determine probability of a candidate being a keyword. A score or probability threshold, or a limit on the number of keywords is then used to select the final set of keywords.
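The three components above can be sketched in a few lines of Python 3. This is only a toy illustration, not RAKE's or Maui's code: the function names, the tiny stopword list, and the "frequency plus title bonus" formula are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real systems load a fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def select_candidates(text, max_words=3):
    """Candidate selection: every n-gram of up to max_words words that
    neither starts nor ends with a stopword. Sentence boundaries are
    ignored to keep the sketch short."""
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = []
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            candidates.append(" ".join(gram))
    return candidates


def calculate_properties(candidates, title):
    """Properties calculation: frequency of each distinct candidate and
    whether it occurs (as a substring) in the title."""
    counts = Counter(candidates)
    title_lower = title.lower()
    return {
        cand: {"frequency": freq, "in_title": cand in title_lower}
        for cand, freq in counts.items()
    }


def score_and_select(properties, top_n=5, title_bonus=2.0):
    """Scoring and selection: combine the properties in a hand-written
    formula and keep the top_n candidates."""
    scored = {
        cand: props["frequency"] + (title_bonus if props["in_title"] else 0.0)
        for cand, props in properties.items()
    }
    ranked = sorted(scored.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

To try it, pass the output of select_candidates(body) and a title to calculate_properties, then hand the result to score_and_select. A supervised system would replace the formula in score_and_select with a trained model's probability.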
RAKE follows these three steps closely and has a well-structured design for keyword extraction. Following the example in the RAKE tutorial, I tested RAKE step by step in my Mac OS environment:
git clone https://github.com/zelandiya/RAKE-tutorial
Then launch IPython in the RAKE-tutorial directory and test it:
Python 2.7.6 (default, Jun 3 2014, 07:43:23)
Type "copyright", "credits" or "license" for more information.

IPython 3.1.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

In : import rake
In : import operator
In : rake_object = rake.Rake("SmartStoplist.txt", 3, 3, 1)
In : text = "Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models. In this course you will study mathematical and computational models of language, and the application of these models to key problems in natural language processing. The course has a focus on machine learning methods, which are widely used in modern NLP systems: we will cover formalisms such as hidden Markov models, probabilistic context-free grammars, log-linear models, and statistical models for machine translation. The curriculum closely follows a course currently taught by Professor Collins at Columbia University, and previously taught at MIT."
In : keywords = rake_object.run(text)
In : print "keywords: ", keywords
keywords: [('nlp include automatic', 8.25), ('transform unstructured text', 8.0), ('hidden markov models', 8.0), ('structure formal models', 8.0), ('natural language phenomena', 7.916666666666666), ('natural language processing', 7.916666666666666), ('modern nlp systems', 7.75), ('machine learning methods', 7.75), ('natural language', 4.916666666666666), ('dialogue systems', 4.5), ('nlp technologies', 4.25), ('study mathematical', 4.0), ('information extraction', 4.0), ('electronic form', 4.0), ('vast amount', 4.0), ('speech data', 4.0), ('scientific viewpoint', 4.0), ('columbia university', 4.0), ('free grammars', 4.0), ('cover formalisms', 4.0), ('dramatic impact', 4.0), ('design algorithms', 4.0), ('flexible ways', 4.0), ('key problems', 4.0), ('linguistic data', 4.0), ('probabilistic context', 4.0), ('people access', 4.0), ('linear models', 4.0), ('curriculum closely', 4.0), ('professor collins', 4.0), ('statistical models', 4.0), ('computational models', 4.0), ('people interact', 3.666666666666667), ('previously taught', 3.5), ('application areas', 3.333333333333333), ('machine translation', 3.25), ('nlp', 2.25), ('language', 2.1666666666666665), ('text', 2.0), ('models', 2.0), ('machine', 1.75), ('interact', 1.6666666666666667), ('taught', 1.5), ('translation', 1.5), ('application', 1.3333333333333333), ('focus', 1.0), ('human', 1.0), ('goal', 1.0), ('structured', 1.0), ('languages', 1.0), ('widely', 1.0), ('mit', 1.0), ('log', 1.0), ('representations', 1.0), ('database', 1.0), ('browsed', 1.0), ('computers', 1.0), ('deals', 1.0), ('searched', 1.0), ('implement', 1.0)]
Here I created rake_object with the following parameters for my initial test:
rake_object = rake.Rake("SmartStoplist.txt", 3, 3, 1)
The first argument is the path to the stopword list. The three numbers mean:
- Each word has at least 3 characters
- Each phrase has at most 3 words
- Each keyword appears in the text at least 1 time
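To make the three thresholds concrete, here is a small Python 3 sketch that applies them to a list of candidate phrases. It is not RAKE's internal code; the function name filter_candidates and its parameter names are made up for this illustration, and RAKE may apply the limits at a different stage.

```python
from collections import Counter


def filter_candidates(phrases, min_chars=3, max_words=3, min_frequency=1):
    """Keep the distinct phrases that pass all three thresholds.

    phrases is a list of candidate phrases, with repeats for phrases
    that occur more than once in the text.
    """
    counts = Counter(phrases)
    kept = []
    for phrase, freq in counts.items():
        words = phrase.split()
        if not words:
            continue
        if any(len(word) < min_chars for word in words):
            continue  # at least one word is too short
        if len(words) > max_words:
            continue  # the phrase has too many words
        if freq < min_frequency:
            continue  # the phrase is too rare in this text
        kept.append(phrase)
    return kept
```

With min_frequency=1 every phrase that passes the length checks survives, which is why the first run returns so many candidates; raising it to 2 keeps only repeated phrases.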
The test text comes from the introduction to the Coursera Natural Language Processing course. With these settings, rake_object returns the keyword list printed at the end of the session above: 60 scored candidates, from ('nlp include automatic', 8.25) down to single words with a score of 1.0 such as ('implement', 1.0).
The result does not look good for this document, so I changed the parameters as follows and got another result:
# Each keyword appears in the text at least 2 times
In : rake_object = rake.Rake("SmartStoplist.txt", 3, 3, 2)
In : keywords = rake_object.run(text)
In : print "keywords: ", keywords
keywords: [('natural language processing', 7.916666666666666), ('statistical models', 4.0), ('computational models', 4.0), ('people interact', 3.666666666666667), ('language', 2.1666666666666665), ('models', 2.0), ('machine', 1.75), ('application', 1.3333333333333333)]
The key point when using RAKE is the parameter setting, and RAKE provides a method for selecting suitable parameters based on training data. As the tutorial summarizes, RAKE is a very easy way to get started with keyword extraction, but it seems to lack some things:
To summarize, RAKE is a simple keyword extraction library which focuses on finding multi-word phrases containing frequent words. Its strengths are its simplicity and the ease of use, whereas its weaknesses are its limited accuracy, the parameter configuration requirement, and the fact that it throws away many valid phrases and doesn’t normalize candidates.
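One way to handle the parameter configuration requirement is a plain grid search against documents that already have gold keywords. The Python 3 sketch below is a generic illustration, not RAKE's own tuning code: extract is a hypothetical callable supplied by the caller that wraps whatever extractor is being tuned, and the mean F1 measure is an assumption about what "best" means.

```python
from itertools import product


def f1_score(predicted, gold):
    """F1 between two collections of keyphrases, compared as exact strings."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def grid_search(extract, documents, grid):
    """Return (best_params, best_mean_f1) over every combination in grid.

    extract   -- hypothetical callable: extract(text, **params) -> list of phrases
    documents -- list of (text, gold_keywords) pairs used as training data
    grid      -- dict mapping a parameter name to the list of values to try
    """
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [f1_score(extract(text, **params), gold) for text, gold in documents]
        mean = sum(scores) / len(scores) if scores else 0.0
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score
```

Exact string matching is strict; because RAKE does not normalize candidates, a real evaluation would probably lowercase or stem both sides before comparing.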
Related Paper: Automatic keyword extraction from individual documents
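As I understand the paper, RAKE splits the text into candidate phrases at stopwords and punctuation, scores each word by degree divided by frequency, and scores a phrase as the sum of its word scores. The Python 3 sketch below follows that reading; the function names, the simple tokenizer, and the punctuation set are my own assumptions, so its numbers will not match the rake module exactly.

```python
import re
from collections import defaultdict


def candidate_phrases(text, stopwords):
    """Split at punctuation, then at stopwords; return lists of content words."""
    phrases = []
    for fragment in re.split(r"[.!?,;:()]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9]+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def word_scores(phrases):
    """Score each word as degree / frequency.

    The degree of a word is the summed length of the phrases it occurs in,
    so words that tend to sit inside long phrases score higher.
    """
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)
    return {word: degree[word] / frequency[word] for word in frequency}


def rake_scores(text, stopwords):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text, stopwords)
    scores = word_scores(phrases)
    ranked = {" ".join(p): sum(scores[w] for w in p) for p in phrases}
    return sorted(ranked.items(), key=lambda item: (-item[1], item[0]))
```

This also shows why long phrases dominate the RAKE output: every extra word adds its own score, and each word's degree grows with the phrase length.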
This article reimplements the RAKE keyword extraction algorithm on top of NLTK, for example by using NLTK's sent_tokenize in place of RAKE's original sentence splitter, and it gets the same results as RAKE.
3. Intro to Automatic Keyphrase Extraction by Burton DeWilde
I strongly recommend this document to anyone getting started with keyword or keyphrase extraction. It introduces keyword extraction step by step, dividing it into candidate identification and keyphrase selection, with both unsupervised and supervised methods and Python code examples. Running its test code on the same Coursera text, I got these top keyphrases from the score_keyphrases_by_tfidf method:
way people interact 0.269048146641
application areas 0.13452407332
application of computational models 0.13452407332
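For reference, TF-IDF ranking of candidate phrases can be sketched with the standard library alone. This Python 3 sketch is independent of the article's score_keyphrases_by_tfidf function; the name tfidf_scores and the plain log(N / df) weighting are assumptions, so the scores will differ from the ones above.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Score candidate phrases by TF-IDF.

    documents -- list of candidate-phrase lists, one list per document
    Returns one {phrase: score} dict per document, in the same order.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each phrase appears.
    doc_freq = Counter()
    for phrases in documents:
        doc_freq.update(set(phrases))

    results = []
    for phrases in documents:
        counts = Counter(phrases)
        total = sum(counts.values())
        results.append({
            phrase: (count / total) * math.log(n_docs / doc_freq[phrase])
            for phrase, count in counts.items()
        })
    return results
```

With this weighting a phrase that appears in every document scores zero, so the method needs a background corpus of other texts next to the one being analyzed.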
This is the result of the TextRank method, score_keyphrases_by_textrank:
[('models', 0.0619816545693813), ('nlp', 0.04454783455509914), ('language', 0.0334375900800485), ('machine', 0.029867774676152762), ('course', 0.026254400322149735), ('application', 0.024863645805797824)]
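The idea behind TextRank is to build a graph of words that occur near each other and run PageRank on it. The Python 3 sketch below is a minimal version of that idea and not the article's score_keyphrases_by_textrank; the window size, damping factor, and fixed iteration count are assumed defaults, and the input is expected to be a list of already filtered words.

```python
from collections import defaultdict


def build_graph(words, window=2):
    """Undirected co-occurrence graph: link each word to the next
    window - 1 words that follow it in the text."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def textrank(words, window=2, damping=0.85, iterations=50):
    """Rank words by an unnormalized PageRank over the co-occurrence graph."""
    graph = build_graph(words, window)
    ranks = {word: 1.0 for word in graph}
    for _ in range(iterations):
        new_ranks = {}
        for word in graph:
            # Each neighbor shares its rank equally among its own neighbors.
            incoming = sum(ranks[nb] / len(graph[nb]) for nb in graph[word])
            new_ranks[word] = (1 - damping) + damping * incoming
        ranks = new_ranks
    return sorted(ranks.items(), key=lambda item: (-item[1], item[0]))
```

Words that co-occur with many different neighbors, such as "models" in the course text, end up with the highest rank, which matches the shape of the result above.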
For me, though, the most interesting method is supervised keyword extraction. I will train a model for keyword extraction on course texts later.
4. Automatic Keyphrase Extraction Data
Keyword and keyphrase extraction data is very valuable. Following a link from “Intro to Automatic Keyphrase Extraction”, I found the AutomaticKeyphraseExtraction data on GitHub. The following is the description of the data:
This repository contains the datasets for automatic keyphrase extraction task.
* 500N-KPCrowd.zip data from Marujo:LREC2012 (News articles annotated using AMT)
* 110-PT-BN-KP.zip data from Marujo:Interspeech2011 (non-English AKE corpus)
* MAUI.tar.gz data from University of Waikato (KEA, MAUI systems)
* Wan2008.tar.gz data from Wan:2008
* Schutz2008.tar.gz data from Schutz:2008 (only answer sets and readme are provided. the papers are available at ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz)
* Nguyen2007.zip data from Nguyen:2007
* Hulth2003.tar.gz data from Hulth:2003
No survey of keyword or keyphrase extraction can ignore the KEA algorithm:
Keywords and keyphrases (multi-word units) are widely used in large document collections. They describe the content of single documents and provide a kind of semantic metadata that is useful for a wide variety of purposes. The task of assigning keyphrases to a document is called keyphrase indexing. For example, academic papers are often accompanied by a set of keyphrases freely chosen by the author. In libraries professional indexers select keyphrases from a controlled vocabulary (also called Subject Headings) according to defined cataloguing rules. On the Internet, digital libraries, or any depositories of data (flickr, del.icio.us, blog articles etc.) also use keyphrases (or here called content tags or content labels) to organize and provide a thematic access to their data.
KEA is an algorithm for extracting keyphrases from text documents. It can be either used for free indexing or for indexing with a controlled vocabulary.
KEA is implemented in Java and is platform independent. It is an open-source software distributed under the GNU General Public License.
Following Kea-5.0-Readme.txt, I compiled KEA and tested it with an additional file, nlp.txt, placed in $KEAHome/testdocs/en/test, which contains the test text from the Coursera NLP course. Running “java -Xmx526M TestKea” printed:
Creating the model...
-- Loading the Index...
-- Building the Vocabulary index from SKOS file
-- Reading the Documents...
Extracting keyphrases from test documents...
-- Loading the Index...
-- Building the Vocabulary index from SKOS file
-- Extracting Keyphrases...
I got an nlp.key file which contains the extracted keywords for nlp.txt:
The result does not look good, but it gave me a very useful hint about how to extract keywords from course text.
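To compare KEA's output with other extractors, the .key file can be loaded into a Python list. This Python 3 sketch assumes the file is UTF-8 text with one keyphrase per line, which I have not verified for every KEA configuration; read_key_file is a helper name made up for this example.

```python
from pathlib import Path


def read_key_file(path):
    """Return the non-empty lines of a .key file as a list of keyphrases.

    Assumes one keyphrase per line (an assumption about the file format).
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, read_key_file("nlp.key") would return the phrases KEA assigned to nlp.txt, ready to be compared with the RAKE or TextRank results.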
6. kea-service: KEA 5.0 (keyphrase extraction software), modified to be an XML-RPC service
This is an XML-RPC service for KEA. With the kea-service server running, you can communicate with KEA from code in any language that has an XML-RPC client.
A related post by the author also provides a general introduction: KEA KEYPHRASE EXTRACTION AS AN XML-RPC SERVICE (CODE RELEASE)
This package determines important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) and some simple statistical analysis to determine the terms and their strength.
8. tagger: A Python module for extracting relevant tags from text documents.
Here is the document through which I found tagger and topia.termextract: 3 Open Source Tools for Auto-Generating Tags for Content
9. Reference Papers
1) Automatic Keyphrase Extraction: A Survey of the State of the Art
2) SGRank: Combining Statistical and Graphical Methods to Improve the State of the Art in Unsupervised Keyphrase Extraction
3) “Without the Clutter of Unimportant Words”: Descriptive Keyphrases for Text Visualization
4) Automatic glossary extraction: beyond terminology identification
5) Unsupervised Approaches for Automatic Keyword Extraction Using Meeting Transcripts
6) A Ranking Approach to Keyphrase Extraction
7) Finding Advertising Keywords on Web Pages