HomeText MiningGetting Started with Text Processing or Natural Language Processing

Text Processing is the one of the most common tasks in the world, this article will focus on the natural language text processing in the computer, which commonly referred to as NLP. According to the wikipedia, Text processing is defined like this:

In computing, the term text processing refers to the discipline of mechanizing the creation or manipulation of electronic text. Text usually refers to all the alphanumeric characters specified on the keyboard of the person performing the mechanization, but in general text here means the abstraction layer that is one layer above the standard character encoding of the target text. The term processing refers to automated (or mechanized) processing, as opposed to the same manipulation done manually.

And the natural language text processing can be referenced to one field of computer science: Natural Language Processing (NLP), which composition by artificial intelligence, machine learning, linguistics and etc. Seems very difficult, and how to getting started with text processing or natural language processing? Here is the recommend list we carefully selected, including books, open courses, open source tools and other learning resources.

Text Processing Books

1. Speech and Language Processing, 2nd Edition

An explosion of Web-based language techniques, merging of distinct fields, availability of phone-based dialogue systems, and much more make this an exciting time in speech and language processing. The first of its kind to thoroughly cover language technology – at all levels and with all modern technologies – this book takes an empirical approach to the subject, based on applying statistical and other machine-learning algorithms to large corporations. Builds each chapter around one or more worked examples demonstrating the main idea of the chapter, usingthe examples to illustrate the relative strengths and weaknesses of various approaches. Adds coverage of statistical sequence labeling, information extraction, question answering and summarization, advanced topics in speech recognition, speech synthesis. Revises coverage of language modeling, formal grammars, statistical parsing, machine translation, and dialog processing. A useful reference for professionals in any of the areas of speech and language processing.

2. Foundations of Statistical Natural Language Processing

Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.

3. Natural Language Processing with Python

This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation. With it, you’ll learn how to write Python programs that work with large collections of unstructured text. You’ll access richly annotated datasets using a comprehensive range of linguistic data structures, and you’ll understand the main algorithms for analyzing the content and structure of written communication.

4. Python Text Processing with NLTK 2.0 Cookbook

Use Python’s NLTK suite of libraries to maximize your Natural Language Processing capabilities. Quickly get to grips with Natural Language Processing ? with Text Analysis, Text Mining, and beyond. Learn how machines and crawlers interpret and process natural languages. Easily work with huge amounts of data and learn how to handle distributed processing

5. Text Processing in Python

Text Processing in Python describes techniques for manipulation of text using the Python programming language. At the broadest level, text processing is simply taking textual information and doing something with it. This might be restructuring or reformatting it, extracting smaller bits of information from it, or performing calculations that depend on the text. Text processing is arguably what most programmers spend most of their time doing. Because Python is clear, expressive, and object-oriented it is a perfect language for doing text processing, even better than Perl. As the amount of data everywhere continues to increase, this is more and more of a challenge for programmers. This book is not a tutorial on Python. It has two other goals: helping the programmer get the job done pragmatically and efficiently; and giving the reader an understanding – both theoretically and conceptually – of why what works works and what doesn’t work doesn’t work. Mertz provides practical pointers and tips that emphasize efficent, flexible, and maintainable approaches to the textprocessing tasks that working programmers face daily.

Open Courses related with Text Processing

1. Natural Language Processing by Dan Jurafsky and Christopher Manning (Stanford University at Coursera)

This course covers a broad range of topics in natural language processing, including word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, parsing, meaning extraction, and question answering, We will also introduce the underlying theory from probability, statistics, and machine learning that are crucial for the field, and cover fundamental algorithms like n-gram language modeling, naive bayes and maxent classifiers, sequence models like Hidden Markov Models, probabilistic dependency and constituent parsing, and vector-space models of meaning.

We are offering this course on Natural Language Processing free and online to students worldwide, continuing Stanford’s exciting forays into large scale online instruction. Students have access to screencast lecture videos, are given quiz questions, assignments and exams, receive regular feedback on progress, and can participate in a discussion forum. Those who successfully complete the course will receive a statement of accomplishment. Taught by Professors Jurafsky and Manning, the curriculum draws from Stanford’s courses in Natural Language Processing. You will need a decent internet connection for accessing course materials, but should be able to watch the videos on your smartphone.

2. Natural Language Processing by Michael Collins ( Columbia University at Coursera)

Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.

In this course you will study mathematical and computational models of language, and the application of these models to key problems in natural language processing. The course has a focus on machine learning methods, which are widely used in modern NLP systems: we will cover formalisms such as hidden Markov models, probabilistic context-free grammars, log-linear models, and statistical models for machine translation. The curriculum closely follows a course currently taught by Professor Collins at Columbia University, and previously taught at MIT.

3. Machine Learning by Andrew Ng (Stanford University at Coursera)

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you’ll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you’ll learn about some of Silicon Valley’s best practices in innovation as it pertains to machine learning and AI.

This course provides a broad introduction to machine learning, datamining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

4. Learning From Data by Yaser S. Abu-Mostafa (CaltechX at edX)

Introductory Machine Learning course covering theory, algorithms and applications. Our focus is on real understanding, not just “knowing.” This is an introductory course in machine learning (ML) that covers the basic theory, algorithms, and applications. ML is a key technology in Big Data, and in many financial, medical, commercial, and scientific applications. It enables computational systems to automatically learn how to perform a desired task based on information extracted from the data. ML has become one of the hottest fields of study today, taken up by undergraduate and graduate students from 15 different majors at Caltech. This course balances theory and practice, and covers the mathematical as well as the heuristic aspects. The lectures follow each other in a story-like fashion:

What is learning?
Can a machine learn?
How to do it?
How to do it well?
Take-home lessons.

5. Intro to Artificial Intelligence by Peter Norvig and Sebastian Thrun (Udacity)

The objective of this class is to teach you modern AI. You will learn about the basic techniques and tricks of the trade. We also aspire to excite you about the field of AI.

Overview of AI

Statistics, Uncertainty, and Bayes networks

Machine Learning

Logic and Planning

Markov Decision Processes and Reinforcement Learning

Hidden Markov Models and Filters

Adversarial and Advanced Planning

Image Processing and Computer Vision

Robotics and robot motion planning

Natural Language Processing and Information Retrieval

Open Source Tools or Projects related with Text Processing

1. NLTK: Natural Language Toolkit

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

2. TextBlob: Simplified Text Processing

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

3. Pattern

Pattern is a web mining module for the Python programming language.

It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and canvas visualization.

4. MBSP for Python

MBSP is a text analysis system based on the TiMBL and MBT memory based learning applications developed at CLiPS and ILK. It provides tools for Tokenization and Sentence Splitting, Part of Speech Tagging, Chunking, Lemmatization, Relation Finding and Prepositional Phrase Attachment.

5. Apache OpenNLP

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.

It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.

Posted by TextMiner


Getting Started with Text Processing or Natural Language Processing — No Comments

Leave a Reply

Your email address will not be published. Required fields are marked *