
Natural Language Processing

From PerlNet


Natural Language Processing is a branch of artificial intelligence that deals with analyzing, understanding and generating the languages that humans use naturally. It is also referred to as computational linguistics. Perl's strength in text processing makes it an excellent candidate for use in this field. All of the modules on CPAN are open source, so if you want to study how this class of AI algorithms is implemented, CPAN is a good place to look too.


Regular Expressions

Perl is famous for its embedded regular expressions, which provide flexible and powerful text processing abilities. Modules like Regexp::Common save us from having to reinvent the wheel for common types of text. However, once you start trying to extract meaning from written text, regular expressions alone are not what you want.
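As a small illustration of the kind of task regular expressions handle well, the sketch below pulls ISO-style dates out of free text using named captures. The sub name extract_dates and the sample sentence are invented for this example, and it relies only on core Perl (5.10 or later).

```perl
use strict;
use warnings;

# Hypothetical helper: return every YYYY-MM-DD date found in a string
# as a hash reference with year, month and day keys.
sub extract_dates {
    my ($text) = @_;
    my @dates;
    while ($text =~ /\b(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})\b/g) {
        push @dates, { year => $+{year}, month => $+{month}, day => $+{day} };
    }
    return @dates;
}

my @found = extract_dates("Released 2005-07-01, patched 2006-01-15.");
print "$_->{year}/$_->{month}/$_->{day}\n" for @found;
```

The pattern recognizes the shape of a date, but it has no idea whether the text says something was released, cancelled or merely planned on that day; that is where pattern matching stops being enough.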

Bare AI

Email spam filtering is a common AI problem that has seen some success in recent years and is well implemented in software whose source code is available. Here the computer's job is to classify each message as spam (email the user doesn't want) or ham (email the user wants to read). Bayesian statistics has been one of the most successful approaches to this classification problem and has been implemented in Perl by Ken Williams from Sydney University. The AI::Categorizer namespace provides Perl libraries for Bayesian and other computer-based classification schemes.
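To make the Bayesian idea concrete, here is a minimal hand-rolled naive Bayes sketch with add-one smoothing. It is not the AI::Categorizer API: the subs tokenize, train and classify and the four training messages are assumptions made up for illustration, and only core Perl is used.

```perl
use strict;
use warnings;

# Toy naive Bayes classifier with add-one (Laplace) smoothing.
# $count{$label}{$word} holds word frequencies per label,
# $total{$label} the number of word tokens seen for that label,
# $docs{$label} the number of training messages for that label.
my (%count, %total, %docs, %vocab);

# Lowercase the text and split it into word tokens.
sub tokenize {
    my ($text) = @_;
    return grep { length($_) } split /\W+/, lc $text;
}

sub train {
    my ($label, $text) = @_;
    $docs{$label}++;
    for my $w (tokenize($text)) {
        $count{$label}{$w}++;
        $total{$label}++;
        $vocab{$w} = 1;
    }
}

sub classify {
    my ($text) = @_;
    my $n_docs = 0;
    $n_docs += $_ for values %docs;
    my $v = keys %vocab;    # vocabulary size
    my ($best, $best_score);
    for my $label (sort keys %docs) {
        # log prior plus the sum of smoothed log likelihoods
        my $score = log($docs{$label} / $n_docs);
        for my $w (tokenize($text)) {
            my $c = $count{$label}{$w} // 0;
            $score += log(($c + 1) / ($total{$label} + $v));
        }
        ($best, $best_score) = ($label, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

train(spam => "cheap pills buy now");
train(spam => "win money now");
train(ham  => "meeting agenda for monday");
train(ham  => "lunch on monday");

print classify("buy cheap money"), "\n";
print classify("monday meeting"), "\n";
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen word from driving a label's probability to zero. A real filter would train on thousands of messages and use better tokenization.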

Producing Text

The Lingua namespace on CPAN provides a few libraries for helping computer programs to produce grammatically correct text. Lingua::EN::Inflect handles English inflection, such as forming the plural of a noun. Here's a particularly neat example from Advanced Perl Programming by Simon Cozens:

   perl -MLingua::EN::Inflect=PL -le 'print "There are 2 ", PL("aide-de-camp")'

which outputs "There are 2 aides-de-camp".


Two other modules in this domain are Lingua::EN::Words2Nums and Lingua::EN::Nums2Words, which convert between numbers and their English word forms.

Comprehending Text

Languages are complex, and English especially so (I brought the fork which I bought ... ask any English teacher). Before a computer can start to extract semantic features, text needs to be split in a way that reflects the underlying grammar. Lingua::EN::Sentence does this for sentences, and is the simplest algorithm of this kind. Once you have a sentence, you have to find out what's in it. Lingua::EN::Tagger, Lingua::Stem::En, Lingua::EN::Splitter, Lingua::EN::StopWords and others can all help with this job. However, this work is fairly complex and it can take quite a long time to get things working properly. In addition, good language tools are commercially valuable, and as a result Open Source implementations of these kinds of algorithms are relatively undeveloped compared to their commercial counterparts.
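The pipeline described above (split into sentences, split into words, discard stopwords) can be sketched in core Perl as follows. These subs are deliberately crude stand-ins written for this example, not the interfaces of Lingua::EN::Sentence, Lingua::EN::Splitter or Lingua::EN::StopWords; the stopword list and the names split_sentences and content_words are assumptions.

```perl
use strict;
use warnings;

# Tiny illustrative stopword list; real lists are much longer.
my %stop = map { $_ => 1 } qw(the a an is are was of and to in it);

# Naive sentence splitter: break after ., ! or ? followed by whitespace.
# It will wrongly split on abbreviations such as "Dr. Smith".
sub split_sentences {
    my ($text) = @_;
    return grep { length($_) } split /(?<=[.!?])\s+/, $text;
}

# Lowercase, tokenize on non-word characters, drop stopwords.
sub content_words {
    my ($sentence) = @_;
    return grep { length($_) && !$stop{$_} } split /\W+/, lc $sentence;
}

my $text = "The cat sat in the garden. It was chasing a bird!";
for my $s (split_sentences($text)) {
    print join(" ", content_words($s)), "\n";
}
```

The abbreviation problem noted in the comment is exactly the sort of edge case the dedicated modules exist to handle, which is why a hand-written splitter like this one rarely survives contact with real text.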

Named Entity Extraction

Need to find proper nouns in text? Use Lingua::EN::NamedEntity. Avoid Name::Find unless you can live with a lot of false negatives.


Keyword/Summary Extraction

Lingua::EN::Summarize and Lingua::EN::Splitter are good starting points here, though your mileage may vary.
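For a feel of what extractive summarization involves, the sketch below scores each sentence by how frequent its words are across the whole text and returns the top scorer. This is a simple frequency heuristic chosen for illustration, not necessarily how Lingua::EN::Summarize works; the sub name best_sentence and the sample text are assumptions, and only core Perl is used.

```perl
use strict;
use warnings;

# Score each sentence by the summed document-wide frequency of its words,
# then return the highest-scoring sentence. Words of three letters or
# fewer are ignored when counting, as a cheap substitute for a stopword list.
sub best_sentence {
    my ($text) = @_;
    my @sentences = grep { length($_) } split /(?<=[.!?])\s+/, $text;

    my %freq;
    $freq{$_}++ for grep { length($_) > 3 } split /\W+/, lc $text;

    my ($best, $best_score) = ('', -1);
    for my $s (@sentences) {
        my $score = 0;
        $score += $freq{$_} // 0 for split /\W+/, lc $s;
        ($best, $best_score) = ($s, $score) if $score > $best_score;
    }
    return $best;
}

my $doc = "Perl handles text well. "
        . "Perl modules on CPAN cover tagging and stemming of text. "
        . "Cats sleep.";
print best_sentence($doc), "\n";
```

Because scores are summed rather than averaged, this heuristic favours long sentences; dividing by sentence length is one common adjustment.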

Useful article resources

Links
