This page is dedicated to a general-purpose machine learning technique called Maximum Entropy Modeling (MaxEnt for short).
In his famous 1957 paper, E. T. Jaynes wrote:
Information theory provides a constructive criterion for setting up probability
distributions on the basis of partial knowledge, and leads to a type of
statistical inference which is called the maximum entropy estimate.
It is the least biased estimate possible on the given information; i.e., it is
maximally noncommittal with regard to missing information.
That is to say, when characterizing some unknown events with a statistical
model, we should always choose, among all models consistent with what we know, the one that has maximum entropy.
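As a small illustration of this principle, here is a minimal Python sketch of the classic loaded-die setting: among all distributions over the faces 1 to 6 that share a given mean, it finds the one with maximum entropy. The function names (`entropy`, `maxent_die`), the target mean of 4.5 and the comparison distribution are illustrative assumptions, not something taken from Jaynes' paper or from any MaxEnt toolkit.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6), tol=1e-12):
    """Maximum entropy distribution over `faces` with a fixed expected value.

    The solution has the exponential form p_i proportional to exp(lam * i).
    The mean increases with lam, so lam can be found by bisection.
    """
    def dist(lam):
        weights = [math.exp(lam * f) for f in faces]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(p):
        return sum(f * x for f, x in zip(faces, p))

    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(dist(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

# Hypothetical constraint: the average roll is known to be 4.5.
p = maxent_die(4.5)
# Another distribution with the same mean, but far more "committed".
skewed = [0.0, 0.0, 0.0, 0.5, 0.5, 0.0]

print("maxent distribution:", [round(x, 4) for x in p])
print("entropy (maxent):", entropy(p))
print("entropy (skewed):", entropy(skewed))
```

Both distributions satisfy the same mean constraint, but the maximum entropy one spreads probability over all six faces, while the skewed one rules out four faces without any evidence for doing so.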
Maximum Entropy Modeling has been successfully applied to Computer Vision, Spatial Physics, Natural Language Processing and many other fields. This page focuses on applying MaxEnt to Natural Language Processing (NLP).
The concept of Maximum Entropy can be traced back along multiple threads to Biblical times. However, not until the late 20th century did computers become powerful enough to handle complex problems with statistical modeling techniques like MaxEnt.
Maximum Entropy was first introduced to the NLP area by Berger et al. (1996) and Della Pietra et al. (1997). Since then, Maximum Entropy techniques (and the more general framework of Random Fields) have enjoyed intensive research in the NLP community.
Here is an (incomplete) list of tutorials and introductions to Maximum Entropy Modeling.
Here is an incomplete list of software found on the net that is related to Maximum Entropy Modeling.
A must-read paper on applying the maxent technique to Natural Language Processing. This paper describes maxent in detail and presents an incremental feature selection algorithm for building a maxent model step by step, as well as several examples in statistical machine translation.
Another must-read paper on maxent. It deals with a more general framework, Random Fields, and proposes an Improved Iterative Scaling (IIS) algorithm for estimating the parameters of Random Fields. This paper gives the theoretical background to Random Fields (and hence to maxent models). A greedy field induction method is presented to automatically construct a detailed random field from a set of atomic features. A word morphology application for English is developed. A longer version is also available.
This paper applies the ME technique to the statistical language modeling task. More specifically, it builds a conditional Maximum Entropy model that incorporates traditional N-gram, distant N-gram and trigger pair features. A significant perplexity reduction over the baseline trigram model was reported. Later, Rosenfeld and his group proposed a Whole Sentence Exponential Model that overcomes the computational bottleneck of the conditional ME model. You can find more on my SLM page.
This dissertation discusses in detail the application of maxent models to various natural language disambiguation tasks. Several problems are attacked within the ME framework: sentence boundary detection, part-of-speech tagging, shallow parsing and text categorization. Comparisons with other machine learning techniques (Naive Bayes, Transformation-Based Learning, Decision Trees, etc.) are given. Ratnaparkhi also wrote a short introductory paper on ME.
This paper describes the IIS algorithm in detail. The description is easier to understand than the one in (Della Pietra et al. 1997), which involves more mathematical notation.
Abney applies the Improved Iterative Scaling algorithm to parameter estimation for Attribute-Value grammars, whose parameters cannot be correctly estimated by the ERF (empirical relative frequency) method, even though that method works for PCFGs. Random Fields are the model of choice here, with general Metropolis-Hastings sampling used to calculate feature expectations under the newly constructed model.
Four iterative parameter estimation algorithms are compared on several NLP tasks. L-BFGS is observed to be the most effective parameter estimation method for Maximum Entropy models, much better than IIS and GIS. (Wallach 02) reported similar results for parameter estimation of Conditional Random Fields. Here is Malouf's Maximum Entropy Parameter Estimation software.
Claude Elwood Shannon's influential 1948 paper, which laid the foundation of information theory and has changed the world ever since. I see no reason why anyone who has read the above papers would not want to read this one.
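To make the parameter estimation algorithms mentioned in the annotations above more concrete, here is a minimal Python sketch of Generalized Iterative Scaling (GIS), the simplest of them, for a conditional maxent model p(y | x). It is not taken from any of the papers or software listed on this page. The names `train_gis` and `predict`, the toy weather data and the feature design (one binary feature per attribute and label pair) are all assumptions made for illustration. To keep the sketch short it also assumes that every context has the same number of attributes and that every (attribute, label) pair occurs in the data, so the features sum to a constant C and no correction feature is needed.

```python
import math
from collections import defaultdict

def predict(weights, labels, x):
    """Return p(y | x) for each label under the current weights."""
    scores = {y: math.exp(sum(weights[(a, y)] for a in x)) for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def train_gis(data, iterations=100):
    """Fit a conditional maxent model with Generalized Iterative Scaling.

    data: list of (attributes, label) pairs. Feature (a, y) fires when
    attribute a is present in the context and the label is y.
    """
    labels = sorted({y for _, y in data})
    C = len(data[0][0])  # number of active features for any (x, y)
    if any(len(x) != C for x, _ in data):
        raise ValueError("all contexts must have the same number of attributes")

    # Empirical feature counts from the training data.
    empirical = defaultdict(float)
    for x, y in data:
        for a in x:
            empirical[(a, y)] += 1.0

    attrs = {a for x, _ in data for a in x}
    if any((a, y) not in empirical for a in attrs for y in labels):
        raise ValueError("every (attribute, label) pair must be observed")

    weights = {f: 0.0 for f in empirical}
    for _ in range(iterations):
        # Expected feature counts under the current model.
        expected = defaultdict(float)
        for x, _ in data:
            for y, p in predict(weights, labels, x).items():
                for a in x:
                    expected[(a, y)] += p
        # GIS update: move each weight by log(empirical / expected) / C.
        for f in weights:
            weights[f] += math.log(empirical[f] / expected[f]) / C
    return weights, labels

# Hypothetical toy data: (weather attributes) -> activity.
data = (
    [(("sunny", "warm"), "play")] * 3 + [(("sunny", "warm"), "stay")]
    + [(("rainy", "cold"), "stay")] * 3 + [(("rainy", "cold"), "play")]
    + [(("sunny", "cold"), "play"), (("sunny", "cold"), "stay")]
)
weights, labels = train_gis(data)
probs = predict(weights, labels, ("sunny", "warm"))
print(probs)
```

On this toy data the fitted model should reproduce the empirical feature counts, so the probability of "play" given ("sunny", "warm") should approach 3/4. Real toolkits typically add smoothing and handle unseen features, which this sketch deliberately leaves out.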
If you find some interesting links that are related to this topic, please feel free to write to me.
Last Change: 18-Aug-2005. Please send any questions to Zhang Le.