Paul Boersma’s writings on the Gradual Learning Algorithm

Paul Boersma’s writings on the Gradual Learning Algorithm

The Gradual Learning Algorithm (GLA) is a learning algorithm for Stochastic Optimality Theory.

1. Definition of the Gradual Learning Algorithm

The following paper defines the GLA and applies it to lexicon-driven learning of perceptual categorization.

1997	How we learn variation, optionality, and probability. IFA Proceedings 21: 43–58. Additional material: Simulation script. Earlier version: Rutgers Optimality Archive 221, 1997/10/12 (incorrect!). Also appeared as: chapter 15 of Functional Phonology (1998).

2. Implementation of the Gradual Learning Algorithm

You can do your own GLA simulations with the Praat program. This also does Tesar & Smolensky’s EDCD as well as "learning from partial representations" (Robust Interpretive Parsing, both for EDCD and GLA). An old version of the manual is:

1999	Optimality-Theoretic learning in the Praat program. IFA Proceedings 23: 17–35 (= Rutgers Optimality Archive 380).

A more recent version is available under Help in the Praat program.

3. Empirical tests

The following paper tests the GLA on fully informed learning of production in Ilokano, Finnish and English.

2001	Paul Boersma & Bruce Hayes: Empirical tests of the Gradual Learning Algorithm. Linguistic Inquiry 32: 45–86. [copyright] Earlier version: Rutgers Optimality Archive 348, 1999/09/29.

This article contains several simulations. You can download data and simulation scripts for use with the Praat program, which allow you to replicate these simulations:

Metathesis and reduplication in Ilokano: the file Ilokano_data.txt contains the grammar and output distributions.
Genitive plurals in Finnish: the file Anttila_data.txt contains the grammar and output distributions.
Reanalyses of Reynolds’s (1994) dissertation: the file Faetar_data.txt contains Reynolds’ Faetar analysis and output distributions, plus our reanalysis. The file Reynolds_tdeletion.txt contains Reynolds’ t-deletion analysis, with output distributions; the file GLA_tdeletion.txt contains our reanalysis, with output distributions.

The following paper tests the GLA on the acquisition of basic syllable structure in Dutch.

2000

Paul Boersma & Clara Levelt:
Gradual constraint-ranking learning algorithm predicts acquisition order.
Proceedings of Child Language Research Forum 30, Stanford, California, pp. 229–237.
Note: the editors messed up the phonetic symbols; the above is the correct version (Rutgers Optimality Archive 361, 1999/08/28).

4. Other applications

The GLA is also used in all my papers on categorization, recognition, metrical phonology (partly informed learning of production), grammaticality judgments, and Parallel Bidirectional Phonology and Phonetics (BiPhon).

5. Non-convergence of the GLA

The GLA is not guaranteed to learn all languages that can be generated by a Stochastic OT grammar. It has been known from early on that some cases of variation with constraints with complementary violations do not converge, because the search space for the learning algorithm has fewer dimensions than N-1 (where N is the number of constraints). In 2005, Joe Pater even found a case where the GLA does not converge on a categorical (non-variation) language that can be represented in OT. I designed the script GLAsuccessRate.praat to find out how likely such a situation is to occur. This script will show you the following facts about learning algorithms:

Languages generated from OT rankings with crucial ties can rarely be learned by any algorithm.
Out of the languages created from a total OT ranking with default settings (11 constraints, 11 tableaus, 11 candidates per tableau, 0 to 3 violations per cell), the GLA (OT with "Symmetric all") learns only 98.5 percent.
OT with “EDCD” and plasticity one, no noise, and no plasticity spreading learns only 30 percent of these languages, because it usually stops too early. This is to be expected on the basis of the incorrect proof by Tesar (1995). See below at 6.
OT with “EDCD” and plasticity one, noise 1e-6, and no plasticity spreading learns all of these languages. See below at 6.
OT with “Demotion only” (loosely speaking a gradual EDCD for Stochastic OT) learns all of these languages (but it fails on variation cases).
OT with “Symmetric all” (i.e. the GLA) and plasticity one, no noise, and no plasticity spreading (i.e. the non-gradual version of the GLA) learns all of these languages (an observation ascribed to Giorgio Magri).
Harmonic Grammar with “Symmetric all” (i.e. stochastic gradient ascent, as in Soderstrom, Mathis & Smolensky 2006) learns all of these languages. Joe Pater claims that this is to be expected on the basis of the proof of the perceptron convergence procedure. For the proof see 7 below.
Maximum Entropy with “Symmetric all” (i.e. stochastic gradient ascent, as in Jaeger 2003) learns all of these languages. This is to be expected on the basis of the proof by Markus Fischer.

A relativizing remark: Boersma (2003) shows that these numbers may be rather different if there is a hidden representation between the input and the output. For Tesar & Smolensky’s example with 12 metrical constraints, GLA learns 70 percent of Tesar & Smolensky’s 124 languages correctly, whereas (corrected) EDCD only learns 60 percent; by the way, stochastic gradient ascent for Harmonic Grammar learns 80 percent. See 7 below.

An analogous script for variation cases is GLAsuccessRateVar.praat.

6. Non-convergence of EDCD

Some of the above conclusions are tested in EDCDsuccessRate.praat, and described in:

2009

Some correct error-driven versions of the Constraint Demotion algorithm.
Linguistic Inquiry 40: 667–686. [copyright]
Earlier version: Rutgers Optimality Archive 980, 2008/07/01 (accepted revision, with proof).
Earlier version: Rutgers Optimality Archive 953, 2008/03/07 (original submission, without proof).

7. A learning algorithm for Harmonic Grammar

A correctness proof for a learning algorithm for Harmonic Grammar (identical to the one used by Soderstrom, Mathis & Smolensky 2006) is described in:

2016

Paul Boersma & Joe Pater:
Convergence properties of a gradual learning algorithm for Harmonic Grammar. [preprint, 2013/03/13]
In John McCarthy & Joe Pater (eds.): Harmonic Serialism and Harmonic Grammar, 389–434. Sheffield: Equinox.
Earlier version: Rutgers Optimality Archive 970, 2008/05/21.

Go to Paul Boersma’s home page