Institute of Phonetic Sciences,
University of Amsterdam,
Proceedings 19 (1995), 71-82.

THE RELATION BETWEEN THE ERROR DISTRIBUTION AND THE ERROR RATE IN IDENTIFICATION EXPERIMENTS

R.J.J.H. van Son

Abstract

The error dispersion is a perplexity-related measure of the effective number of error categories used by listeners in an identification experiment. The error difference is calculated from the error dispersions of several confusion matrices and measures the difference between these confusion matrices. The error dispersion and the error difference are largely independent of the error rate. It is shown in this paper that correlations (and the absence thereof) between the error dispersion and the error rate that are found in identification experiments can be used to infer underlying regularities in the identification process. The use of these techniques is demonstrated with examples from the literature.

1. Introduction

Two different approaches are commonly used to analyse the results of identification experiments. The simplest type of analysis uses the number of incorrect identifications, i.e., the error rate, to compare results from different experiments or for different conditions. The second, much more elaborate method uses multi-dimensional scaling to analyse the internal structure of the "response-space". Both methods have their strengths and weaknesses. The error rate is conceptually the simplest measure and can be used to quantify the differences between experiments and conditions, but it uses only the number of correct responses and ignores how the incorrect ones are distributed. Multi-dimensional scaling techniques can unravel intricate relations between the acoustic structure of stimuli, experimental conditions, and perceptual categories. However, their use is limited: they can be applied only to "full" matrices (not too low an error rate), and asymmetry is hard to handle. Moreover, the level of detail given is often too high and it is difficult to quantitatively compare the results of different experiments. As a consequence, the result of a multi-dimensional scaling analysis needs a lot of "post-processing" to extract the relevant features. Besides these two broad approaches, there are other, more specialised methods of analysis, e.g., distance metrics (d', see e.g., Van Wieringen, 1995), entropy-related measures (Miller and Nicely, 1955; Allen, 1994), or articulation density (Allen, 1994). These approaches have in common that they are either too specialised for the average identification experiment (e.g., d' and articulation density) or non-trivial to interpret (entropy).

In many experiments, it is the distribution of the errors, and not their number, that is the meaningful measure. At the same time, there is often no need for detailed structural information on the distribution of the responses. It can be seen from the previous description that there is a need for analysis tools of intermediate complexity: more detailed than the error rate, but more global than scaling techniques.

This leads us to the following two questions. First, how can we quantify in a meaningful way the distribution of errors in an identification experiment? Second, can we quantify the differences between individual error distributions?

2. The perplexity

The approach taken here to answer the questions posed in the previous section is based on standard Information Theory (Khinchin, 1957; Sveshnikov, 1968; Press et al., 1988; Allen, 1994). A very useful method to analyse stimulus-response relations is to use the perplexity. The perplexity is primarily used to describe syntax complexity in automatic speech recognition (Bahl and Jelinek, 1990). In the context of this paper the perplexity is the "effective" number of responses per stimulus or stimuli per response. The perplexity is calculated from the basic entropy values of a confusion matrix (see figure 1). Based on these basic entropies, the conditional entropies and the perplexities are defined in equations 1 and 2, respectively (Press et al., 1988; Bahl and Jelinek, 1990):

$H(\mathrm{Resp}|\mathrm{Stim}) = H(\mathrm{Stim}, \mathrm{Resp}) - H(\mathrm{Stim})$    (1)
$H(\mathrm{Stim}|\mathrm{Resp}) = H(\mathrm{Stim}, \mathrm{Resp}) - H(\mathrm{Resp})$

$\mathrm{Perplexity}(\mathrm{Stim}) = 2^{H(\mathrm{Resp}|\mathrm{Stim})}$    (2)
$\mathrm{Perplexity}(\mathrm{Resp}) = 2^{H(\mathrm{Stim}|\mathrm{Resp})}$
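
As an illustration, equations 1 and 2 can be evaluated directly from the cell counts of a confusion matrix. The Python sketch below assumes the usual Shannon entropies in bits and a matrix stored as a list of rows (one row per stimulus, one column per response). The function names and the example counts are hypothetical and are not taken from any of the experiments discussed in this paper.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a collection of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def perplexities(matrix):
    """Return (Perplexity(Stim), Perplexity(Resp)) for a confusion matrix.

    matrix[i][j] is the number of times stimulus i received response j.
    """
    h_joint = entropy([c for row in matrix for c in row])   # H(Stim, Resp)
    h_stim = entropy([sum(row) for row in matrix])          # H(Stim)
    h_resp = entropy([sum(col) for col in zip(*matrix)])    # H(Resp)
    h_resp_given_stim = h_joint - h_stim                    # equation 1
    h_stim_given_resp = h_joint - h_resp
    return 2 ** h_resp_given_stim, 2 ** h_stim_given_resp   # equation 2


# Hypothetical counts, for illustration only.
example = [[8, 1, 1],
           [2, 8, 0],
           [0, 0, 10]]
perp_stim, perp_resp = perplexities(example)
```

A matrix without any confusions gives a perplexity of 1 (one effective response per stimulus), and a matrix in which all cells are equally filled gives a perplexity equal to the number of categories.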

3. The error dispersion, d

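The perplexity is dominated by the number of correct responses, i.e., 1 - error rate. However, the aim is to describe the distribution of the incorrect responses. This can be done by "normalizing" the perplexity for the error rate. The result is called the error dispersion, d (equation 3, in which $\varepsilon$ is the error rate; see Van Son, 1994, 1995 for a derivation):

$d_s = 2^{h_s}$;  $h_s = (H(\mathrm{Resp}|\mathrm{Stim}) - H_\varepsilon) / \varepsilon$    (3)
$d_r = 2^{h_r}$;  $h_r = (H(\mathrm{Stim}|\mathrm{Resp}) - H_\varepsilon) / \varepsilon$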

With the entropy of the error-rate defined as:

$H_\varepsilon = -\varepsilon \log_2 \varepsilon - (1 - \varepsilon) \log_2 (1 - \varepsilon)$

The error dispersion, d, is the perplexity with the correct responses "removed". It indicates the effective number of error categories per either stimulus (ds) or response (dr).
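
The effect of this normalization can be tried out on small matrices. The sketch below is not the author's implementation: it assumes that the error dispersion has the form d = 2^h with h = (H - H_e)/e, where H is the conditional entropy of equation 1, e the error rate, and H_e the entropy of the error rate. This form is our reading of equation 3. It further assumes a square matrix with the correct responses on the diagonal. All names and counts are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a collection of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def error_dispersion(matrix):
    """Return (error rate, d_s, d_r) for a square confusion matrix.

    matrix[i][j] counts response j to stimulus i; the diagonal holds the
    correct responses. Assumes at least one error (error rate > 0).
    The formula is an assumed reading of equation 3, not a verified one.
    """
    n = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    eps = (n - correct) / n                        # error rate
    h_eps = entropy([correct, n - correct])        # entropy of the error rate
    h_joint = entropy([c for row in matrix for c in row])
    h_resp_given_stim = h_joint - entropy([sum(row) for row in matrix])
    h_stim_given_resp = h_joint - entropy([sum(col) for col in zip(*matrix)])
    d_s = 2 ** ((h_resp_given_stim - h_eps) / eps)
    d_r = 2 ** ((h_stim_given_resp - h_eps) / eps)
    return eps, d_s, d_r


# Two hypothetical matrices with the same error rate (0.2).
spread = [[8, 1, 1], [1, 8, 1], [1, 1, 8]]   # errors go to two categories
biased = [[8, 2, 0], [0, 8, 2], [2, 0, 8]]   # errors go to one category
```

Under these assumptions both matrices have an error rate of 0.2, but `error_dispersion(spread)` yields d_s = d_r = 2 whereas `error_dispersion(biased)` yields d_s = d_r = 1, i.e., the measure separates two conditions that the error rate alone cannot distinguish.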

Mathematically speaking, the error dispersion is largely independent of the absolute error rate. There remains a small indirect dependence on the distribution of the correct responses. If the distribution of correct responses is unbalanced, i.e., errors are concentrated in a subset of the stimuli and responses, a change in the error rate will induce a change in the distribution of the correct responses. This latter change again will affect the error-dispersion. This residual dependency can be "normalized" out of the error dispersion by decomposing it into an error-dependent part and a part that depends on the distribution of the correct responses. However, this approach falls outside the scope of this paper.

Figure 1. The entropy measures of a confusion matrix. H(Stim, Resp): the entropy of the stimulus-response pairs. H(Stim): the entropy of the stimuli. H(Resp): the entropy of the responses. Si: the number of occurrences of stimulus i. Rj: the number of occurrences of response j. Pij: the number of occurrences of response j to stimulus i. N: the total number of stimuli and responses.

4. The error difference, d

When confusion matrices are combined, i.e., experimental data are pooled, the entropies (H(Stim, Resp), H(Stim), and H(Resp)) of the resulting pooled matrix are larger than the corresponding mean entropies of the contributing matrices. The difference (i.e., Hpooled - Hmean) is a measure for the differences in the distribution of stimuli, responses, and stimulus-response pairs. The same holds for the derived measures of the perplexity and the error dispersion. This difference can be used to define a measure that quantifies the difference in the distribution of the errors between confusion matrices. The error difference, d, is such a measure. It is defined in equation 4. Its general definition takes into account the weighting of the individual matrices, i.e., not all matrices contribute an equal number of errors to the pooled matrix (see Van Son, 1994, 1995 for a derivation):

$d = (h_{pooled} - \langle h \rangle) / H_w$    (4)

In which:

d is the fraction of the total (pooled) number of errors that is not "shared" between the matrices (i.e., errors that are "unique" to individual matrices)

$h_{pooled}$ is calculated over the pooled confusion matrix

$\langle h \rangle$ is the mean of h over the individual matrices (weighted by their contribution to the pooled number of errors)

$H_w$ is the entropy of the weighting factors $w_i$ of the matrices (~ $\log_2$[number of pooled matrices])

In words, the error difference, d, is the fraction of the total (pooled) number of errors that is not "shared" between the matrices (i.e., errors that are "unique" to individual matrices).
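
To make the definition concrete, the sketch below computes an error difference per stimulus for matrices that share the same stimulus and response sets. It rests on several assumptions that the text does not confirm: h is taken to be the base-2 logarithm of the error dispersion, h = (H(Resp|Stim) - H_e)/e with e the error rate; the weight of each matrix is its share of the pooled number of errors; and the error difference is (h_pooled - mean h) divided by the entropy of these weights. Non-overlapping stimulus or response sets are not handled. All names and counts are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a collection of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def error_count(matrix):
    """Number of off-diagonal (incorrect) responses in a square matrix."""
    total = sum(sum(row) for row in matrix)
    return total - sum(matrix[i][i] for i in range(len(matrix)))


def log_dispersion(matrix):
    """h_s, the assumed base-2 logarithm of the error dispersion d_s."""
    n = sum(sum(row) for row in matrix)
    errors = error_count(matrix)
    h_joint = entropy([c for row in matrix for c in row])
    h_resp_given_stim = h_joint - entropy([sum(row) for row in matrix])
    return (h_resp_given_stim - entropy([n - errors, errors])) / (errors / n)


def error_difference(matrices):
    """Assumed error difference for two or more matrices with errors."""
    errors = [error_count(m) for m in matrices]
    total_errors = sum(errors)
    # Mean of h, weighted by each matrix's share of the pooled errors.
    h_mean = sum(e / total_errors * log_dispersion(m)
                 for e, m in zip(errors, matrices))
    # Pooled matrix: cell-by-cell sum of the individual matrices.
    pooled = [[sum(cells) for cells in zip(*rows)] for rows in zip(*matrices)]
    h_weights = entropy(errors)   # entropy of the weighting factors
    return (log_dispersion(pooled) - h_mean) / h_weights


# Hypothetical matrices: same error rate and dispersion, different errors.
m1 = [[8, 2, 0], [0, 8, 2], [2, 0, 8]]
m2 = [[8, 0, 2], [2, 8, 0], [0, 2, 8]]
```

With these assumptions, `error_difference([m1, m2])` is 1 (no error is shared between the two matrices) and `error_difference([m1, m1])` is 0 (all errors are shared), which agrees with the interpretation of the error difference as the fraction of "unique" errors.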


Figure 2. Prototypical relations between the Error dispersion and the Error rate (see text)

The error difference is based on the logarithm of the error dispersion (i.e., on h) which is normalized with respect to the size of the confusion matrix, i.e., the error dispersion is the number of error categories per stimulus or response category. As a result of how $h_{pooled}$ and $\langle h \rangle$ are calculated, differences between individual matrices due to non-overlapping stimulus or response sets are discounted from the error difference itself. Furthermore, the error difference, d, is normalized with respect to the combined stimulus or response sets. This means that the error difference is reduced when non-overlapping confusion matrices are used compared to when only the overlapping parts of these confusion matrices are used.

5. The relation between the Error Rate and the Error Dispersion

Although the error dispersion is mathematically (almost) independent of the error rate, in an actual experiment you will be looking for relations between these measures. The question is then to determine how the error dispersion changes with increasing (decreasing) error rates. The prototypical relations are given in figure 2.

In figure 2 several possible correlations between the error rate and the error dispersion are drawn as straight lines. One possibility is that the two are uncorrelated because one of them is essentially fixed. In general, if the errors are dominated by a response bias (e.g., long/short vowel confusions) that varies in strength, the distribution of the errors does not change when the error rate itself changes. As a result, the error dispersion will be nearly fixed with error rate (the horizontal line in figure 2). At the other extreme, all errors can be concentrated in only a subset of the tokens which are always identified incorrectly (e.g., short realizations of intrinsically long vowels versus short vowels). As the specific incorrect responses vary with the experimental conditions, but the number of incorrect responses does not, the result will be a nearly fixed error rate with varying error dispersion (the vertical line in figure 2). In this case, it is an open question who made the incorrect identifications, the experimenter or the subjects.

An experiment in which either the error dispersion or the error rate is fixed will be quite exceptional. A more likely result is a positive correlation between the error dispersion and the error rate. Such a result indicates that new errors are distributed over new responses. That is, as a stimulus elicits more errors, these errors are distributed over more responses. In short, a growing error rate indicates a growing confusion. The actual slope of the linear regression line is a measure of the number of error categories that are added per "new" error.
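
The slope and the strength of such a correlation can be obtained with an ordinary least-squares fit of the error dispersion on the error rate. The sketch below uses only the Python standard library; the function name is hypothetical and the data points are synthetic (exactly linear, chosen for illustration), not results from any experiment.

```python
import math


def slope_and_correlation(error_rates, dispersions):
    """Least-squares slope and Pearson correlation of dispersion on rate."""
    n = len(error_rates)
    mean_x = sum(error_rates) / n
    mean_y = sum(dispersions) / n
    sxx = sum((x - mean_x) ** 2 for x in error_rates)
    syy = sum((y - mean_y) ** 2 for y in dispersions)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(error_rates, dispersions))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Synthetic conditions: error rate (as a fraction) and error dispersion.
rates = [0.1, 0.2, 0.3, 0.4]
disps = [1.5, 2.0, 2.5, 3.0]
slope, r = slope_and_correlation(rates, disps)
```

For these synthetic points the slope is 5 and the correlation is 1: the error dispersion grows by half an error category for every 10 percentage points of additional errors. Note that the unit of the slope depends on whether the error rate is expressed as a fraction or as a percentage.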


Figure 3. Token construction for the experiment of Pols and Van Son (1993). Tokens were synthesized with formant tracks as indicated in the two plots. Nine different F1/F2 midpoint values were used, corresponding to the Dutch phonemes shown below the plots. Each parabolic formant track shape was synthesized with 4 durations. The tokens with stationary formant tracks were synthesized with 6 durations. Because of limitations of formant synthesis, not all target points could be synthesized with all formant track shapes.

A very interesting type of (cor-)relation is when the error dispersion reduces with increasing error rate, i.e., a negative correlation. This can happen when there is a fixed "background" of errors with a high error dispersion superimposed on a varying response bias with a low error dispersion. The error dispersion is then some kind of combination (not necessarily linear) of the individual contributions. In this situation, the (combined) error dispersion decreases in size when the relative strength of the bias, and hence the error rate, increases. An example could be an experiment in which long/short vowel confusions are induced by varying token durations and indiscriminate errors by a fixed level of added noise (a fixed, low, Signal-to-Noise Ratio). In such a case, the error dispersion will reduce when the response bias becomes stronger, and as a consequence, the error rate becomes higher.
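
This negative correlation can be reproduced with a toy model. The sketch below considers a single stimulus whose errors are a mixture of a fixed "background" spread evenly over a number of incorrect categories and a response bias of varying strength towards one of these categories. The error dispersion is taken here as 2 raised to the entropy of the errors, which for a single stimulus coincides with "removing" the correct responses from the perplexity. The model, the function name, and all parameter values are assumptions made for illustration only.

```python
import math


def mixture_dispersion(background, bias, n_wrong):
    """Error rate and error dispersion for a background-plus-bias mixture.

    background: probability of an indiscriminate error, spread evenly over
                n_wrong incorrect categories (must be > 0).
    bias:       extra probability of one specific incorrect category.
    """
    eps = background + bias                       # total error rate
    # Distribution of the errors over the incorrect categories.
    favoured = (bias + background / n_wrong) / eps
    others = [background / (n_wrong * eps)] * (n_wrong - 1)
    h = -sum(p * math.log2(p) for p in [favoured] + others if p > 0)
    return eps, 2 ** h


# Fixed background of 10% errors over 8 categories, growing response bias.
results = [mixture_dispersion(0.10, bias, 8) for bias in (0.0, 0.1, 0.2, 0.4)]
```

Without a bias the errors are spread evenly and the error dispersion equals the number of incorrect categories (8). As the bias grows, the error rate increases while the error dispersion decreases towards 1.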

6. Examples from the literature

The above theory is tested on a number of papers taken from the literature. When possible, the correlation coefficients are tested for statistical significance. However, statistically significant correlations are not used to infer a causal relation with experimental conditions. Such an interpretation would require a detailed analysis of the underlying experiments which is outside the scope of this paper.


Figure 4. Results of the identification experiment of Pols and Van Son (1993). See figure 3 for token construction. Left panel: Error dispersion, ds, versus error rate for each duration (ignoring long/short confusions). The light hatched squares are the results for the "stationary" tokens with durations of 12.5 and 6.3 ms. Right panel: Error difference, ds, between tokens with "opposite" formant track curvatures.

6.1. The influence of formant track shape on vowel identification

In an experiment to determine how vowel identity is influenced by the shape of formant tracks, Pols and Van Son (1993) performed a listening experiment with synthetic stimuli. For each of nine different F1/F2 "target" points distributed over the vowel triangle, five formant track shapes were constructed: one stationary and four with parabolic F1 or F2 tracks. All five track types had the same formant values in the center. These track shapes were synthesized with 4 or 6 different durations (figure 3).

Subjects were asked to identify the sounds as Dutch vowels. The results indicated a shift in the responses corresponding to an averaged formant value for the synthetic vowels rather than some form of "perceptual overshoot". The associated error dispersions, error rates, and error differences are calculated from the raw data (figure 4; Pols and Van Son, 1993, do not give error rates).

Figure 4 shows that there was a large spread in the error rates for the different conditions, from below 10% to over 40%. The individual conditions with respect to the formant track shape are nicely separated in the plot of error dispersion versus error rate. Up- and downward parabolic F1 tracks have distinct error dispersions. Up- and downward parabolic F2 tracks have distinct error rates. The short duration (12.5 and 6.3 ms) stationary tokens have high error rates and high error dispersion, which separate them from the non-stationary tokens which have comparable error rates but lower error dispersions. The fact that the error dispersions of the non-stationary tokens are all around 1 indicates that the errors are concentrated in a single category for each stimulus.

The separation between tokens with upward and downward pointing formant tracks is even more evident when the error difference is used (figure 4). Although the error dispersions of tokens with up- and downward pointing F2 tracks are equal, the error difference is over 0.5, indicating that more than half the errors are different (figure 4). For the F1 shapes, large differences are found too. However, here the error difference is reduced due to the large difference between the sets of stimuli used (6 versus 9 formant "targets", see figure 3). In total, the large error differences indicate that the errors are concentrated in different categories for different formant track shapes.


Figure 5. Token construction for the identification experiment of Van Son and Pols (1995). Starting from CVC fragments excised from read text, tokens were constructed by progressively removing more and more of the context and the transition region of the target phoneme, either the vowel or the pre- or post-vocal consonant. Either the vowels, or the pre- and post-vocal consonants were identified in different experiments.

Using a different type of analysis, Pols and Van Son (1993) concluded that the parabolic formant tracks induced an averaging in the perception of the formant values. The result is that the responses are shifted towards vowels with a lower formant target value for tokens with downward pointing (∩) formant tracks and towards vowels with a higher formant target value for upward pointing (∪) formant tracks. The errors in both cases would exclude each other. This behaviour would result in a large error difference between tokens with upward pointing and tokens with downward pointing formant tracks, as is indeed manifest in figure 4.

6.2 The influence of context on identification

Using natural (read) speech, Van Son and Pols (1995a,b) investigated how the presence or absence of nearest neighbour context affected the identification of vowels and consonants from read CVC tokens from a long, meaningful text.

Although the error rates for CV- and VC-type tokens are comparable in size, figure 6 shows that the underlying cause of the errors could be very different. The VC-type tokens show a strong correlation between error dispersion and error rate, akin to the "general confusion" line in figure 2. The CV-type of tokens combine a virtually constant error dispersion with a widely varying error rate. Vowel identification (the CVC-type tokens) behaves more like consonant identification in CV-tokens than in VC-tokens. This suggests that the errors in the identification of pre-vocalic consonants and vowels (CV and CVC in figure 6) are concentrated in a few (approximately two) "response biases", whereas the errors in the identification of post-vocalic consonants (VC in figure 6) spread out over a large number of response categories.

Figure 6. Results of the vowel and consonant identification experiments of Van Son and Pols (1995a,b). For a specification of the stimuli, see figure 5. Long/short vowel errors, or consonant voicing errors, were ignored. Left panel: Error dispersion, ds, versus error rate; *: p <= 0.01, two-tailed. Right panel: Error difference, ds, between tokens with accented and non-accented vowels.

Van Son and Pols (1995) found a consistent difference in error rate between tokens containing accented and tokens containing unaccented vowels. This difference can be seen, more or less, in figure 6, where the error rates for the unaccented tokens (open symbols) are generally higher than the error rates for the accented tokens (filled symbols). Van Son and Pols (1995) showed that this difference was statistically significant. The error differences between the accented and unaccented tokens show a difference between vowel identification and consonant identification. For the identification of the post-vocalic consonants (VC) this is to be expected: the error dispersion increases with the error rate, so if the unaccented tokens induce more errors, these "extra" errors should be different. For the identification of the pre-vocalic consonants (CV), the high error difference indicates that the fixed error dispersions for accented and unaccented tokens are the result of different sets of incorrect responses.

Combined, the results of this analysis indicate that there are differences between the identification of pre-vocalic and post-vocalic consonants. It seems that the patterns of identification errors of pre-vocalic consonants are more like those of vowels with respect to the number of error categories, and more like those of post-vocalic consonants with respect to the differences between accented and unaccented tokens. In their paper, Van Son and Pols (1995) do indeed conclude that there is a (cor-)relation between the identification errors in pre-vocalic consonants and vowels, but not between post-vocalic consonants and vowels. They also find that the differences between accented and unaccented tokens are different for vowel identification and consonant identification.

Several studies compared identification results for vowels presented with and without their original, syllabic, context. Both Huang (1991) and Kuwabara (1985) published confusion matrices that are relevant to this question. These are used here.


Figure 7. Vowel identification results for presentation with and without the original, syllabic, context. The results are taken from Huang (1991) and Kuwabara (1985). Left panel: Error dispersion versus error rate. Huang (1991): CVC (filled symbols) versus the excised (isolated) vowel (open symbols) for 4 speakers. Kuwabara (1985): Trisyllabic VVV sequences (filled symbols) versus the excised, medial vowel in isolation (open symbols), 4 speakers pooled. Right panel: Error difference between presentation with and without the original syllabic context.

From both studies it can be concluded that vowels presented in context (filled symbols) are identified better than those presented in isolation (open symbols). The range of error rates is large and clearly separates the two conditions in each experiment. At the same time, the error dispersions for the two conditions overlap. It seems that the increase in the error rate caused by the removal of the context only strengthens existing errors (or response biases), and does not introduce new errors. This idea is supported by the error differences which are quite small (figure 7, right-hand panel, compare with figures 6 and 4 and be aware of the differences in vertical scale). Inspection of the confusion matrices of Huang and Kuwabara confirms this picture. There is a strong overlap of identification errors between presentation of vowels in context and in isolation.

6.3 Vowel reduction

Vowel reduction is an obvious field of research for testing the use of this new analysis technique. In figure 8, some results of the study of Van Bergem (1993) are presented. Van Bergem (1993) studied the influence of stress on vowel reduction in Dutch words. Identical syllables were recorded when pronounced in isolation, as the stressed and unstressed syllables of content words, and pronounced as function words. An English example would be: can, candy, canteen, can. The words were part of carrier sentences and were pronounced both with and without sentence accent (except for the syllables pronounced in isolation). For the actual identification experiment, the excised vowel realizations were presented in isolation. These experimental conditions were chosen to induce varying levels of vowel reduction.

From figure 8, it can be seen that for all three speakers there is a strong correlation between the error dispersion and the error rate. Clearly, the different conditions in which the vowels were pronounced induced a varying amount of vowel reduction. This reduction translated into a large variation in error rates (~15-75%) and error dispersion (~1-4). The strong correlation between error dispersion and error rate suggests that many distinctive features of the vowel realizations are progressively affected ("reduced") by vowel reduction, but in a consistent way. It also suggests that the error rate is the only relevant measure with respect to vowel reduction.

Figure 8. Results of Van Bergem (1993). Vowels were pronounced under different conditions that were designed to vary the amount of vowel reduction. The results for all three speakers are plotted: Sp 1, Sp 2, and Sp 3. R: correlation coefficient; *: p <= 0.01, two-tailed.

6.4 Mixed speaker vowel identification

As a final example, results are analysed from a series of experiments, aimed at determining whether listeners use context to normalize for speaker variation (Verbrugge et al., 1976; Strange et al., 1976). Tokens from several different speakers were presented in mixed order. In the two experimental "conditions", the type of context varied: none, full CVC, and only /pVp/, as well as the type of precursor sentence: maximal vowel space (/hi/ /ha/ /hu/), minimal vowel space (/hÈ/ /hoe/ /hU/), and "natural" precursor sentences (figure 9).

Despite the small number of points available, some interesting differences between the conditions can be seen. The vowels and syllables presented in isolation (open circles, figure 9) give the familiar picture of increasing error dispersion with increasing error rates (the correlation is not statistically significant) that signifies "general confusion" (figure 2). However, the syllables presented in a sentence context show a completely different picture. One point, the /pVp/ syllable presented in a natural sentence context, combines a markedly increased error rate with a lower error dispersion. From the discussion of figure 2, it is known that this is the hallmark of a strong response bias superimposed on a background of "random" errors. Verbrugge et al. (1976) reach the same conclusion from a close inspection of the response data.

7. Discussion and conclusions

In the previous sections the results of several studies were re-analysed using the relation between the error dispersion and the error rate. These studies were not designed to be analysed this way, so they serve mainly as an illustration of the potential use of this technique. I do not intend a re-interpretation of their results. However, it is clear that the plots of error dispersion versus error rate can help a lot in determining whether experimental conditions do make a difference for identification (figure 4). Furthermore, the strength and direction of the correlation can indeed suggest what kind of process underlies the changes in the responses (figures 6, 7, 8, and 9). It was also demonstrated that conditions that elicit comparable error rates and error dispersion can sometimes be distinguished because of the large differences in the actual incorrect responses observed (figures 4 and 6). The reverse was also found, where large differences in error rates were based on the same type of errors (figure 7).


Figure 9. Speaker normalization results from Verbrugge et al. (1976) and Strange et al. (1976). The error dispersion is plotted against the error rate. R: correlation coefficient; *: statistical significance, p <= 0.025, two-tailed. Stressed and destressed pVp syllables presented with /hi/ /ha/ /hu/, /hÈ/ /hoe/ /hU/, and "natural" precursor sentences (six points, but they overlap). Vowels, CVC syllables, and pVp syllables (+/- stressed) presented in isolation (four points).

Therefore, it can be concluded that the error dispersion, d, is a meaningful measure of the number of distinct errors per stimulus or response (cf., perplexity). The error dispersion can separate cases for which the error rate alone is insufficient. Combined with the error rate, it also can suggest the kind of processes that underlie the differences in responses.

Furthermore, the error difference, d, quantifies the (dis-)similarity of stimuli and conditions. It is useful to assess whether there are differences in the pattern of responses, independent of any differences in error rate or error dispersion.

8. Acknowledgements

I would like to thank dr. Dick van Bergem for making available the data of his identification experiments. This research was made possible by grant 300-173-029 of the Dutch Organization of Research (NWO).

9. References