Several potentially confusing terms are used in these studies: sentence accent, pitch accent and prominence. The literature provides no unique definition of sentence accent, but it evidently refers to some major accent in a sentence. Pitch accent is an accent-lending pitch movement whose realization has consequences for duration and intensity (Gussenhoven, 1984). A term such as pitch accent is embedded in a linguistic and theoretical framework: an intonation grammar or an intonation system, such as the IPO intonation grammar ('t Hart et al., 1990), the TOBI intonation system (Silverman et al., 1992), or the Rise/Fall/Connection model (Taylor, 1992). Prominence refers to a greater perceived strength of words in a sentence; put another way, such words are perceived as standing out from their environment (Ladd, 1996; Lehiste, 1970; Terken, 1991). Lexical stress is defined in the lexicon. Realized syllable stress makes a syllable more prominent than the surrounding syllables (see table below).
Phenomena like pitch accent and sentence accent lead to perceived prominence. In the case of sentence accent and pitch accent, words are compared with adjacent words in the sentence. In the case of realized syllable stress, syllables are compared with adjacent syllables. Realized syllable stress is perceived as the most prominent syllable in a word, whereas pitch accents are perceived as the most prominent words in a sentence.
| | domain | definition | perception of naive listeners |
|---|---|---|---|
| prominence | sentence | words perceived as standing out from their environment | emphasized words |
| sentence accent | sentence | major accents | emphasized words |
| pitch accent | sentence | an accent-lending pitch movement perceived by an expert | emphasized word |
| realized syllable stress | word | syllable perceived as standing out from its environment | emphasized syllable |
| lexical stress | word | defined in the lexicon | could be perceived as an emphasized syllable |
Several attempts have been made to classify accented and non-accented words (Kießling, 1996; Kompe, 1997; Taylor, 1993; Ten Bosch, 1993; Strom, 1995; Wightman and Ostendorf, 1994). The literature offers several approaches to the initial labeling of spoken utterances as accented or non-accented for training and testing purposes.
One approach is to label the pitch contour according to the IPO intonation grammar (Ten Bosch, 1993; Taylor, 1993). In the research of Ten Bosch, four intonation experts transcribed a speech corpus. The experts were asked to transcribe utterances by using the IPO intonation categories (the labels "1" to "5" for pitch rises and the labels "A" to "E" for falls, plus the "P" for a peak realized in one syllable). In the IPO intonation grammar pitch movements such as "A", "C" and "1" and "3" are accent-lending. With this labeled speech material a predictive model is developed and tested.
Taylor used four elements to describe tune (pitch contour): the types "H" (high) and "L" (low) describe the pitch accents, "C" describes a phonologically significant connection element, and "B" describes the rise that may occur at phrase boundaries. In the research of Strom (1995) the speech material is labeled according to the TOBI intonation system (Silverman et al., 1992).
The disadvantage of these approaches is that only the pitch contour is taken into account, although the TOBI intonation system also pays attention to break indices. For rule-synthesis purposes, intonation systems such as the TOBI intonation system or the IPO intonation grammar are sufficient. However, speakers of Dutch are not obliged to realize an accent with a pitch movement alone; other acoustical features such as intensity, duration and spectral quality can also mark accents. If the aim is to improve speech recognition, it is not wise to limit oneself to accent-lending pitch movements either. Rather, the variability between speakers in realizing accents and the use of different acoustical cues should also be taken into account (Kraayeveld et al., 1991).
Another approach is to label the utterances for accent or non-accent based on linguistic, semantic and phonological information (Batliner et al., 1997). In the research of Kompe (1997) and Kießling (1996), the initial labeling of accent versus non-accent is done automatically for the ERBA corpus (Erlanger Bahn Anfragen). They assume that in each prosodic phrase one word is more prominent than all other words. Following this line, they apply rules such as: the rightmost content word of a phrase is a good candidate for sentence accent. They use these rules to label their speech material. With the help of this labeled database they build and test a predictive model based on acoustical information. Through this initial labeling, certain words are marked as accented although the speaker has not necessarily realized them as such. The predictive model, which classifies accent or non-accent with the help of acoustical features, is then trained on misleading features, because the labeled accent might not be realized in the spoken utterance.
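The rule-based labeling described above can be illustrated with a minimal sketch. The function-word list and the function name are hypothetical and chosen only for illustration; the actual rule set used for the ERBA corpus is not given here and is certainly richer.

```python
from typing import List, Optional

# Hypothetical, deliberately tiny function-word list (illustration only).
FUNCTION_WORDS = {"the", "a", "an", "to", "of", "in", "on", "at", "is", "and"}


def rightmost_content_word(phrase: List[str]) -> Optional[int]:
    """Return the index of the rightmost word that is not a function word.

    Returns None when the phrase contains function words only.
    """
    for i in range(len(phrase) - 1, -1, -1):
        if phrase[i].lower() not in FUNCTION_WORDS:
            return i
    return None


print(rightmost_content_word(["the", "train", "to", "Hamburg"]))  # 3
```

Such a label depends only on the word sequence, which is exactly why it can disagree with what the speaker actually realized.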
In our present approach, prominence is initially marked via perceptual judgments. Naive listeners will be asked to mark the words which are spoken with emphasis (this is an operational definition of prominence). The words which are perceived as prominent by the majority of the listeners are defined as the prominent words. With these prominent and non-prominent words a predictive model will be trained and tested.
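The majority criterion can be sketched as follows. The data layout (one list of 0/1 marks per listener) and the reading of "majority" as strictly more than half of the listeners are assumptions made for this example.

```python
from typing import Dict, List


def majority_prominent(marks: Dict[str, List[int]]) -> List[bool]:
    """Return, per word, whether more than half of the listeners marked it.

    `marks` maps a (hypothetical) listener id to a list of 0/1 marks,
    one per word of the sentence.
    """
    if not marks:
        return []
    n_listeners = len(marks)
    n_words = len(next(iter(marks.values())))
    # Count how many listeners marked each word position.
    counts = [sum(m[i] for m in marks.values()) for i in range(n_words)]
    # "Majority" is read here as strictly more than half of the listeners.
    return [2 * c > n_listeners for c in counts]


marks = {
    "listener_1": [0, 1, 0, 1],
    "listener_2": [0, 1, 0, 0],
    "listener_3": [1, 1, 0, 0],
}
print(majority_prominent(marks))  # [False, True, False, False]
```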
The sentences presented in the perception experiment are not delexicalized, so the listener will, in addition to the acoustic information, also have an expectation about which words are the prominent ones based on top-down information. Besides the influence of top-down information, we can be certain that there is something in the speech signal that makes words prominent. We must assume this because acoustical features, and not linguistic or semantic features, are extracted from the speech signal to predict the prominence of a word. De Pijper and Sanderman (1994) found no effect of top-down information when the listeners had to mark boundaries in delexicalized speech. However, if it turns out that there is still a strong effect of top-down information and that this is a disturbing factor, a pen-and-paper experiment is a possible option to test this effect.
Some listeners tend to mark more words per sentence as prominent than other listeners do (for more details see Streefkerk et al., 1997). Therefore the individual prominence judgments will be corrected per listener before the cumulative judgments are used as a prominence indicator. A possible correction for individual listener behavior is to divide each score of a listener by the total number of prominence judgments of that listener.
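A minimal sketch of this correction, assuming each listener's marks are stored as one list of 0/1 values covering all words that listener judged (the layout and names are illustrative, not taken from the experiment):

```python
from typing import Dict, List


def normalized_scores(marks: Dict[str, List[int]]) -> List[float]:
    """Sum per-word marks after dividing each listener's marks by that
    listener's total number of prominence judgments."""
    n_words = len(next(iter(marks.values()))) if marks else 0
    cumulative = [0.0] * n_words
    for listener_marks in marks.values():
        total = sum(listener_marks)
        if total == 0:
            continue  # a listener who marked nothing contributes nothing
        for i, mark in enumerate(listener_marks):
            cumulative[i] += mark / total
    return cumulative


marks = {
    "generous_listener": [1, 1, 1, 1],
    "sparing_listener": [0, 1, 0, 0],
}
print(normalized_scores(marks))  # [0.25, 1.25, 0.25, 0.25]
```

In this toy example the single mark of the sparing listener weighs as much as all four marks of the generous listener together, which is the intended effect of the correction.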
In further research a subset of some 500 sentences will be randomly selected from the Polyphone corpus. With this subset a perception experiment will be done in which naive listeners have to mark the words spoken in an emphasized way. The results of this perception experiment will give us the prominent and non-prominent words. With these prominent and non-prominent words we can train and test an artificial neural network for an automatic prominence classification task. We must assume that there are acoustical cues in the speech signal, which lead to the perception of prominence, and that not only the top-down information is responsible for the prominence judgments. The automatic classification task (classify prominent and non-prominent words) will be done with acoustical features.
With the help of both an expert perception experiment (judging pitch accent), and a naive-listener perception experiment (judging prominence), we want to investigate the relation between pitch accent and prominence. With an expert perception experiment and the prominence judgments of naive listeners it could be tested if words with a pitch accent are a subset of the prominent words. In figure 1 the relation between realized syllable stress, prominent words, and pitch accent is displayed. Realized syllable stress is the specification of a syllable and its domain is a word, whereas the other terms have the sentence as their domain (for further explanations, see section 1).
We expect that all pitch accents correspond to a prominence judgment, but that not all prominent words correspond to pitch accents. Such an expert perception experiment will be done with a subset of about 80 sentences from the Polyphone corpus. Independent of each other, 5 experts will label these 80 sentences for pitch accent. These data can then be compared with the prominence judgments of the perception experiment. We expect that the pitch-accented words are a subset of the prominent words. Or to say it in a different way, each pitch accented word must correspond with a prominent word but not each prominent word must correspond with a pitch accent (see figure 1).
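Once both sets of labels exist, the expected subset relation is straightforward to check. The sketch below assumes that words are identified by some hashable id (here plain integers) and that the five expert labelings have already been merged into one set of pitch-accented words; how that merging is done is left open.

```python
from typing import Dict, Iterable, List, Union


def subset_report(pitch_accented: Iterable[int],
                  prominent: Iterable[int]) -> Dict[str, Union[bool, List[int]]]:
    """Compare pitch-accented words with prominent words."""
    accented = set(pitch_accented)
    prom = set(prominent)
    return {
        # True when every pitch-accented word is also prominent.
        "is_subset": accented <= prom,
        # Counterexamples to the hypothesis.
        "accented_not_prominent": sorted(accented - prom),
        # Prominent words presumably marked by other cues than pitch.
        "prominent_without_accent": sorted(prom - accented),
    }


print(subset_report(pitch_accented=[2, 5], prominent=[2, 5, 7]))
```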
A small pilot experiment with monotone pitch with 30 sentences was already done. In this experiment the perception of prominent words in sentences with monotone pitch is studied (see for more details, Streefkerk et al., 1997). The results of this perception experiment with monotone pitch show that listeners are still able to mark consistently some of the prominent words. From the 45 words perceived as prominent in the perception experiment under normal conditions, the majority of the listeners still perceive 6 words as prominent in the monotonized speech. This is about 13.3 % of the prominent words. In order to get a better overview of the effect of the perception of prominence with a monotone pitch this pilot experiment will be repeated with more sentences. A perception experiment with monotone pitch with about 100 sentences is suggested. The subset of 100 sentences could be selected from the larger subset of 500 sentences for which listeners had to mark prominence under normal conditions. It might be better to select those sentences for which the subjects mark quite unanimously words as prominent. For this selected subset of sentences, the pitch will be made monotonous. Naive listeners have to mark the emphasized words under this condition. The result of this perception experiment with monotone pitch will be compared with the results of the original perception experiment without any speech signal manipulations. We expect that a lot of originally prominent words are not perceived as prominent anymore. But as in the pilot perception experiment with monotone pitch there will still be some words which are perceived as prominent by the majority of the listeners (see for more details Streefkerk et al., 1997).
It would be interesting to see which acoustical cues in these words are responsible for the perception of prominence. It could be the vowel duration, the intensity or the spectral quality. To find this out, another perception experiment is suggested. Only the monotonized sentences with perceived word prominence will be used in this perception experiment. We suggest making 3 subsets of manipulated sentences: one set with duration manipulations, one set with intensity manipulations and one set with spectral quality manipulations. Phrasing and pauses could also have an effect on the perception of prominence, but as a start we suggest manipulating only intensity, duration and spectral quality. If it turns out that phrasing and pauses also have a strong effect on the perception of word prominence, these cues can be manipulated in future research too.
The following manipulations are suggested:
*Duration:
The duration of all short and long vowels is made equal to the mean duration of the short and long vowels in all sentences. The effect of final lengthening of the vowels at the end of the sentence as well as the lengthening of the vowels in lexically stressed syllables will be ignored. The results of the perception experiment will indicate whether it will be necessary to correct for these effects also.
*Intensity:
The intensity of the vowels will be made the same as the mean intensity of the vowels in all sentences. Maybe it is useful to distinguish between open and closed vowels. Then the intensity of the open vowels is made the same as the mean intensity of open vowels and the intensity of the closed vowels is made as intense as the mean intensity of the closed vowels.
*Spectral quality:
All vowels will be reduced to schwa, so that the spectral quality is the same as the schwa spoken by that same speaker. The duration and the intensity of the original vowel must be kept, so only the effect of the spectral quality is taken away from the speech material. The other acoustical features such as intensity and duration are still available to the listener.
The sentences, consisting of one set with manipulated duration, one set with manipulated intensity and one set with manipulated spectral quality, together will form one perception experiment. These manipulated sentences will be mixed and presented to naive listeners, with the task to mark the prominent words.
The results can be put in correspondence matrices, as was done in the pilot perception experiment with monotone pitch (for more details see Streefkerk et al., 1997). The 3 manipulation sets will be compared with the results from the perception experiment with monotone pitch. This gives 3 correspondence matrices. The listener judgments of the 3 manipulation sets will also be compared with each other. This gives 3 more correspondence matrices. With the help of these 6 correspondence matrices, the influence of duration, energy, and spectral information on the perception of prominence is studied, and these correlates can be ordered in terms of efficacy.
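The bookkeeping for the 6 comparisons can be sketched as follows. The exact form of the correspondence matrices in Streefkerk et al. (1997) is not given here, so the sketch assumes a simple 2x2 cross-tabulation of per-word decisions (prominent versus not prominent) in two conditions. The judgments are invented toy data.

```python
from itertools import combinations
from typing import Dict, List


def correspondence_matrix(a: List[bool], b: List[bool]) -> List[List[int]]:
    """2x2 counts: rows = condition a (prominent, not), columns = condition b."""
    if len(a) != len(b):
        raise ValueError("both conditions must cover the same words")
    matrix = [[0, 0], [0, 0]]
    for in_a, in_b in zip(a, b):
        matrix[0 if in_a else 1][0 if in_b else 1] += 1
    return matrix


# Toy per-word majority decisions for each condition (invented values).
conditions: Dict[str, List[bool]] = {
    "monotone": [True, True, False, False, True],
    "duration": [True, False, False, False, True],
    "intensity": [True, True, False, True, False],
    "spectral": [False, True, False, False, True],
}
manipulated = ["duration", "intensity", "spectral"]
# 3 comparisons against the monotone condition plus 3 pairwise comparisons.
pairs = [("monotone", m) for m in manipulated] + list(combinations(manipulated, 2))
matrices = {p: correspondence_matrix(conditions[p[0]], conditions[p[1]]) for p in pairs}

print(len(matrices))  # 6
print(matrices[("monotone", "duration")])  # [[2, 1], [0, 2]]
```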
Kompe (1997) and Kießling (1996) train several predictive models. In their research the initial accent versus non-accent labeling was done automatically based on linguistic information. The predictive models (artificial neural nets, HMMs and hybrid models) are then trained and tested to recognize the accented and non-accented words based on many acoustical features. The number of input features is very high (up to 256 features). The recognition rate is up to 82%. A disadvantage is that nothing is known about the importance of the individual features. A recognition rate of 78% for accented versus unaccented syllables is reached in the research of Strom (1995). In the research of Ten Bosch (1993) a classification was done based on F0 information only. The recognition rate was up to 81%.
Table 3: A summary of the feature extraction of different studies.
| Research of | Time interval | Labeling | Pitch features | Intensity features | Duration features | Lexical features |
|---|---|---|---|---|---|---|
| Wightman and Ostendorf | syllables | hand labeled prominence | max s / mean s+1; max s / max s-1; max s / mean s; min s / mean s; ratio of the final F0 to the mean F0 within a sentence | mean energy in the syllable | pre-boundary lengthening; smean norm - s; pause duration | lexical stress; word-final position |
| Kießling and Kompe | syllables and words | automatic labeling on semantic and linguistic information | mean and/or median; min, max, onset and offset; the position of these values relative to the end of the syllable; regression coefficient; root mean squared differences between F0 and the regression line | mean or median energy; max energy; position of the max energy relative to the end; regression coefficient of energy contour; root mean squared differences between energy and the regression line | speaking rate; average of the phoneme duration | class of the phoneme; lexical stress; word-final position |
| Strom | 10 ms frame | tone labels similar to TOBI | interpolated F0; 3 components of F0 using different bandpass filters; derivatives of the 3 functions | nasal band (30-300 Hz); sonorant band (300-2300 Hz); fricative band (2300-6000 Hz) | | |
| Ten Bosch | vowel onset | IPO intonation grammar | t-1 + 60 ms; t0 - 60 ms; t0 + 60 ms; t1 - 60 ms; t1 + 60 ms | | | |
s = syllable onset time.
s-1 = previous syllable onset time.
s+1 = next syllable onset time.
t = vowel onset time.
t-1 = previous vowel onset time.
t+1 = next vowel onset time.
fr = frame.
max = maximum.
min = minimum.
smean norm = the mean normalized duration of syllable duration.
pre-boundary lengthening = pre-boundary lengthening measured by the mean normalized duration of the syllable rhyme.
Table 3 lists the input features used in the studies of Kießling (1996), Kompe (1997), Taylor (1993), Ten Bosch (1993), Strom (1995), and Wightman and Ostendorf (1994). It describes what kind of initial labeling is used to train and test the predictive models. Furthermore, it describes the extraction of the pitch features, the intensity features, the duration features, and in some studies the lexical features.
- Pitch features: In the research of Wightman and Ostendorf (1994) the features of the F0 contour are calculated differently from the other studies. Per syllable, ratios of the max F0 to the mean F0 of the next syllable (max s / mean s+1) and to the previous syllable (max s / max s-1) are calculated. Further, the ratios of the maximum F0 and the minimum F0 to the mean F0 within a syllable (max s / mean s, min s / mean s) are calculated. An additional feature is the ratio of the final F0 to the mean F0 within a sentence.
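The per-syllable ratios can be sketched as follows, assuming each syllable is available as a list of voiced F0 samples in Hz. The function name and data layout are illustrative; the sentence-level ratio of final F0 to mean F0 is left out because its exact definition is not given here.

```python
from typing import Dict, List


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def pitch_ratio_features(syllables: List[List[float]], i: int) -> Dict[str, float]:
    """F0 ratio features for syllable i (feature names follow Table 3)."""
    cur = syllables[i]
    feats = {
        "max s / mean s": max(cur) / mean(cur),
        "min s / mean s": min(cur) / mean(cur),
    }
    # Neighbor-based ratios only exist when the neighbor exists.
    if i + 1 < len(syllables):
        feats["max s / mean s+1"] = max(cur) / mean(syllables[i + 1])
    if i > 0:
        feats["max s / max s-1"] = max(cur) / max(syllables[i - 1])
    return feats


f0 = [[100.0, 100.0], [100.0, 150.0, 200.0], [100.0, 100.0]]
print(pitch_ratio_features(f0, 1))
```

For the middle syllable of this toy contour both neighbor ratios are 2.0, reflecting a peak that is twice as high as its surroundings.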
In the research of Kompe (1997) and Kießling (1996) the acoustical features of the F0 contour are defined in the following way. They use both the syllable and the word as time intervals for the calculation of the acoustical features. The mean or the median, the maximum, the minimum, as well as the onset and the offset of F0 are calculated for each time interval. Furthermore, the regression coefficient of the F0 contour and the root mean squared differences between the F0 values and the respective values of the regression line are used as features in this research (see also table 3).
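The two regression-based features can be sketched with an ordinary least-squares line through the F0 samples of one time interval. Reading "regression coefficient" as the slope of that line is an assumption, as are the function name and the sample layout.

```python
import math
from typing import List, Tuple


def regression_features(times: List[float], f0: List[float]) -> Tuple[float, float]:
    """Return (slope, rms_residual) of a least-squares line through F0."""
    n = len(times)
    if n < 2 or n != len(f0):
        raise ValueError("need at least two paired samples")
    t_mean = sum(times) / n
    f_mean = sum(f0) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("time stamps must not all be identical")
    sxy = sum((t - t_mean) * (f - f_mean) for t, f in zip(times, f0))
    slope = sxy / sxx
    intercept = f_mean - slope * t_mean
    # Root mean squared difference between F0 and the regression line.
    residuals = [f - (intercept + slope * t) for t, f in zip(times, f0)]
    rms = math.sqrt(sum(r * r for r in residuals) / n)
    return slope, rms


slope, rms = regression_features([0.0, 1.0, 2.0, 3.0], [100.0, 110.0, 120.0, 130.0])
print(slope, rms)
```

The same function applies unchanged to the energy contour, for which the same two features are listed.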
Strom (1995) extracts 8 features for the F0 per 10 ms frame. First, the interpolated F0 contour is smoothed to three different degrees using bandpass filters. These 3 components of the interpolated F0 and their time derivatives are used as acoustical features for detecting accents. The 3 components describe the global or the more local behavior, depending on the bandpass filter. The time derivatives give information about the rate of change of the interpolated F0 and its 3 components (for more detail see table 3).
A disadvantage of this approach is that the features are calculated per 10 ms frame, so the measurements are independent of the onset of the syllable or the onset of the vowel. It is shown in the research of 't Hart et al. (1990) that the position of the onset of the pitch movement influences the perception of an accent. In the research of Strom (1995) 3 features for the energy are also calculated (see intensity features below).
In the research of Ten Bosch (1993) the aim was to classify the pitch movements according to the IPO intonation grammar. Experts labeled the pitch contour in terms of the IPO intonation grammar. The 5 features used by Ten Bosch consist of 5 pitch measurements at different times. The measurements are anchored on the vowel onset (t). The pitch is determined at the following points: t-1 + 60 ms, t0 - 60 ms, t0 + 60 ms, t1 - 60 ms and t1 + 60 ms, where t-1, t0 and t1 denote the vowel onset in the previous, current and next syllable, respectively. In this research the pitch measurements thus depend on the vowel onset, rather than being taken per frame as in Strom (1995). However, in Ten Bosch's research the features are calculated for the pitch only; other acoustical features such as energy and duration are ignored.
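The five measurement points can be derived from a list of vowel onset times as in the sketch below. Times are assumed to be in seconds, and the treatment of the first and last syllable (here: rejected) is an assumption, since it is not specified above.

```python
from typing import List

OFFSET = 0.060  # 60 ms, expressed in seconds


def measurement_times(vowel_onsets: List[float], i: int) -> List[float]:
    """Five pitch measurement times around the vowel onset of syllable i."""
    if not 0 < i < len(vowel_onsets) - 1:
        raise IndexError("syllable needs a previous and a next vowel onset")
    t_prev = vowel_onsets[i - 1]
    t_cur = vowel_onsets[i]
    t_next = vowel_onsets[i + 1]
    return [
        t_prev + OFFSET,  # t-1 + 60 ms
        t_cur - OFFSET,   # t0 - 60 ms
        t_cur + OFFSET,   # t0 + 60 ms
        t_next - OFFSET,  # t1 - 60 ms
        t_next + OFFSET,  # t1 + 60 ms
    ]


print(measurement_times([0.10, 0.35, 0.60], 1))
```

The F0 value at each returned time would then be read from the pitch contour, giving the 5 input features per syllable.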
- Intensity features: Wightman and Ostendorf (1994) use the mean energy in the syllable as the intensity feature.
In the research of Kompe (1997) and Kießling (1996) the mean or median energy, the maximum energy and the position of the maximum energy relative to the end of the time interval are calculated. The regression coefficient and the root mean squared difference between the energy and the regression line are also used as features.
In the research of Strom (1995), 3 energy features (the nasal band 30-300 Hz, the sonorant band 300-2300 Hz and the fricative band 2300-6000 Hz) are calculated per 10 ms frame.
- Duration features: In the research of Wightman and Ostendorf (1994) the pre-boundary lengthening is measured via the mean normalized duration of the syllable. The difference between the mean normalized duration of the syllable and the syllable onset is also determined and used as an input feature. The pause duration is measured and used as a feature as well.
Kompe (1997) and Kießling (1996) use the average normalized speaking rate for one utterance as a feature. The pauses in the utterance are neglected. The average phone duration is an additional duration feature.
- Lexical features: Wightman and Ostendorf (1994), Kompe (1997) and Kießling (1996) use lexical features as flags, indicating whether a given syllable has lexical stress and whether it occurs in word-final position or not. Kompe (1997) and Kießling (1996) also use flags to identify the class of the phone in syllable nucleus position.
The vowels in the syllables with lexical stress will be specially marked in the label file (see figure 2). Also the position of each phoneme in a word, the positions of each syllable in a word (such as word-final) and the position of each word in a sentence can be estimated from this segmentation file.
In our research we intend to choose the following features:
*Pitch features:
For the pitch, per-syllable ratios of the max F0 to the mean F0 of the next syllable and to the previous syllable (max s / mean s+1 and max s / max s-1) will be calculated and used as features. The ratios of the maximum F0 and the minimum F0 to the mean F0 within a syllable (max s / mean s and min s / mean s) will also be calculated and presented to an artificial neural network for classification.
*Intensity features:
The mean intensity of the lexically stressed vowel, normalized for vowel type, is a promising feature for a predictive model. It might be useful to use the ratios to the next and previous lexically stressed syllables as well. The normalization could be done not only for vowel type but also for lexical stress. For that, the mean intensity of all vowel types in stressed and in unstressed position will be calculated.
*Duration features:
Pause duration between words is a possible duration feature. The duration of the syllable will also be calculated and used as a feature. The duration of the vowel, corrected by the mean duration of the vowel type, is an optional feature. For the duration features, too, it may be useful to use ratios to the next and previous syllables. The normalization of the duration could be done not only for vowel type (long versus short vowels) but also for position in the sentence. In that case a mean duration for all vowel types in final and in non-final position is calculated.
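The normalization for vowel type proposed for the intensity and duration features can be sketched as follows. Whether the correction is a ratio or a difference is not specified above; the sketch assumes a ratio to the mean of the vowel type, and the vowel labels and values are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize_by_vowel_type(tokens: List[Tuple[str, float]]) -> List[float]:
    """Divide each vowel's measurement by the mean for its vowel type.

    `tokens` holds (vowel_type, value) pairs, where the value is for
    example a duration in ms or a mean intensity.
    """
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for vowel, value in tokens:
        sums[vowel] += value
        counts[vowel] += 1
    means = {vowel: sums[vowel] / counts[vowel] for vowel in sums}
    return [value / means[vowel] for vowel, value in tokens]


durations_ms = [("a", 100.0), ("a", 200.0), ("i", 50.0)]
print(normalize_by_vowel_type(durations_ms))
```

Normalizing additionally for lexical stress or for sentence position would amount to using a combined key, such as the vowel type together with a stressed or final flag, in place of the vowel type alone.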
In future research, the various input features will selectively be added in such a way that they will introduce as much relevant information as possible. We expect that this will increase the recognition rate more than just by introducing a great variety of input features. The relation between the acoustical input features can be studied with the help of an artificial neural network. The trained weights of the artificial neural network can be analyzed and interpreted (see Streefkerk et al., 1997).