ABSTRACTS of PUBLICATIONS in 1993

Bergem, D.R. van (1993). 'Acoustic vowel reduction as a function of sentence accent, word stress, and word class' . Speech Communication, 12, 1-23.
(available as a Postscript document, 1222k)

The effect of sentence accent, word stress, and word class (function words versus content words) on the acoustic properties of 9 Dutch vowels in fluent speech was investigated. A list of sentences was read aloud by 15 male speakers. Each sentence contained one syllable of interest. This could be a monosyllabic function word, an unstressed syllable of a content word, or a stressed syllable of a content word. The same syllable occurred in all three conditions. Sentence accent was manipulated with questions that preceded the sentences. A total of 3465 vowels were segmented from the syllables and analysed. It was found that all three factors mentioned above had a significant effect both on the steady-state formant frequencies (F1 and F2) and on the duration of the vowels. Word stress and word class had a stronger effect on the vowels than sentence accent. A listening experiment showed the perceptual significance of the acoustic measurements. It appeared that spectral vowel reduction could be better interpreted as the result of an increased contextual assimilation than as a tendency to centralize. We also studied changes in the dynamics of the formant tracks due to the experimental conditions. It was found that formant tracks of reduced vowels became flatter, which supports the view of an increased contextual assimilation. Three simple models of vowel reduction are discussed.

Bergem, D.R. van (1993). 'On the perception of acoustic and lexical vowel reduction' . Proceedings Eurospeech'93, Berlin.
(available as a Postscript document, 1250k)

The present study was designed to investigate how well listeners are able to unambiguously categorize an unstressed vowel in a word as either a full vowel or a schwa. It was found that listeners disagree in many cases on the assignment of a vowel to either of these categories. This suggests that listeners cannot properly distinguish between acoustic reduction (the loss of spectral quality of a full vowel) and lexical reduction (the substitution of a full vowel with a schwa). Other points of interest in the present study were the frequency of occurrence of words and speech styles; both were found to have a considerable influence on the process of vowel reduction.

Bezooijen, R. van & Pols, L.C.W. (1993). 'Evaluation of text-to-speech conversion for Dutch' . In: V.J. van Heuven & L. C.W. Pols (Eds.), Analysis and synthesis of speech. Strategic research towards high-quality text-to-speech generation, Mouton de Gruyter, Berlin, 339-360.

Boersma, P. (1993). 'An articulatory synthesizer for the simulation of consonants' . Proceedings Eurospeech'93, Berlin.

Bosch, L.F.M. ten, Wang, X. & Pols, L.C.W. (1993). 'Duration modelling with Hidden Markov models' , J. Acoust. Soc. Am. (A).

Clement, C.J. & Os, E.A. (1993). 'Development of vocalisations of severely hearing impaired infants' , Third Congress of the International Clinical Phonetics and Linguistics Association, Helsinki, Finland.

Dijk, J.S.C. van (1993). 'Combinations of tone' . Symposium Biophysics of hair cell sensory systems (Paterswolde), 325.

Heuven, V.J. & Pols, L.C.W. (Eds.) (1993). Analysis and synthesis of speech. Strategic research towards high-quality text-to-speech generation, Mouton de Gruyter, Berlin, XXII + 420 pp.

Institute of Phonetic Sciences (Ed.) (1993). 'Letteren zonder fonetiek is taal zonder spraak' . Report of the Institute of Phonetic Sciences Amsterdam 121, 16 pp. (in Dutch).

Koopmans-van Beinum, F.J. (1993). 'Speech: individual expression of universal development?' , To be published in Antwerp Papers in Linguistics. (in Dutch: 'Spraak: individuele expressie van universele ontwikkeling?').

Koopmans-van Beinum, F.J. & Os, E. A. den (1993). 'Speech development: perception, production, and interaction in the first year of life' . To appear in Stem- Spraak- en Taalpathologie. (in Dutch: 'Spraak in ontwikkeling: perceptie, produktie en interaktie in het eerste levensjaar' ).

Koopmans-van Beinum, F.J. & Wieringen, A. van. (Eds.) (1993). 'Profile of the Institute of Phonetic Sciences 1993' , Report of the Institute of Phonetic Sciences Amsterdam 122, 60 pp.

Kuijpers, C.T.L. (1993). 'Temporal coordination in speech development. A study on voicing contrast and assimilation of voice' . Ph.D. thesis University of Amsterdam, 165 pp.

In this thesis we study several temporal aspects in the speech development of Dutch children, concentrating on the voiced-voiceless distinction and on assimilation of voice. The most important research question is: how does the process of temporal coordination develop in the speech of children, and, above all, how does it develop towards the adult model? The term 'temporal coordination' refers to how segment durations are realized, and how the durational aspects of segments are influenced by each other. This theme returns each time in the subsequent chapters.
First, we pay attention to an existing language-independent model which presents the development of sound production in infants. This model shows that, during the first year of life, phonation and articulation interact. We consider the interaction of voicing and non-voicing as an early process of temporal coordination. The various experiments described in this thesis show that temporal coordination remains an important process in speech development. Actually, in each experiment we study the process from a different angle, and the experiments are characterized by several 'underlying aspects' (Chapter I).
Next, we argue for a natural setting in phonetic experimental research with children. This is illustrated by means of three pilot studies. The first two pilot studies relate to the difference between imitative and spontaneous speech; we concentrate on syllable durations, vowel durations, and initial voiced and voiceless stops in the speech of two Dutch children at the ages of 2;3 and 2;6 (in years; months). In the third pilot study we go further into the development of the voicing contrast of initial /p b/ vs. /t d/. We study a group of children between the ages of 1;5 and 3;9. The results show that the realization of /b/ and /d/ is often characterized by the insertion of a schwa-like sound. As a result, the stops are in fact in intervocalic position.
These observations give rise to the main research questions: how do the temporal parameters of intervocalic stops develop (in speech production and perception), and how do they develop in two-obstruent sequences which are characterized by assimilation of voice? Before passing on to a description of the production and perception experiments, we pay attention to the mechanism of voice production as well as to several measurement procedures that are generally used to chart voicing. In addition, we concentrate on the acoustic parameters that characterize the intervocalic voicing contrast in Dutch, and on the measurement procedure that we actually chose (Chapter II). In the production experiment we investigate the intervocalic voicing contrast (/b/ vs. /p/ and /d/ vs. /t/), and the vowel contrast (short vs. long). To this end, we analyse closure duration, burst duration, and preceding vowel duration in spontaneous but controlled speech utterances of four-year-olds, six-year-olds, twelve-year-olds, and adults. In adult speech, closure durations of voiced stops are relatively short, whereas those of voiceless stops are relatively long. Furthermore, the adult speakers display a 'temporal compensation': vowels before voiced consonants are relatively long, and vowels before voiceless consonants are relatively short. In both contexts, the total duration of the vowel-consonant (VC) sequence remains the same.
Several developmental tendencies are deduced from this part of the study. With respect to both voicing and vowel contrast, we express the durational contrast first in terms of a 'relative contrast' (ratio voiced/voiceless, and ratio short vowel/long vowel). Our data show that two contrastive sounds become more and more distinct with age, and we call this an increasing 'distinctivity'. We express this developmental aspect by means of a measure of distinctivity which is based upon the overlap of two frequency distributions: the smaller the overlap, the higher the distinctivity. The temporal compensation takes form between the age of four and six, and it develops from a sequential coordination (no temporal compensation) towards a progressive coordination (incomplete compensation), to finally reach the complete coordination (total compensation).
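The overlap-based measure of distinctivity described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the definition used here (one minus the shared area of two relative-frequency histograms), the function name `distinctivity`, the bin width, and the duration values are all assumptions made for illustration only.

```python
def distinctivity(durations_a, durations_b, bin_width=10.0):
    """Return 1 minus the overlap of two relative-frequency histograms.

    Assumed definition: the thesis only states that distinctivity rises
    as the overlap of the two frequency distributions shrinks.
    """
    def histogram(values):
        counts = {}
        for value in values:
            bin_index = int(value // bin_width)
            counts[bin_index] = counts.get(bin_index, 0) + 1
        # Convert counts to relative frequencies so both histograms sum to 1.
        return {b: c / len(values) for b, c in counts.items()}

    hist_a = histogram(durations_a)
    hist_b = histogram(durations_b)
    # The overlap is the area shared by both histograms.
    overlap = sum(
        min(hist_a.get(b, 0.0), hist_b.get(b, 0.0))
        for b in set(hist_a) | set(hist_b)
    )
    return 1.0 - overlap


# Hypothetical closure durations in msec (not data from the thesis).
voiced = [40, 45, 50, 55]
voiceless = [90, 95, 100, 105]
print(distinctivity(voiced, voiceless))
```

Under this definition, two fully separated distributions give a value of 1 and two identical distributions give 0, so the value would be expected to rise with age if the contrast becomes more distinct.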
Furthermore, the four-year-old children realize the voiceless stop /k/ with a relatively short closure duration as compared to /p/ and /t/. We attempt to explain this fact by the lack of the velar contrast /k/ vs. /g/ in Dutch, as well as by some physiological factors (Chapter III).
Another research question relates to the perception of the word-medial voicing contrast. How does it develop? Is there a parallelism between production and perception, or do children perceive more or better than they can produce? To answer these questions, we study the same age groups and we use natural but manipulated speech. The silent interval is manipulated from 10 msec up to 130 msec which, in general, results in a categorical perception of voiced and voiceless stops. We are forced to use nonsense words, and we invent a game as an identification experiment.
The statistical analyses are based upon curve fitting: the data are transformed to z-scores and a regression analysis is carried out. The most important differences are found between the two younger age groups (4 and 6) and the two older age groups (12 and adults). With respect to the phoneme boundary (the 50% crossover), which tends to decrease with age, the data show no significant differences between the age groups. The most important difference between the younger and the older age groups concerns the phoneme boundary width (the interval between the 25% and 75% point on the identification function). This can also be deduced from the increasing steepness of the function.
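As a rough illustration of the curve-fitting step, the sketch below transforms identification proportions to z-scores, fits a straight line by least squares, and reads off the phoneme boundary (50% crossover) and the boundary width (25% to 75% interval). The function name `fit_identification`, the clipping of extreme proportions, the 20 msec step size, and the response proportions are assumptions; the abstract does not specify these details.

```python
from statistics import NormalDist


def fit_identification(intervals_ms, prop_voiceless):
    """Probit-style fit: z-transform proportions, then regress z on duration."""
    std_normal = NormalDist()
    # Clip proportions so the inverse CDF stays finite (assumed handling of 0 and 1).
    zs = [std_normal.inv_cdf(min(max(p, 0.01), 0.99)) for p in prop_voiceless]

    # Ordinary least-squares regression of z on silent-interval duration.
    n = len(intervals_ms)
    mean_x = sum(intervals_ms) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in intervals_ms)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(intervals_ms, zs))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x

    # z = 0 corresponds to the 50% crossover of the identification function.
    boundary = -intercept / slope
    # Distance between the 25% and 75% points; a steeper slope gives a smaller width.
    width = (std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)) / slope
    return boundary, width


# Hypothetical proportions of "voiceless" responses per silent interval (msec).
intervals = [10, 30, 50, 70, 90, 110, 130]
proportions = [0.03, 0.08, 0.25, 0.52, 0.78, 0.93, 0.97]
boundary, width = fit_identification(intervals, proportions)
print(round(boundary, 1), round(width, 1))
```

A smaller width corresponds to a steeper identification function, which is the pattern the abstract reports for the older age groups.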
As children grow older, they respond more and more accurately to the durational difference that brings about the voiced-voiceless distinction. Apparently, the acoustic difference they need to perceive the voicing distinction diminishes. We show that this age-dependent, auditive acuity relates to the consistency of judgments in the different age groups. The adult listeners behave as a homogeneous group, whereas the young children display a variable perceptual behaviour. On the basis of our data, we show that the relation between production and perception of the voicing contrast is characterized by a parallelism in development. In both experiments distinctivity of the phonemic contrast increases with age (Chapter IV).
Furthermore, we examine the development of assimilation of voice in intervocalic two-obstruent clusters. Assimilation of voice occurs rather frequently in speech of Dutch adults (e.g. 'stropdas' (tie) pronounced as [strobdas]). In general, this phenomenon is represented by an ordered set of phonological rules which indicate when voice assimilation occurs, when it is progressive (voiceless cluster), and when it is regressive (voiced cluster). Until now, phonetic research on assimilation of voice in Dutch has only been performed with adults, and we do not know whether voice assimilation occurs in the speech of children.
We analyse spontaneous but controlled speech utterances of six-year-olds, twelve-year-olds, and adults. The four-year-old children did not participate in the experiment for practical reasons. All clusters are sequences of two obstruents, viz. stop-stop, fricative-stop, stop-fricative, and fricative-fricative. In addition, we examine the clusters in compound words (e.g. 'voetbal' (football)), as well as across a word boundary (e.g. 'groot beest' (big animal)). We call this a difference in 'linguistic context'. With respect to the children's speech utterances, we make use of adjusted criteria in order to classify the different types of assimilation, and to compare the number of occurrences with those found in adult speech.
The results show that the clusters with a rightmost fricative are always assimilated progressively, by the children as well as the adults. This confirms the phonological rule. The other phonological rules, however, are less clearly present in the children's data. Concerning the fricative-stop and stop-stop clusters, the most important difference between the groups of children (6 and 12) and the adults relates to the type of assimilation. Children do not assimilate less than adults, but they assimilate differently. Whereas regressive assimilation is predominant in the adult speech utterances, progressive assimilation predominates in the speech of both groups of children. The children tend to devoice the two-obstruent clusters irrespective of the linguistic context. From a phonetic point of view, children do not seem to be able to anticipate the voicing of the rightmost obstruent, and they possibly cannot realize or maintain vocal cord vibration during a relatively long obstruction. Consequently, the cluster becomes voiceless. From a phonological point of view, it seems that children generalize the voiceless character to clusters in compound words as well as in two-word items, i.e. they insert the unmarked value [-voice]. We interpret the children's data within an autosegmental framework, which leads to a parsimonious description of children's voice assimilation and the possible learning process (Chapter V).
In the final chapter we integrate the main findings of the experiments in a general discussion. We indicate the limitations of the present research, but we also make suggestions for future research. This is done in conformity with the underlying aspects, as described in the first chapter. All comments are arranged according to language-related and person-related aspects. All in all, we hope that the many experimental choices and findings reported in this thesis will contribute to further firmly-based phonetic research (Chapter VI).

Kuijpers, C.T.L. (1993). 'Temporal aspects of the voiced-voiceless distinction in speech development of young Dutch children'. Journal of Phonetics, 21, 313-327.

Several temporal phenomena have been examined in the speech of four-year-old and six-year-old Dutch children. Intervocalic closure and burst durations of voiced and voiceless stops, as well as preceding vowel durations, were compared to study developmental patterns. Although the younger children produce longer segment durations, relative differences in voiced and voiceless closure and burst duration seem to correspond between the two age groups. In the same way, relative durational differences between phonologically short and long vowels are produced in an adult-like way by both of these groups of children. However, the temporal adjustment between vowel and consonant in the VC sequence displays a developmental trend. Although adult-like co-ordination of vowel and closure duration in the VC sequence with voiced context has been acquired at the age of four, only the older children have a relative shortening of the vowel in the voiceless context. The durational differences can be interpreted as evidence of development from a "syllable-independent" mechanism towards a "syllable-integrated" mechanism with increase of consonantal influence across the syllable boundary.

Laan, G.P.M. & Bergem, D.R. van (1993). 'The contribution of pitch contour, phoneme durations and spectral features to the character of spontaneous and read aloud speech'. Proceedings Eurospeech'93, Berlin.
(available as a Postscript document, 771k)

The separate contribution of the intonation contour, phoneme durations, and spectral features of an utterance to the speech style character was studied by means of a listening experiment. Speech was used from 2 male speakers who each told "spontaneously" something about themselves and afterwards read out their own transcribed text. Utterances were selected that were identical in wording and that were fluently spoken in both speech styles. The prosodic features pitch, duration, and energy were systematically exchanged between the two speech styles by means of TD-PSOLA. Subjects in the listening experiment were asked to classify the stimuli as either spontaneous or read. It appeared that intonation, phoneme durations, and spectral features all contain cues to a particular speech style, albeit that their separate influence does not dominate over the rest of the information sources of a speech style.

Leeuwen, H.C. van & Lindert, te R. (1993). 'Speech maker: a flexible and general framework for text-to-speech synthesis and its application to Dutch' . Computer Speech and Language, 7 (2), 149-167.

Pols, L.C.W. & Son, R.J.J.H. van (1993). 'Acoustics and perception of dynamic vowel segments'. Festschrift in honor of prof. H. Fujisaki, Speech Communication, 13.

Son, R.J.J.H. van & Pols, L.C.W. (1993). 'How does speaking rate influence vowel articulation' . In: V.J. van Heuven & L. C.W. Pols (Eds.), Analysis and synthesis of speech. Strategic research towards high-quality text-to-speech generation, Mouton de Gruyter, Berlin, 171-191.

Son, R.J.J.H. van & Pols, L.C.W. (1993). 'Vowel identification as influenced by vowel duration and formant track shape'. Proceedings Eurospeech'93, Berlin.
(available as a Postscript document, 771k)

Synthetic vowels were used to investigate how listeners use vowel duration and formant track shape to determine vowel identity. The synthetic vowels had level or parabolically shaped formant tracks and variable durations. They were presented in isolation as well as in synthetic Consonant-Vowel-Consonant syllables. There was no evidence of perceptual compensatory overshoot for expected target-undershoot due to token duration or context. The only asserted effects of duration and context were in the number of long- and short-vowel responses. There was also no evidence that the listeners used the formant track shape or slopes independently to identify the synthetic vowel tokens. Tokens with curved formant tracks were mainly identified on their formant offset frequencies.
Keywords: Vowel perception, perceptual-overshoot.

Son, R.J.J.H. van & Pols, L.C.W. (1993). 'Spectro-temporal features in vowel segments' . Ph.D. thesis, IFOTT Studies in language and language use 3, University of Amsterdam.

In this thesis we have investigated several aspects of the spectro-temporal structure of vowel segments, both concerning vowel production as well as vowel perception. Chapter 1 contains a summary of current models on vowel production and perception. Models of vowel pronunciation try to explain why vowel realizations vary so much in natural speech. It is known that vowel production is influenced in highly systematic ways by context, stress, and speaking style (among others). The classical explanation is that of the target-undershoot model. This model states that vowel articulation is limited by the speed of the articulators (e.g., jaw, tongue, lips). Each vowel has a unique target-position for each of the articulators which will produce the ideal, or canonical, realization of that vowel. When vowel realizations are very long, there is ample time for the different articulators to reach their respective target positions. However, when vowel duration is short and the context forces the articulators to cover relatively large distances, there is not enough time and the articulators are stopped short of their targets. The resulting vowel realizations show "undershoot" in their articulatory movements as well as in the resulting formant frequencies, hence the name of the model: target-undershoot.
The classical quantitative study of Lindblom (1963) on the relation between vowel duration and formant-undershoot is discussed in depth. It showed that formant-undershoot increased exponentially with a decrease in vowel duration. However, subsequent studies gave ambiguous results. Some studies did find clear evidence for articulatory- and formant-undershoot. Others showed that there were numerous cases where no relation between vowel duration and target-undershoot could be found. In particular, changes in stress and speaking style could bring about changes in duration that were not accompanied by changes in target-undershoot. In our opinion, these conflicting results can be explained by assuming that target-undershoot is planned by the speaker. In this view, the undershoot serves a purpose that depends on factors like context, prosody, and speaking style. From this it follows that, irrespective of vowel duration, the undershoot itself should not change if the purpose of the undershoot does not change and vice versa.
Considering the conflicting reports in the literature, it seems that any test of the target-undershoot model should introduce changes in vowel duration without changing stress, speaking style, or other prosodic factors that were known to cause ambiguous results. In this study, we settled for changes in speaking rate. A long, meaningful text, read at a normal and at a fast rate, would induce a speaker to use the same stress assignments and the same "style" of speaking, irrespective of reading speed. At the same time, a difference in speaking rate would change the duration of all the vowels. In this study (chapters 2-4), we used all realizations of seven different vowels and some realizations of the schwa (/@/). If vowel duration could control formant-undershoot all by itself, then an increase in speaking rate should induce an increase in undershoot. However, if formant-undershoot is planned, then a change in speaking rate should not necessarily result in a change in formant-undershoot.
In chapter 2, we measured formant frequencies in the vowel kernel. Vowel realizations uttered at the normal speaking rate were compared to the corresponding realizations uttered at the fast speaking rate. No spectral vowel reduction was found that could be attributed to a faster speaking rate. There was also no change in the amount of coarticulation or stress-induced reduction as a result of speaking rate. The only systematic effect was a higher F1 value in fast-rate speech irrespective of vowel identity. This possibly suggests a generally more open articulation of vowels, speaking louder, or some other general change in speaking style by our speaker when he speaks fast.
In chapter 3 we looked at the effects of speaking rate on vowel formant track shape, using the same material as in chapter 2. The formant track shape was assessed on a point-by-point basis, using 16 samples at the same relative positions in the vowels. Differences in speaking rate only resulted in the same uniform change in F1 frequency already found in chapter 2. Within each speaking rate, there was only evidence for a weak leveling off of the F1 tracks of the open vowels /A a/ with shorter durations. When considering sentence stress or vowel realizations from a more uniform, alveolar-vowel-alveolar context, these same conclusions were reached.
In chapter 4 we again looked at the effects of speaking rate on formant track shape. This time we used a more elaborate method for assessing formant track shape. Legendre polynomial functions were used to model and quantify the shape of time normalized formant tracks. No differences in these normalized formant track shapes were found either that could be attributed to differences in speaking rate. A uniform higher F1 frequency in fast-rate speech relative to normal-rate speech was found. Within each speaking rate, there was only evidence of a weak leveling off of the F1 tracks of the open vowels /E A a/ with shorter durations. Again, as in chapter 3, separately inspecting vowel realizations from a more uniform, alveolar-vowel-alveolar context, did not alter our conclusions.
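The idea of quantifying track shape with Legendre polynomials can be sketched as follows. This is a minimal illustration, not the procedure used in the thesis: the polynomial degree (2), the least-squares fit on equally spaced samples, the use of 16 samples per track, and the example frequencies are all assumptions.

```python
def legendre_basis(x, degree):
    """Return [P_0(x), ..., P_degree(x)] using Bonnet's recursion."""
    values = [1.0, x]
    for n in range(1, degree):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: degree + 1]


def fit_legendre(track, degree=2):
    """Least-squares Legendre coefficients of a time-normalized formant track.

    The samples are assumed to be equally spaced; time is mapped to [-1, 1].
    """
    n = len(track)
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    rows = [legendre_basis(x, degree) for x in xs]
    m = degree + 1
    # Normal equations (B^T B) c = B^T y for the basis matrix B.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    b = [sum(r[i] * y for r, y in zip(rows, track)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for row in reversed(range(m)):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, m))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


# Hypothetical F1 track (Hz), 16 samples: rises towards the vowel centre, then falls.
times = [-1.0 + 2.0 * i / 15 for i in range(16)]
f1_track = [650.0 - 100.0 * t * t for t in times]
level, slope, curvature = fit_legendre(f1_track)
print(level, slope, curvature)
```

In such a description the zeroth-order coefficient reflects the overall level of the track, the first-order coefficient its overall slope, and the second-order coefficient its curvature, so tracks of different durations can be compared once time has been normalized.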
The target-undershoot model of vowel production inspired a complementary model of vowel perception (Lindblom and Studdert-Kennedy, 1967). As vowel formant tracks will systematically undershoot the canonical target values in natural speech, it was suggested that listeners would compensate for this undershoot automatically by systematically overshooting the formant frequencies actually reached in perception, i.e. perceptual-overshoot. Early studies with synthetic speech did indeed find this kind of perceptual-overshoot. However, it proved rather difficult to demonstrate the existence of an automatic mechanism for perceptual-overshoot in natural speech.
At the moment, there are two classes of models on vowel perception. The first class consists of models with dynamic-specification. In these models it is assumed that listeners use dynamical information from the Consonant-Vowel and/or Vowel-Consonant transitions to improve the recognition of the stationary vowel nucleus. Perceptual-overshoot is just one of such models. The second class of models is based on the assumption that a single spectral cross-section of the kernel of a vowel realization contains all information necessary to recognize it. In these models the vowel on- and offset transitions are of minor importance in vowel recognition.
The difference between these two types of models is the position of the Consonant-Vowel transition (in the vowel on- and offset). Is it used in vowel recognition, as is stated by models using dynamic-specification, or is it not, as stated by target models? There is evidence for perceptual-overshoot in synthetic speech. It is also known that presenting syllables without a vowel kernel, i.e. with only the vowel on- and offset transitions, hardly impairs vowel recognition. Still, there is no indisputable proof that the recognition of isolated, monophthongal vowel segments is improved by adding dynamical information to the formant tracks. Exactly such an improvement is expected when listeners use dynamic-specification of vowels.
In natural speech, the amount of variation in durations, vowel formant frequencies and track shapes is limited. These various types of variation are furthermore strongly correlated. It is therefore better to use synthetic speech, for which it is possible to control all features. With synthetic speech, it is also possible to detach formant track shape from formant frequency. This way, the effects of formant track shape can be studied independently of vowel identity and vowel duration. We therefore chose to use synthetic speech to study how vowel duration and formant track shape influence vowel identity. In particular, we looked for any evidence for perceptual-overshoot. The result of this study is presented in chapter 5 (see below). In chapter 6 we took a closer look at the existing literature in order to try to find an explanation for the disagreement between our results and those presented in several earlier papers.
In chapter 5 we used synthetic vowels to investigate whether listeners use vowel duration and formant track shape to determine vowel identity. The synthetic vowels had level or parabolically-shaped formant tracks and variable durations. They were presented in isolation as well as in synthetic CVC syllables. There was no evidence of perceptual compensation for expected target-undershoot due to token duration or context. The only asserted effects of duration and context were in the number of long- and short-vowel responses. There was also no evidence that the listeners used the formant track shape or slopes independently to identify the synthetic vowel tokens. Tokens with curved formant tracks were generally identified near their formant offset frequencies.
The results of chapter 5 contradicted claims made in the literature about the way listeners use dynamical information to identify vowel realizations. The literature on vowel perception itself also contains contradictory claims regarding the use of information from CV-transitions in vowel recognition. Our own experiments showed that the information in formant track shape was not always used to compensate for formant-undershoot. In chapter 6 a re-evaluation of the literature is attempted. A closer study of the most relevant papers shows that evidence for compensatory processes, i.e. perceptual-overshoot and dynamic-specification, was only found when vowel realizations from different, and appropriate, contexts were contrasted. Some studies show that vowel recognition deteriorated when vowel segments were presented out of context. Together, these facts suggest that the presence of an appropriate context is essential for any perceptual compensation of coarticulatory changes. This speculation might be used as a starting hypothesis for further research on vowel perception.
Finally, in chapter 7 we summarize and discuss our findings. We recapitulate the methods used in chapters 2 to 4 to study the effects of speaking rate on formant-undershoot. We argue that, under the circumstances used, any excess undershoot due to an increase in speaking rate should have been detectable, but did not show up. We therefore conclude that, for our speaker, speaking rate did not influence the amount of vowel formant-undershoot or the formant track shape. Therefore, we can conclude that changes in vowel duration alone do not change the amount of target-undershoot. The listening experiments presented in chapter 5 showed that our listeners did not use a perceptual-overshoot mechanism or dynamic-specification to help them identify the synthetic vowel tokens. In general, they seemed to use the offset part of each vowel realization to identify it.
We therefore conclude that listeners do not automatically and unconditionally compensate for the formant-undershoot that can be predicted from the formant track shape.

Stelt, J.M. van der (1993). 'Intersubjectivity: a forgotten aspect in communication development?' . To be published in Antwerp Papers in Linguistics (in Dutch: 'Intersubjektiviteit: een vergeten aspect in de communicatie-ontwikkeling').

Stelt, J.M. van der (1993). 'Finally a word: a sensori-motor approach of the mother-infant system in its development towards speech' . Ph.D. thesis, IFOTT Studies in language and language use 4, University of Amsterdam.

This thesis presents an approach to mother and infant as a sensori-motor system which develops into a speech communication system. The approach focuses on three fundamental characteristics of human communication systems: intersubjectivity, intentionality, and turntaking. These are present right after birth, although not yet in forms generally known in adult communication systems.
In normal mother-infant interaction both partners adapt their behaviours to create a context of social exchanges. These set the stage for the further development of speech communication. Any abnormalities in this development without obvious physical or mental causes (such as a hearing loss or Down Syndrome) are proposed to originate from early mother-infant interactions.
Two normal mother-infant pairs with different interaction patterns were chosen as test-cases for the approach. The development of these pairs appeared clearly to differ during the research period (from birth to the second birthday of the infants).

In the first chapter the reader is introduced to human mother-infant interaction in its unique configuration. The fact that mothers and infants succeed in developing conversations that are also understandable to other humans leads to the idea of underlying processes that are generally present in mother-infant interaction. Psycholinguistic and psychobiologic literature is presented and related to publications on speech production and language development originating from linguistics, medicine, ethology, and primate evolution. It is concluded that mother and infant form a system that cannot be fully described by the characteristics of these two individuals. The two persons mutually regulate each other's behaviour, to an extent not yet fully understood; this mutual regulation is called coping.
In previous research in collaboration with Koopmans-van Beinum (1979, 1986), I have described speech motor landmarks in infant sound production that are basic to adult sound production. In the framework of the Netherlands Prevention Fund project, I have observed mother-infant interactions by focusing on their movements during the first two years. These experiences have led to some working hypotheses on the development of speech in infants and on styles of interaction.
In the present approach, the literature data and the practical experiences have merged. Mother-infant interaction was evaluated as early as possible, in single pairs, and in naturalistic home situations. Only the movements were described because such an approach is independent of language and the interpretation of the observers. The three common characteristics of human communication systems are treated in separate chapters. However, intersubjectivity, intentionality, and turntaking are related: intentionality presupposes an intersubjective orientation towards another person, while turntaking occurs upon transmitted intentions.

The second chapter introduces two mother-girl pairs, their medical histories over the two years, psycho-social characteristics (like infant temperament and scores on the Bayley Scales of Infant Development), and linguistic scores. The differences between the two pairs at the end of the observation period, i.e. when the children are two years old, are assumed to result from the different interaction patterns already present soon after birth.
Video-recording procedures, equipment, frequency, and durations are presented. These components originate from a Netherlands Prevention Fund project. Subsequently, the video-recordings of the two pairs made during the two years were transcribed in detail by means of a micro-analytic transcription system for movements. All movements of the mother and the infant that occurred during five minutes per recording were coded with regard to the body parts moving and the sounds produced. This results in a 16-channel behavioural score, similar to a musical composition for different instruments. The transcription was computer-assisted and performed by a single transcriber. Consistency of the transcriber was checked and appeared to be satisfactory (84% as a mean).
Not all movements made by one partner are actually seen by the other. For example, when the infant is looking at the camera, she surely will not see a smile movement on the face of the mother. The procedure to decide upon the classification 'transmitted or not transmitted movements' is described as a sensori-motor transmission model, in which memory for previous movements is neglected.
A computer program (FP) used for counting was adapted to measure durations. This enabled the calculation of the overall and median durations of specific codes per recording. The micro-analytic data were processed by the program PROGRAAF. This program can select specific channels or codes, indicated for mother or infant, from the original transcriptions. In this manner the decomposed movement patterns in the 16 channels can be compiled selectively to obtain more complex behavioural patterns.
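
As a rough illustration of the kind of bookkeeping described above, the following Python sketch computes the number, overall duration, and median duration of each code, and selects channels from a transcription. It is not the FP or PROGRAAF program: the event layout (channel, code, onset, offset in seconds), the channel and code names, and the function names are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import median

# Hypothetical event format: (channel, code, onset_s, offset_s).
# Channel and code names are invented for illustration only.
events = [
    ("infant_gaze", "face", 0.0, 4.0),
    ("infant_gaze", "away", 4.0, 6.0),
    ("infant_gaze", "face", 6.0, 8.0),
    ("mother_voice", "sound", 1.0, 2.5),
    ("mother_voice", "sound", 5.0, 5.5),
]

def duration_summary(events):
    """Return {(channel, code): (count, overall duration, median duration)}."""
    durations = defaultdict(list)
    for channel, code, onset, offset in events:
        durations[(channel, code)].append(offset - onset)
    return {key: (len(d), sum(d), median(d)) for key, d in durations.items()}

def select_channels(events, channels):
    """Keep only the events that belong to the requested channels."""
    wanted = set(channels)
    return [event for event in events if event[0] in wanted]

summary = duration_summary(events)
mother_only = select_channels(events, ["mother_voice"])
```

Selecting channels first and summarising afterwards mirrors the idea of compiling more complex behavioural patterns from the decomposed channels.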

Intersubjective tuning is discussed in the third chapter. It is the first characteristic of mother-infant communication systems, and stands for the mutual notion that another human being is present. In the literature on mother-infant interaction it is described in positive terms like togetherness and bonding. In a way, intersubjectivity is already present before birth, i.e. when the mother is thinking of the baby as a new person. Our approach employs the transcriptions of movements of mother and infant, and thus intersubjectivity must be translated into movements in which mother and infant mutually orient towards each other.
Three forms of tuning by means of the visual and vocal-aural channels were selected for evaluation. A comparison was made per recording and per pair of the percentage of time and the frequency of the instances of (1) the mother and infant looking at each other's face, (2) their simultaneously producing sounds, and (3) sound production being simultaneous during face-to-face contact.
These three forms of intersubjective tuning appeared to be different for the two pairs in different periods of the development. In one pair (Claire and mother EVE) the presence of face-to-face contact appeared systematically to be less frequent than in the other pair (Fanny and mother SUSAN). Simultaneous sound production was more frequent for Claire and EVE in the first five recordings only. The frequency of vocalisation in unison during face-to-face contact appeared to be higher for Claire and EVE in the first five months, and lower than for Fanny and SUSAN after the first five months.
The impact of these results on the development of speech communication is discussed. After the fifth month Claire and EVE used the two channels more selectively than Fanny and SUSAN, who preferred to use the two channels simultaneously. In a book-reading situation, Claire and EVE no longer looked at each other but visually focused on a picture; this can immediately be given an audible label, which is an efficient way of communicating.
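
The three tuning measures compared in this chapter can be read as interval arithmetic on the transcribed channels. The sketch below is one possible way to compute them, not the procedure used in the thesis: it assumes each channel is available as a sorted list of non-overlapping (onset, offset) intervals in seconds, counts each overlapping stretch as one instance, and uses invented example data.

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping (onset, offset) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def tuning_measure(intervals, recording_s):
    """Return (percentage of recording time, number of instances)."""
    total = sum(end - start for start, end in intervals)
    return 100.0 * total / recording_s, len(intervals)

RECORDING_S = 300.0  # five minutes coded per recording

# Invented example data (seconds), for illustration only.
mother_looks = [(0.0, 60.0), (100.0, 200.0)]
infant_looks = [(30.0, 120.0), (150.0, 180.0)]
mother_sounds = [(40.0, 50.0), (110.0, 115.0), (160.0, 170.0)]
infant_sounds = [(45.0, 55.0), (165.0, 175.0), (250.0, 260.0)]

face_to_face = intersect(mother_looks, infant_looks)            # measure (1)
simultaneous_sound = intersect(mother_sounds, infant_sounds)    # measure (2)
sound_in_contact = intersect(face_to_face, simultaneous_sound)  # measure (3)

for name, intervals in [("face-to-face", face_to_face),
                        ("simultaneous sound", simultaneous_sound),
                        ("sound during contact", sound_in_contact)]:
    pct, count = tuning_measure(intervals, RECORDING_S)
    print(f"{name}: {pct:.1f}% of time, {count} instances")
```

Because measure (3) is simply the intersection of measures (1) and (2), the same helper covers all three.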

In the fourth chapter transmission of intentions is discussed. It is related to the frequency with which a person can see, hear, and interpret movements of a partner. In the literature, intentionality of young infants is still a matter of discussion, in which consciousness and goal-directedness play a major role. In mother-infant interaction an inequality seems to be present, but the mutual readiness to interpret and react to the partner's movements functions as if intentions are transmitted.
During face-to-face contact mother and infant can perceive each other's movements. In my approach, visual intentions of a person are assumed when mimical and head movements are seen by the partner during face-to-face contact. Audible intentions are assumed for those sound productions that occur during face-to-face contact, and intense intentions for those combined visual and audible intentions.
The two mother-infant pairs were compared with regard to the three kinds of transmitted intentions. Intra-pair comparisons were made because the mother is expected to transmit more audible intentions to the infant than the infant to the mother, probably thereby instructing the infant about the mother tongue. Inter-pair comparisons of the mothers and the infants were also made of the percentage of time and of the frequencies because intersubjective tuning was different. Equally, the infants were compared, to check if they offered comparable amounts of intentions to their mothers' interpretation.
During face-to-face contact the transmission of visual intentions appeared not to be different for the mother and the infant of one pair. However, when comparing the children, Claire appeared to transmit more visual intentions to her mother than Fanny did. During face-to-face contact EVE transmitted more visual intentions to Claire than SUSAN to Fanny, but this difference was not yet significant in the first five months. Mothers transmitted, as expected, significantly more audible intentions to their children than vice versa. Already in the first five months this difference was present, although more clearly for Claire and EVE than for Fanny and SUSAN. The children did not differ, while the mothers differed only with regard to the percentage of time and not for the frequency. This means that EVE's sentences had a longer duration than SUSAN's during face-to-face contact.
The transmission of intense intentions did not differ between the two pairs: the mothers and infants were roughly similar. Within the pairs, however, EVE differed from Claire, because she systematically used the intense intentions during mutual gaze. The impact of these differences on the development of speech communication is interpreted in the realm of speech instruction, in which the visual information about sound production (the audible intentions) is expected to become redundant.

The fifth chapter treats turntaking in its simplest form. It is a well-known aspect of communication systems, and can be regarded as a kind of feedback mechanism. Turntaking implies intentionality and intersubjectivity. In the literature cyclic behaviour is described from an early age onwards, as in gazes at the face of the mother and away from it. After about the fourth month alternated sound production becomes more prominent in mother-infant interaction.
The mother's turntaking is described in response to landmark sound productions (laryngeals, simple articulations, babbling sounds, and words) of the infants. On the one hand, the landmark sounds represent the ongoing speech motor development of the infants and new sound productions; on the other hand, they increasingly resemble adult speech sounds. The mothers are assumed to take audible turns upon these sounds within a certain inter-speaker switch-pause. The mother's turntaking was analysed only with regard to the onsets of her utterances because the mothers differed in the number of sound productions. Per group of landmark sounds the percentages of infant sounds with a mother-turn were compared for the two pairs. Both infants produced sounds in the four groups of sound productions studied. Two of these groups (laryngeals and simple articulations) had their onset in the first two recordings of the infants. EVE took her turns abundantly upon the sounds of Claire. SUSAN took some turns upon Fanny's early landmark sounds, but did so more consistently when the babbling sounds occurred. Fanny was then 32 weeks old. Feedback on sound production started much later for Fanny than for Claire. Fanny produced many more babbling sounds than Claire, possibly because she finally got audible reactions from her mother. One of the conclusions is that feedback on later appearing sound productions cannot compensate for the lack of it during the first five months. The impact for speech development is clear: parents should play the conversational game with their very young infant and should enjoy even the simple sound productions. They will recognise words in the sound stream, and probably sooner than they expected.
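
The percentage of infant sounds followed by a mother-turn could be computed along the following lines. This is a minimal sketch under stated assumptions, not the analysis used in the thesis: the abstract does not give the switch-pause window, so `max_pause_s = 1.0` is a placeholder, and the data layout, function name, and example values are hypothetical.

```python
from bisect import bisect_left

def percent_with_mother_turn(infant_sounds, mother_onsets, max_pause_s):
    """Percentage of infant sounds followed by a mother utterance onset
    within max_pause_s seconds of the infant sound's offset."""
    if not infant_sounds:
        return 0.0
    onsets = sorted(mother_onsets)
    hits = 0
    for _onset, offset in infant_sounds:
        k = bisect_left(onsets, offset)  # first mother onset at or after offset
        if k < len(onsets) and onsets[k] - offset <= max_pause_s:
            hits += 1
    return 100.0 * hits / len(infant_sounds)

# Invented example data (seconds), grouped by landmark sound type.
sounds_by_group = {
    "laryngeals": [(1.0, 2.0), (10.0, 11.0), (20.0, 21.0), (30.0, 31.5)],
    "babbling": [(50.0, 52.0), (60.0, 61.0)],
}
mother_onsets = [2.5, 14.0, 21.0, 40.0, 52.5, 61.5]
max_pause_s = 1.0  # assumed window; the source only says "a certain" switch-pause

turn_percentages = {
    group: percent_with_mother_turn(sounds, mother_onsets, max_pause_s)
    for group, sounds in sounds_by_group.items()
}
```

Only the onsets of the mother's utterances enter the calculation, in line with the analysis described above.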

The final chapter integrates the previous chapters, and discusses the chosen approach in relation to the results. A surprising result is the crucial impact of interaction patterns, especially during the first five months, upon the outcome of the speech developmental processes at the age of two. The sensori-motor approach has enabled us to formulate suggestions about how the fundamental characteristics of speech communication systems are gradually mastered by the mother and the infant. Further research is suggested in line with the possibilities of the sensori-motor approach. When speech developmental problems can be predicted early in mother-infant interaction, such problems can probably be prevented to a large extent as well.
An outline is given for a method to evaluate mother-infant interaction in a laboratory setting. Depending upon the further elaboration of the present ethological approach, and on practical and economic consequences, mother-infant pairs that are at risk of communicative problems may request early guidance.

Vroomen, J., Collier, R. & Mozziconacci, S. (1993). 'Duration and intonation in emotional speech'. Proceedings Eurospeech'93, Berlin.

Wang, X. (1993). 'Durational modelling in HMM-based speech recognition: towards a justified measure'. Proceedings NATO-ASI on New Advances and Trends in Speech Recognition and Coding, Bubion (Granada).

Weenink, D.J.M. (1993). 'Modelling speaker normalization by adapting the bias in a neural net'. Proceedings Eurospeech'93, Berlin.

Wang, X., Bosch, L.F.M. ten & Pols, L.C.W. (1993). 'Impact of dimensionality and correlation of observation vectors in HMM-based speech recognition'. Proceedings Eurospeech'93, Berlin.

Wieringen, A. van, Cullen, J.K. & Pols, L.C.W. (1993). 'The perceptual relevance of CV- and VC- transitions in identifying stop consonants: cross-language results'. Proceedings Eurospeech'93, Berlin.

Wijnen, F., Krikhaar, E. & den Os, E.A. (1993). 'The (non) realisation of unstressed elements in children's utterances: a rhythmic constraint?'. Journal of Child Language (in press).

Zanten, E. van, Damen, L.W.M. & Houten, E. van (1993). 'Collecting data for a speech database'. In: V.J. van Heuven & L.C.W. Pols (Eds.), Analysis and synthesis of speech. Strategic research towards high-quality text-to-speech generation, Mouton de Gruyter, Berlin, 207-222.
