|9.00-9.25||Welcome with coffee and tea|
|9.30||“Knock, Knock. Who's there?” - Speaker Tracking in the BATS Project
Marijn Huijbregts, CLST, Radboud University Nijmegen
|10.00||Speech based audiovisual content annotation and contextualisation at NISV
Roeland Ordelman, Human Media Interaction - University of Twente & Research and Development - Netherlands Institute for Sound and Vision
|10.30||Unveiling Personal Memories of War and Detention
Arjan van Hessen, HMI, Twente University, Enschede & Telecats, Enschede
|11.00-11.30||Coffee and tea|
|11.30||Speech Applications based on websites – a feasibility assessment
Its Kievits, Dialogs Unlimited BV, Breda
|12.00||The challenges of forensic application of automatic speaker recognition
David van Leeuwen, TNO, Soesterberg & CLST, Radboud University Nijmegen
|14.00||ASR-based CALL: integrating automatic speech recognition (ASR) in computer-assisted language learning (CALL)
Helmer Strik, CLST, Radboud University Nijmegen
|14.30||Using Speech Technology to Assist during Pathological Speech therapy
Rob van Son, University of Amsterdam
|15.00-15.30||Coffee and tea|
|15.30||Automatic assessment of native, normally formed, read or repeated speech
Hugo Van hamme, ESAT, KU Leuven
|16.00||E-learning based Speech Therapy: generating a database of pathological speech
Lilian Beijer, Sint Maartenskliniek RD&E, Nijmegen
Creating large digital multimedia archives is no problem. With an investment of less than two hundred euros, for example, it is possible to record the Dutch public television broadcast channels every single day for about a year. Such an archive would fill a 1.5-terabyte hard drive and would contain over 7000 hours of video. Creating the archive is no problem; efficiently finding information in it is the challenge.
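A quick back-of-the-envelope check of the figures above (the numbers are the ones from the text; the derived bitrate is an estimate):

```python
# 7000 hours of video on a 1.5 TB drive implies a fairly low bitrate.
hours = 7000
capacity_bytes = 1.5e12          # 1.5 TB (decimal terabytes)

bytes_per_hour = capacity_bytes / hours
bitrate_mbps = bytes_per_hour * 8 / 3600 / 1e6

print(f"{bytes_per_hour / 1e6:.0f} MB per hour")      # ~214 MB per hour
print(f"{bitrate_mbps:.2f} Mbit/s average bitrate")   # ~0.48 Mbit/s
```

About half a megabit per second, i.e. heavily compressed video, which is consistent with the very low hardware budget quoted.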
An effective method of searching multimedia archives and collections is to run automatic speech recognition on each file and to apply standard search techniques to the speech transcriptions. This makes it possible to find video fragments on the basis of what has been said.
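The "standard search techniques" can be as simple as an inverted index over time-stamped transcript words. A toy illustration with hypothetical ASR output, mapping each query word to (file, timestamp) hits:

```python
from collections import defaultdict

# Hypothetical ASR output: per file, a list of (start_time_seconds, word).
transcripts = {
    "news_2007_03_12.mp4": [(12.4, "cycling"), (13.0, "race")],
    "sports_talk.mp4": [(301.2, "race"), (302.1, "winner")],
}

# Inverted index: word -> list of (file, timestamp) occurrences.
index = defaultdict(list)
for filename, words in transcripts.items():
    for t, w in words:
        index[w].append((filename, t))

def search(word):
    """Return video fragments where the word was (probably) spoken."""
    return index.get(word, [])

print(search("race"))
# [('news_2007_03_12.mp4', 13.0), ('sports_talk.mp4', 301.2)]
```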
By applying speech recognition it is possible to search an archive for content words, but it is not possible to answer queries such as: “Find a video fragment where Armstrong talks about the Amstel Gold Race”. In the BATS project we attempt to solve these kinds of queries by applying speaker tracking (“Armstrong”) and topic detection (“Amstel Gold Race”).
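Conceptually, once each fragment carries an automatic speaker label and topic labels, the example query becomes a simple filter. A minimal sketch with invented data (this is not the BATS system itself):

```python
# Hypothetical output of speaker tracking and topic detection per fragment.
fragments = [
    {"file": "a.mp4", "t": 40.0, "speaker": "Armstrong", "topics": {"Tour de France"}},
    {"file": "b.mp4", "t": 95.5, "speaker": "Armstrong", "topics": {"Amstel Gold Race"}},
    {"file": "b.mp4", "t": 120.0, "speaker": "interviewer", "topics": {"Amstel Gold Race"}},
]

def find(speaker, topic):
    """Fragments where the given speaker talks about the given topic."""
    return [(f["file"], f["t"]) for f in fragments
            if f["speaker"] == speaker and topic in f["topics"]]

print(find("Armstrong", "Amstel Gold Race"))  # [('b.mp4', 95.5)]
```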
BATS, Topic and Speaker tracking in Broadcast Archives, is a joint project of the University of Leuven and the Radboud University Nijmegen, funded by ICTRegie and IBBT. In my talk I will focus on the speaker tracking task. I will explain why it is a challenge to automatically determine the identity of each individual speaker in a collection, and I will describe our approach to solving this challenge.
The audiovisual archive of the Netherlands Institute for Sound and Vision (NISV) consists of more than 700,000 hours of radio, television, documentaries, films and music, and is growing every day (15,000 hours of video annually). As the traditional manual annotation process is costly and limited by definition, new annotation strategies need to be explored to enable access for the variety of user types, both professional and non-professional, in our present-day information society. In my talk, I will give an overview of new annotation and contextualisation strategies that are being deployed or tested within the context of the NISV archive, and zoom in on strategies that make use of the speech present in audiovisual content.
Recording and publishing your “own” AV-recorded memories is so easy nowadays that nearly everyone can (and perhaps will) do it. Of course, not all the recorded material will be of great historical or social interest, but how do we decide what is valuable and what is not? Most AV-recorded material is not, or only sparsely, enriched with useful metadata. So, to unveil these recordings, metadata is necessary. One of the most promising technologies for adding metadata is automatic speech recognition: a technology that transforms speech into the most likely sequence of spoken words. At present, reliable recognition (95% of words correct) is not possible, and we have to deal with imperfections: sometimes no more than 40% of the words are recognized correctly.
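Figures such as “95% correct” or “40% correct” refer to word-level recognition accuracy, conventionally derived from the word error rate: the minimum number of substitutions, deletions and insertions needed to turn the recognizer's output into the reference transcript, divided by the reference length. A minimal sketch with an invented example:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance: (S + D + I) / len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on a hat")
print(f"WER = {wer:.2f}, word accuracy = {1 - wer:.0%}")  # WER = 0.33, word accuracy = 67%
```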
Nevertheless, ASR is suitable for unveiling spoken memories, and in recent years we have seen an increasing number of such projects. In this talk we will present an overview of two upcoming Oral History projects: Sobibor and MATRA.
In the Sobibor project, 35 interviews with “Nebenkläger” (relatives of people killed in Sobibor) and survivors of the Sobibor camp are being aligned. Because not all interviewees speak Dutch, multilinguality becomes an issue here.
In the MATRA project, 500 inhabitants of Croatia will be interviewed about their memories of the Yugoslav civil war (1991–1995). Full speech recognition for Croatian does not exist yet, so other technologies will be used to unveil these data. Moreover, because relatively few people understand Croatian, full translations into English and automatic term translation into other languages will be made in order to unveil the data as much as possible.
The principle of basing speech applications on websites (“the principle”) is usually frowned upon by both IT and speech experts. The process is often dismissively referred to as “screen scraping”, indicating a lack of understanding of its technological aspects. Moreover, the opportunities it offers are not always valued for their huge potential.
This presentation discusses the pros and cons of the principle by weighing the benefits of the value-added applications against the technological possibilities and constraints. The fundamental advantage of the principle is that, in the current web-centric world, many benefits can be obtained from a standardized web interface as the single source for all communication channels. This way of interfacing allows for speed and efficiency in the creation and life-cycle management of quickly evolving content and service concepts.
Thanks to implementations on some commercial websites, having website texts read aloud in the browser by speech synthesis is already a familiar phenomenon for many people. Speech input offers at least the same potential. Some areas and solutions that already benefit, or could benefit, from both are:
• Designing and prototyping speech applications for self service.
• Integrating multi-channel applications for computers and mobile devices.
• Powerful multi-channel solutions, e.g. employee and customer satisfaction feedback and ICT helpdesks.
• Multi-modal use of computers and mobile devices, e.g. for hands-free operation or to make them more accessible for people with impairments.
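The “screen scraping” step that the principle rests on can be sketched with Python's standard library: extract the readable text from a page so that any speech synthesiser can consume it (the page content below is illustrative, and the synthesis call itself is left out):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping script/style content."""
    SKIP = {"script", "style"}   # never read these aloud

    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

    def text(self):
        return " ".join(self.parts)

page = ("<html><body><h1>Opening hours</h1>"
        "<p>Open daily 9-17.</p><script>x=1</script></body></html>")
extractor = TextExtractor()
extractor.feed(page)
print(extractor.text())  # Opening hours Open daily 9-17.
```

In a real deployment the extracted text would of course come from the live website, keeping the web interface the single source for the voice channel as described above.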
The success of the deployed applications is strongly determined by the capabilities and constraints of speech technology. Some factors that cannot be resolved by the application developer and require a fundamental approach are:
• Dealing with incomplete or irrelevant information.
• Dealing with “real” natural language and the adoption of foreign words.
• Dealing with background noise, background voices and environmental acoustics (speech recognition only).
Automatic speaker recognition is an area of speech technology that is enjoying increasing interest from the research community. In recent years, the application of this technology to the forensic domain has been investigated. The general idea is that a recording of an incriminating speech utterance can help to identify the perpetrator of a crime. The first application scenario is to use speaker recognition technology in the criminal investigation itself: to narrow down the search for suspects using the recording. In a second stage, the application scenario is to use the technology to produce evidence supporting the hypothesis that a suspect is the source of the recording.
The speaker recognition community is in general very careful about applying the technology to new domains, and in this presentation some aspects of both application scenarios are put forward. The specific challenges and necessary research directions are reviewed, and where possible a comparison with current practice is made.
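In forensic speaker comparison, the strength of the evidence is commonly expressed as a likelihood ratio: how much more probable the observed recognition score is under the same-source hypothesis than under the different-source hypothesis. A minimal sketch with assumed Gaussian score distributions and invented calibration numbers (not the actual system discussed in the talk):

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical calibration: score distribution when the suspect IS the
# source (same-speaker) vs. when they are NOT (different-speaker).
same_mean, same_std = 3.0, 1.0
diff_mean, diff_std = 0.0, 1.0

def likelihood_ratio(score):
    """LR > 1 supports the same-source hypothesis, LR < 1 the alternative."""
    return (gaussian_pdf(score, same_mean, same_std)
            / gaussian_pdf(score, diff_mean, diff_std))

score = 2.5  # score of the questioned recording against the suspect model
print(f"LR = {likelihood_ratio(score):.1f}")  # LR = 20.1
```

The point of the LR framework is precisely the caution the abstract mentions: the expert reports the strength of the evidence, while the decision is left to the court.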
More and more computer-assisted language learning (CALL) applications have 'speech inside'. In most cases, however, the speech is produced by the system, i.e. speech is output: the CALL system reads utterances aloud, avatars or movies are shown, and the student has to listen and respond (usually by means of a mouse or keyboard). In some of these CALL systems the student is also asked to speak. What these systems do with the utterances spoken by the students differs: nothing at all; recording the speech so that the teacher can listen to it afterwards; or giving the student the opportunity to immediately listen to (and/or look at a display of) the recorded utterance, and possibly compare it with an example of a correctly pronounced utterance.
In a few systems, automatic speech recognition (ASR) is used to give more detailed feedback. ASR can briefly be described as the conversion of speech into text by a computer. The performance of ASR systems has gradually improved over the last decades, but ASR is certainly not error-free, and probably never will be, especially for so-called atypical speech (speech of non-natives or of people with communicative disabilities). An important question, then, is when and how ASR can usefully be incorporated in applications such as CALL applications. In my presentation, I will make clear what ASR can and cannot (yet) do within the context of CALL and atypical speech. Although ASR is not error-free, it can be applied successfully in many applications if one carefully takes its limitations into account. The best-known application at the moment is probably the reading tutor, but there are other possibilities. I will present some examples of such applications.
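The feedback step of a reading tutor can be sketched very simply: because the prompt is known, the ASR output only needs to be compared against it. A toy illustration with invented data (real systems align the two sequences rather than comparing position by position):

```python
# The prompt is known in advance; the ASR output may contain miscues.
prompt = "the quick brown fox jumps".split()
recognized = "the quick down fox".split()

feedback = []
for i, target in enumerate(prompt):
    if i >= len(recognized):
        feedback.append(f"'{target}': not read")
    elif recognized[i] != target:
        feedback.append(f"'{target}': read as '{recognized[i]}'")

print(feedback)
# ["'brown': read as 'down'", "'jumps': not read"]
```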
Pathological speech developing as a result of oncological treatment has a significant negative impact on the quality of life of patients. Studies have shown that improvements in speech quality and intelligibility can indeed significantly improve patients' quality of life. To achieve these improvements in clinical treatment, the speech quality of individual patients needs to be evaluated and followed over time, both to address their specific problems and to collect evidence for selecting the best course of treatment.
Currently, pathological speech can only be evaluated by human judges, who are scarce, using subjective measures. The use of panels of human judges is not feasible during routine treatment. Moreover, subjective human evaluations are less than optimal for evidence-based treatment selection. Therefore, efforts have recently been made to introduce objective methods and automatic evaluations of the intelligibility and quality of pathological speech, to improve reliability and reduce cost.
Two such initiatives will be discussed, from the universities of Erlangen/Nürnberg and Gent. Both systems have been used in clinical practice. The Erlangen/Nürnberg system uses a standard ASR system trained on normal speech; the word error rate of the ASR is correlated with the intelligibility of the speech. The Gent system uses a speech-feature recognizer trained on normal speech, with a back-end trained to correlate recognized speech features with intelligibility.
Currently, very little is known about the way human listeners and automatic speech recognizers “react” to pathological speech. An obvious way to study this is to generate benchmark synthetic speech with well-defined pathologies. Recent attempts to synthesize and manipulate pathological speech for such purposes will be discussed.
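The WER-based idea can be illustrated in a few lines: if the ASR word error rate per speaker tracks the human intelligibility rating, WER can stand in as an objective measure. The numbers below are invented for illustration only:

```python
# Hypothetical per-speaker data: ASR word error rate vs. human rating (1-10).
wer = [0.10, 0.25, 0.40, 0.60, 0.80]
intelligibility = [9.0, 7.5, 6.0, 4.0, 2.5]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(f"r = {pearson(wer, intelligibility):.3f}")  # strong negative correlation
```

A strongly negative r on real clinical data is exactly what licenses using WER as an intelligibility proxy.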
In reading education and speech therapy, teachers and therapists often need to assess whether a known utterance is pronounced up to the expected standard. While training their reading skills, regular pupils as well as persons with a reading disorder may produce reading miscues. One of the tasks of the teacher or therapist is to detect these (reading skill evaluation) and give corrective feedback (training). In another setting, persons who have lost their hearing and have received a cochlear implant need to be trained to use their new bionic ear. A therapist reads a sentence, which the patient is to repeat as accurately as possible.
In the therapy and evaluation settings of the above examples, a one-on-one setting is used in practice. This is an expensive solution, both in terms of labour cost and in terms of the logistics of bringing patient and therapist together. Reading training is often done collectively in today's classrooms, but more personalized training is in demand. The result is that the number of one-on-one practice hours falls short of the ideal. This calls for computer programs that incorporate automated methods of speech assessment and that the pupil or patient can use in addition to the scheduled contact hours. Additionally, automated methods have the advantage of endless patience and freedom from examiner bias: they apply the same metrics to everyone, irrespective of examiner, place, time and history.
In this contribution, we show how speech recognition technology can be applied to arrive at an automated assessment. We describe a method for dealing with imperfect phone recognition while exploiting acoustic, lexical and phonotactic knowledge, as well as knowledge of the intended sentence. Finally, by giving performance data in real settings, we show what we can and cannot expect from automated speech assessment.
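The key simplification that knowledge of the intended sentence brings can be sketched as utterance verification: the recognized phones only need to be matched against the known target, not decoded freely. A minimal illustration with invented phone strings and an assumed acceptance threshold (not the method of the talk itself):

```python
from difflib import SequenceMatcher

# Hypothetical phone sequences for the target utterance and the ASR output.
target_phones = ["dh", "ax", "k", "ae", "t"]        # "the cat"
recognized_phones = ["dh", "ax", "k", "ah", "t"]    # one miscue: ae -> ah

# Similarity of the two sequences (2 * matches / total length).
ratio = SequenceMatcher(None, target_phones, recognized_phones).ratio()
print(f"phone match ratio: {ratio:.2f}")

THRESHOLD = 0.8  # assumed acceptance threshold for illustration
print("accepted" if ratio >= THRESHOLD else "needs corrective feedback")
```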
In Nijmegen, a web application for speech training of neurological patients with dysarthric speech has been developed. This web application, E-learning based Speech Therapy (EST), provides patients with diminished speech intelligibility due to neurological diseases (e.g. stroke or Parkinson's disease) with the possibility to practice speech in their own environment. The key point of the EST infrastructure is a central server to which both therapists and patients have access. The server contains audio files of both target speech and patients' pathological speech. Therapists can remotely compose a tailor-made speech training program containing audio files of target speech. Patients have access to these files and attempt to approach the target. They can upload their own speech to the central server, thus generating a database of pathological (i.e. dysarthric) speech. Therapists can monitor their patients' uploaded speech over time by downloading and analyzing the speech files.
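The patient-side upload step of such a client-server setup could look roughly as follows. The server URL, endpoint and query parameter are purely hypothetical; EST's actual protocol is not described in the abstract:

```python
import urllib.request

def build_upload_request(wav_bytes, patient_id,
                         server="https://est.example.org"):
    """Build an HTTP POST request that sends one recording to the server.

    The URL scheme is an assumption made for this sketch.
    """
    return urllib.request.Request(
        f"{server}/upload?patient={patient_id}",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )

request = build_upload_request(b"RIFF...", "patient-042")
print(request.get_method(), request.full_url)
# POST https://est.example.org/upload?patient=patient-042
```

Sending the request would then be a matter of `urllib.request.urlopen(request)`; the therapist side would download and analyze the stored files in the same way.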
Apart from its therapeutic benefits, EST, by automatically generating a database of dysarthric speech, provides researchers in the field of speech technology with a large amount of speech data. This source of pathological speech is vital for developing tools for automatic error detection in speech and for automatic recognition of dysarthric speech. In the long term, the results might enhance the communicative independence of patients with various degrees of dysarthria. Moreover, new developments in the field of automatic speech recognition of severely dysarthric speech might be applied in home automation (domotics).
The lecture day will take place in the large conference room of the Max Planck Institute for Psycholinguistics in Nijmegen. This room is located at the back left of the institute's entrance hall.
During the lunch break, sandwiches, soup, fruit, and various drinks will be available in the canteen of the Max Planck Institute.
Directions can be found at: http://mpi.nl/institute/visitors/how-to-get-there/how-to-get-to-the-institute/copy_of_how-to-get-to-the-institute.