
Introduction

Research into speech and language needs large corpora of "natural" language.

hand-labeled speech data:
~50 kwords (~5 hours, IFA corpus)
hand-transliterated speech:
~10 Mwords (~1000 hours, Spoken Dutch Corpus, CGN)
text corpora (newspapers, memos, emails, Ceefax):
>100 Mwords
Automatic Speech Recognition:
~250 Mwords unlabeled text
~50 kwords unlabeled speech (on-site recordings)
Automatic Text-to-Speech synthesis:
~10 kwords per speaker (~1 hour of scripted, hand-aligned speech)
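
The speech figures above that give both a word count and a duration all imply about the same speaking rate. A quick check, as a sketch using only the approximate numbers quoted in the list (the dictionary keys and the helper name `words_per_hour` are illustrative, not part of any corpus tool):

```python
# Approximate (words, hours) pairs as quoted in the list above.
# All values are the rounded "~" figures from the text, not exact counts.
corpora = {
    "IFA corpus (hand-labeled)": (50_000, 5),
    "Spoken Dutch Corpus, CGN (hand-transliterated)": (10_000_000, 1000),
    "Text-to-Speech synthesis, per speaker": (10_000, 1),
}


def words_per_hour(words, hours):
    """Return the average number of words spoken per hour of audio."""
    return words / hours


# Compute the implied speaking rate for each corpus.
rates = {name: words_per_hour(w, h) for name, (w, h) in corpora.items()}

for name, rate in rates.items():
    print(f"{name}: ~{rate:,.0f} words/hour (~{rate / 60:.0f} words/minute)")
```

Each entry works out to roughly 10 kwords per hour of speech, so the word counts and durations in the list are mutually consistent.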

complexity
collecting, annotating, archiving, workflow
people required
many tasks require specialists
copyrights, licenses, privacy
block the use of many TV/radio broadcasts and newspaper texts
distribution and access
the Spoken Dutch Corpus (CGN) takes up 175 CD-ROMs