Introduction
Research into speech and language needs large corpora of "natural" language.
Sizes
- hand-labeled speech data:
~50 kwords (~5 hours, IFA corpus)
- hand-transcribed speech:
~10 Mwords (~1000 hours, Spoken Dutch Corpus, CGN)
- text corpora (newspapers, memos, emails, Ceefax):
>100 Mwords
- Automatic Speech Recognition:
~250 Mwords unlabeled text
~50 kwords unlabeled speech (on-site recordings)
- Automatic Text-to-Speech synthesis:
~10 kwords per speaker (~1 hour of scripted, hand-aligned speech)
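The speech figures above can be cross-checked against each other: each pairs a word count with a duration, so each implies a speaking rate. A minimal sketch in Python, using only the approximate figures from the list; the names `corpora` and `words_per_hour` are illustrative and not part of any corpus tooling.

```python
# Approximate (words, hours) pairs taken from the size list above.
corpora = {
    "IFA corpus (hand-labeled)": (50_000, 5),
    "Spoken Dutch Corpus (CGN)": (10_000_000, 1000),
    "TTS voice (one speaker)": (10_000, 1),
}


def words_per_hour(words, hours):
    """Return the average number of words per hour of recorded speech."""
    return words / hours


# Compute the implied speaking rate for every corpus.
rates = {name: words_per_hour(w, h) for name, (w, h) in corpora.items()}

for name, rate in rates.items():
    print(f"{name}: ~{rate:,.0f} words/hour")
```

With these rounded inputs, all three entries work out to the same rate of about 10 kwords per hour, so the listed sizes are mutually consistent.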
Issues
- complexity
collecting, annotating, archiving, workflow
- people required
many tasks require specialists
- copyrights, licenses, privacy
these block the use of many TV/radio broadcasts and newspaper texts
- distribution and access
the Spoken Dutch Corpus (CGN) takes up 175 CD-ROMs