Introduction to NKI-CorpusBuilder

NKI-CorpusBuilder (NCB) is an application for building and managing speech corpora on top of Praat (www.praat.org).

What is NCB

NKI-CorpusBuilder is a platform to build and manage portable spoken language corpora. It is a Graphical User Interface (GUI) in front of Praat that simplifies recurrent tasks in corpus construction and management. It also contains a number of handy scripts for these tasks.

NCB takes the view that a corpus is a type of Archive. This archive stores Media files (the recordings), primary Annotations of the recordings, Meta-data about the recordings and annotations, and human Evaluations of the recordings. Because the corpus is viewed as an Archive, NCB will not overwrite existing files unless specifically instructed to do so. Moreover, NCB will never overwrite Media files, even if you instruct it to. If you want to remove or change an existing recording, you will have to use a different application. In automatic tasks, NCB will often refuse to overwrite existing non-Media files too, e.g., Annotations.
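The overwrite policy described above can be illustrated with a minimal Python sketch. This is not NCB's own code (NCB is built on Praat): the function name `archive_write`, the `overwrite` flag, and the list of media extensions are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical set of media extensions; the list NCB actually uses is not given here.
MEDIA_SUFFIXES = {".wav", ".aifc", ".flac", ".mp3"}


def archive_write(path, data, overwrite=False):
    """Write bytes to path under an archive-style policy.

    Returns True if the file was written, False if the write was refused.
    """
    target = Path(path)
    if target.exists():
        # Existing media files are never replaced, whatever the caller asks.
        if target.suffix.lower() in MEDIA_SUFFIXES:
            return False
        # Other existing files are replaced only on explicit request.
        if not overwrite:
            return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return True
```

Under this sketch, a second write to an existing annotation file fails unless `overwrite=True` is passed, and a second write to an existing recording fails in every case.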

NCB stores all non-Media data as text files. You are very strongly advised to put all non-media files, e.g., meta-data and annotations, into a version control system. A good option is to use git (www.git.org) to store textual data.

What NCB is not

NKI-CorpusBuilder is not designed for recording corpora. Recording is best done with a solid state recording device and a good microphone. The importance of using a good microphone cannot be overstated. However, NCB does offer some limited means to record audio.

Corpus structure

NCB assumes a fairly general logical corpus structure. The real structure can differ a lot from the logical structure. The mapping of this logical structure to the real directories is done in the CorpusLayout.tsv file. Here, we will assume the real structure, i.e., directory names etc., matches the logical structure. But you are free to structure your corpus the way you like and name the directories anything you like.
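The idea of mapping logical names to real directories can be sketched as follows. The actual column layout of CorpusLayout.tsv is not described here, so this Python example assumes a hypothetical two-column, tab-separated format (logical name, real directory). The function names `read_layout` and `real_path` and the directory names in the usage note are likewise illustrative only.

```python
import csv
import io


def read_layout(tsv_text):
    """Map logical component names to real directory names.

    Assumes two tab-separated columns per row: logical name, real directory.
    Blank rows and rows starting with '#' are skipped.
    """
    layout = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue
        layout[row[0].strip()] = row[1].strip()
    return layout


def real_path(layout, logical_path):
    """Translate a logical path using the longest matching logical prefix."""
    parts = logical_path.split("/")
    for n in range(len(parts), 0, -1):
        prefix = "/".join(parts[:n])
        if prefix in layout:
            return "/".join([layout[prefix]] + parts[n:])
    # No mapping found: the real structure matches the logical one.
    return logical_path
```

With a layout that maps `Corpus` to `data` and `Corpus/Overlays` to `overlays`, the logical path `Corpus/Overlays/set1` resolves to `overlays/set1`, while any other path under `Corpus` resolves under `data`.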

NCB recognizes four components in a corpus:

Except for the Documentation, each logical component of a corpus contains at least four logical subdivisions:

Missing files in the Info, Annotations, and Texts components are automatically generated when a media file is opened. Default values for the initial content of new files in these components can be given inside every directory as dot files, i.e., .Info, .Annotations, and .Texts. These files reside inside the directory they should be applied to. That is, the initial TextGrid content of a subdirectory subset in Annotations/subset will be stored in Annotations/subset/.Annotations. Times in the initial TextGrid file will be scaled to the media file.
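The dot-file convention and the time scaling can be sketched in Python. This assumes the real directory names match the logical ones, as in the rest of this page. How NCB scales the times is not specified here, so simple linear scaling is assumed. The function names are hypothetical.

```python
from pathlib import PurePosixPath


def default_template_path(directory):
    """Return the dot file holding the defaults for a directory.

    The dot file is named after the top-level component, e.g.
    'Annotations/subset' -> 'Annotations/subset/.Annotations'.
    """
    directory = PurePosixPath(directory)
    component = directory.parts[0]
    return directory / ("." + component)


def scale_times(times, template_duration, media_duration):
    """Scale template time points linearly to the media file's duration.

    Linear scaling is an assumption; NCB's actual rule is not documented here.
    """
    if template_duration <= 0:
        raise ValueError("template_duration must be positive")
    factor = media_duration / template_duration
    return [t * factor for t in times]
```

For example, a template TextGrid spanning 1 second with a boundary at 0.5 s, applied to a 12 second recording, would get its boundary at 6 s under this assumption.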

The Corpus part contains an Overlays directory (Corpus/Overlays) which stores stand-off annotations (wiki.tei-c.org/index.php/Stand-off_markup). Each separate set of stand-off markup is stored in a separate sub-tree under Corpus/Overlays. Note that logically this is a subdirectory of the Corpus part, but by default it is stored next to the Corpus directory.

Note: Adding Recordings/Overlays or Evaluation/Overlays to the CorpusLayout.tsv file will generate these overlay directories automatically.

The Evaluation part also contains some extra parts:


© R.J.J.H. van Son, April 4, 2012