NKI-CorpusBuilder (EN)

NKI-CorpusBuilder (NCB) is an application for building and managing speech corpora on top of Praat (www.praat.org).

What is the NKI-CorpusBuilder

Getting started

Overview of the Main page

Overview of the Configuration page

Tutorials

Corpus Layout

NCB Copyright and License

What's new?

Introduction to NKI-CorpusBuilder

NKI-CorpusBuilder (NCB) is an application for building and managing speech corpora on top of Praat (www.praat.org).

What is NCB

NKI-CorpusBuilder is a platform to build and manage portable spoken language corpora. It is a Graphical User Interface (GUI) in front of Praat that simplifies recurrent tasks in corpus construction and management. It also contains a number of handy scripts for these tasks.

NCB takes the view that a corpus is a type of Archive. This archive stores Media files (the recordings), primary Annotations of the recordings, Meta-data about the recordings and annotations, and human Evaluations of the recordings. Because the corpus is viewed as an Archive, NCB will not overwrite existing files unless specifically instructed to do so. Moreover, NCB will never overwrite Media files, even if you instruct it to do so. If you want to remove or change an existing recording, you will have to use a different application. In automatic tasks, NCB will often refuse to overwrite existing non-Media files too, e.g., Annotations.
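
The overwrite policy described above can be illustrated with a short sketch. This is not NCB's own code (NCB is built on Praat scripts); it is a Python illustration under stated assumptions. The function names `is_media` and `may_write` are hypothetical, and the rule that a file counts as Media when it lives under a directory named Media is an assumption made for the example.

```python
from pathlib import Path


def is_media(path: Path) -> bool:
    # Assumption: a file is a Media file when one of its directories is named "Media".
    return "Media" in path.parts


def may_write(path: Path, overwrite: bool = False) -> bool:
    """Return True when writing to `path` respects the archive policy."""
    if not path.exists():
        return True       # new files can always be created
    if is_media(path):
        return False      # existing Media files are never overwritten
    return overwrite      # other existing files only on explicit request
```

With this sketch, `may_write(path, overwrite=True)` still returns False for an existing recording, which mirrors the statement that changing a recording requires a different application.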

NCB stores all non-Media data as text files. You are very strongly advised to put all non-Media files, e.g., meta-data and annotations, into a version control system. A good option is to use git (git-scm.com) to store textual data.

What NCB is not

NKI-CorpusBuilder is not designed for recording corpora. Recording is best done using a solid state recording device and a good microphone. The importance of using a good microphone cannot be overstated. However, NCB does offer some limited means to record audio.

Corpus structure

NCB assumes a fairly general logical corpus structure. The real structure can differ a lot from the logical structure. The mapping of this logical structure to the real directories is done in the CorpusLayout.tsv file. Here, we will assume that the real structure, i.e., directory names etc., matches the logical structure. But you are free to structure your corpus the way you like and to name the directories anything you like.

NCB recognizes four components in a corpus:

Documentation All documentation about the corpus
Recordings The raw, initial recordings as made in the field (original recordings might be removed)
Corpus The corpus proper
Evaluation Any additional human evaluation done on the corpus

Except for the Documentation, each logical component of a corpus contains at least four logical subdivisions:

Media The recordings
Info The meta-data as a tab-separated key-value file (tsv)
Annotations Annotations are stored as Praat TextGrid files
Texts Any textual data connected to the recordings
Overlays Stand-off annotations, Corpus part only

Missing files in the Info, Annotations, and Texts components are automatically generated when a media file is opened. Default values for the initial content of new files in these components can be given inside every directory as dot files, i.e., .Info, .Annotations, and .Texts. These files reside inside the directory they should be applied to. For example, the initial TextGrid content for files in the subdirectory Annotations/subset is stored in Annotations/subset/.Annotations. Times in the initial TextGrid file will be scaled to the media file.
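
As an illustration of where such a default file is looked up, here is a minimal Python sketch. It is not taken from NCB itself; the function name `default_template` is hypothetical, and the sketch assumes, as this page does, that the real directory names match the logical names, so that the dot file is named after the last element of the component path.

```python
from pathlib import PurePosixPath


def default_template(component_dir: str, relative_file: str) -> PurePosixPath:
    """Path of the dot file holding the default content for a new file."""
    component = PurePosixPath(component_dir)
    # The dot file is named after the subdivision: .Info, .Annotations or .Texts
    dot_name = "." + component.name
    # It resides in the same directory as the file that is about to be created
    return (component / relative_file).parent / dot_name


print(default_template("Annotations", "subset/recording.TextGrid"))
# Annotations/subset/.Annotations
```

The sketch only locates the default file; the scaling of the times in a default TextGrid to the media file is not covered.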

The Corpus part contains an Overlays directory (Corpus/Overlays) which stores stand-off annotations (wiki.tei-c.org/index.php/Stand-off_markup). Each separate set of stand-off markup is stored in a separate sub-tree under Corpus/Overlays. Note that logically this is a subdirectory of the Corpus part, but by default it is stored next to the Corpus directory.

Note: Adding Recordings/Overlays or Evaluation/Overlays to the CorpusLayout.tsv file will generate these overlay directories automatically.

The Evaluation part also contains some extra parts:

Experiments Control files and content used to run evaluation experiments
Responses The actual response files from the evaluations

Corpus Layout

The logical layout of the corpus is stored in the CorpusLayout.tsv file as a tab-separated values table.

At the bottom of this page is an example of the CorpusLayout.tsv file. It consists of three columns labeled Key, Value, and Description separated by tabs.

Key The logical name of the component. For example, Documentation/Speakers denotes a top directory Documentation with a sub-directory Speakers. This directory will receive the Speaker meta-data.
Value The real path of the component. This can be any directory inside the corpus, with any name. It can be nested inside other directories. When NCB tries to access this component, it will simply append this path to the root of the corpus.
Description A description of the component for the benefit of anyone who tries to understand the structure of the corpus.
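
The way the Key and Value columns are used can be sketched in a few lines of Python. This is an illustration only, not NCB's implementation; the function name `read_layout` and the corpus root `/data/mycorpus` are hypothetical. Rows whose Value is empty or `-` are skipped, following the rule on this page that such parts do not exist for NCB.

```python
import csv
import io
from pathlib import PurePosixPath


def read_layout(tsv_text: str, corpus_root: str) -> dict[str, PurePosixPath]:
    """Map each logical Key to its real directory under corpus_root."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    layout = {}
    for row in reader:
        value = (row.get("Value") or "").strip()
        if value in ("", "-"):
            continue  # this part does not exist in the corpus
        # The real path is the Value appended to the root of the corpus
        layout[row["Key"]] = PurePosixPath(corpus_root) / value
    return layout


example = (
    "Key\tValue\tDescription\n"
    "Corpus/Media\tCorpus/Media\tCorpus content: Media files\n"
    "Corpus/Overlays\tCorpusOverlays\tAlternative Corpus Annotations\n"
    "Recordings/Media\t-\tOriginal audio recordings\n"
)
layout = read_layout(example, "/data/mycorpus")
print(layout["Corpus/Overlays"])  # /data/mycorpus/CorpusOverlays
```

The Description column is ignored here because it only serves human readers.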

NCB divides a corpus into four logical components. Each component is again subdivided into parts. The corresponding directories will be created when they are absent. If the value of a part is empty (or -), the corresponding part will not exist for NCB, and will not be created when NCB opens the corpus. So, if a corpus has no Recordings component, all the values for the Recordings/ keys should be empty. The logical components are:

Documentation All documentation about the corpus
Recordings The raw, initial recordings as made in the field (original recordings might be removed)
Corpus The corpus proper
Evaluation Any additional human evaluation done on the corpus

Except for the Documentation, each logical component of a corpus contains at least four logical subdivisions:

Media The recordings
Info The meta-data as a tab-separated key-value file (tsv)
Annotations Annotations are stored as Praat TextGrid files
Texts Any textual data connected to the recordings
Overlays Stand-off annotations, Corpus part only

The Corpus part contains an Overlays directory (Corpus/Overlays) which stores stand-off annotations (wiki.tei-c.org/index.php/Stand-off_markup). Each separate set of stand-off markup is stored in a separate sub-tree under Corpus/Overlays. Note that logically this is a subdirectory of the Corpus part, but by default it is stored next to the Corpus directory.

Note: Adding Recordings/Overlays or Evaluation/Overlays to the CorpusLayout.tsv file will generate these overlay directories automatically.

The Evaluation part also contains some extra parts:

Experiments Control files and content used to run evaluation experiments
Responses The actual response files from the evaluations

A central principle of NCB is that the files related to a certain media file, i.e., meta-data, annotations, and text files, are stored in parallel directory paths. That is, if there is a recording in Corpus/Media/my/path/to/a/recording.wav, then the following files correspond to each other:

Corpus/Media/my/path/to/a/recording.wav Recording
Corpus/Info/my/path/to/a/recording.tsv Meta-data of file
Corpus/Annotations/my/path/to/a/recording.TextGrid Annotations of file
Corpus/Texts/my/path/to/a/recording.txt Texts

In these paths, the leading parts, Corpus/Media, Corpus/Info, Corpus/Annotations, and Corpus/Texts, are logical names. The real names of these directories are taken from the CorpusLayout.tsv table. The part reading /my/path/to/a/recording. is identical for all four of the files.
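
The parallel-path principle can be expressed as a small Python sketch. It is an illustration under assumptions, not NCB code: the function name `parallel_files` is hypothetical, the file extensions are the ones used in the example paths above, and logical directory names are used directly (in a real corpus they would first be mapped through CorpusLayout.tsv).

```python
from pathlib import PurePosixPath

# Extensions as they appear in the example paths above
EXTENSIONS = {"Info": ".tsv", "Annotations": ".TextGrid", "Texts": ".txt"}


def parallel_files(media_file: str, component: str = "Corpus") -> dict[str, PurePosixPath]:
    """Return the Info, Annotations and Texts files that belong to a media file."""
    media_root = PurePosixPath(component) / "Media"
    # The shared part of the path, e.g. my/path/to/a/recording.wav
    relative = PurePosixPath(media_file).relative_to(media_root)
    return {
        part: (PurePosixPath(component) / part / relative).with_suffix(ext)
        for part, ext in EXTENSIONS.items()
    }


files = parallel_files("Corpus/Media/my/path/to/a/recording.wav")
print(files["Annotations"])  # Corpus/Annotations/my/path/to/a/recording.TextGrid
```

Because only the component directory and the extension change, any tool can find the meta-data, annotations, and texts of a recording from the media path alone.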

Default CorpusLayout.tsv


Key  Value  Description
Documentation  Documentation  The documentation of the corpus
Documentation/Speakers  Documentation/Speakers  The documentation about the speakers
Recordings/Media  Recordings/Media  Original audio recordings
Recordings/Annotations  Recordings/Annotations  Annotations of the original audio recordings
Recordings/Texts  Recordings/Texts  Original texts used for the recordings
Recordings/Info  Recordings/Info  Information and meta data on the recordings
Corpus/Media  Corpus/Media  Corpus content: Media files
Corpus/Annotations  Corpus/Annotations  Corpus Content: Annotations
Corpus/Texts  Corpus/Texts  Corpus Content: Texts
Corpus/Info  Corpus/Info  Corpus Content: Info and meta data
Corpus/Overlays  CorpusOverlays  Alternative Corpus Annotations
Evaluation/Media  Evaluation/Media  Stimuli used in evaluations: Media files
Evaluation/Annotations  Evaluation/Annotations  Stimuli used in evaluations: Annotations
Evaluation/Texts  Evaluation/Texts  Stimuli used in evaluations: Text files
Evaluation/Info  Evaluation/Info  Stimuli used in evaluations: Info and meta data
Evaluation/Experiments  Evaluation/Experiments  Control files used in experiments with the Stimuli
Evaluation/Responses  Evaluation/Responses  Responses to the Stimuli

NCB license

NKI-CorpusBuilder version 1.0

Netherlands Cancer Institute tool for Corpus Construction (NCB)

NCB is based on Praat (www.praat.org)

This application was made possible by an unrestricted research grant from: ATOS MEDICAL AB: P.O. BOX 183 SE-242 22 HÖRBY SWEDEN

This application is licensed under the GNU GPL version 2 or later (www.gnu.org/licenses/old-licenses/gpl-2.0.html)


The NKI-CorpusBuilder


Copyright © 2011 Netherlands Cancer Institute and R.J.J.H. van Son


Praat code Copyright © 1992-2011 Paul Boersma and David Weenink

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
