File identification codes

This page describes the naming convention for speech fragments and files. All fragment codes have the form "F 28 G 1 FT 1 A ...". A code identifies the speaker (sex, age, and ID), the recording (1 or 2), the speaking style or text type, and the item within the recording (slide number, sentence, word, etc.), down to the individual phoneme.

The coding scheme

The coding scheme was designed to give every part of every recording a unique and descriptive identifier. Most importantly, it must allow descriptions to be inserted and deleted with minimal effect on other parts. Furthermore, the codes should be 'decodable' by humans so that they can verify their validity: using a simple list, the item that belongs to a code should be recognisable. These aims were reached with a hierarchical code that alternates alphabetical and numerical fields. From any code, it is easy to obtain the code of the item that contains it (e.g., the sentence that contains a word, or the word that contains a phoneme), and an item can be removed or inserted without changing other codes. However, the hierarchical nature of the code (which makes it understandable) means that reassigning an item (e.g., moving a phoneme from one syllable to another) becomes a deletion plus an insertion if the codes are to remain legible (which is not strictly necessary). Even in these cases, changes have only a very localized effect.
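The alternation of alphabetical and numerical fields can be illustrated with a minimal Python sketch. The function names `split_code` and `parent_code` are hypothetical, and the sketch assumes plain codes only (no '-', '+', ',' or '.' special markers and no '_description' suffix). Each run of letters or digits is treated as one field, so a syllable letter and a position letter that stand together (e.g., 'SC') count as a single field here.

```python
import re

# Assumption: plain codes only, made of upper-case letters and digits.
FIELD = re.compile(r"[A-Z]+|[0-9]+")

def split_code(code):
    """Split a code into its alternating alphabetic and numeric fields."""
    fields = FIELD.findall(code)
    if "".join(fields) != code:
        raise ValueError(f"unexpected characters in code: {code!r}")
    return fields

def parent_code(code):
    """Return the code of the containing item by dropping the last field."""
    return "".join(split_code(code)[:-1])

print(split_code("F28G1FT1A"))    # ['F', '28', 'G', '1', 'FT', '1', 'A']
print(parent_code("F28G1FT1A3"))  # F28G1FT1A
```

Because a parent is obtained by truncation alone, inserting or deleting an item does not require renumbering anything outside that item.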

Speech Style Identification Codes

Fixed content

FR: Fixed text Retold
FT: Fixed Text read aloud
FS: Fixed Sentences
PS: Pseudo (unpredictable) Sentences
FW: Fixed Word list
FY: Fixed sYllable list
FPA: Fixed Pronunciation list A
FPB: Fixed Pronunciation list B
FPA1ABC: AlphaBet
FPA1NUM: NUMericals
FPA1VOW: VOWels
FPA1HVD: Vowels in /hVd/ context
FPA1VCV: Consonants in Vowel context

Variable content

VI: Variable Informal story
VR: Variable story Retold
VT: Variable Text read aloud
VS: Variable Sentences
VW: Variable Word list
VY: Variable sYllable list

Other Sounds

G: Gauge signal ([12], Tone/Noise, e.g., G1N or G2T)

File Identification Code

Identify a speech segment by speaker, session, task, slide, and position

(The fields below are given as regular expressions; parse codes with [A-Z\ -\ ,]+ and [0-9\ -\ ,]+. Notation: [...] = exactly 1, [...]+ = one or more, [...]? = 0 or 1)

Sex [MF] : Male / Female (Boy/Girl?) This is a descriptive mnemonic. It indicates fundamental characteristics of the speaker and can be extended, e.g., 'LF', 'DM', or 'XF' could indicate 'Laryngectomized Female', 'Deaf Male', or 'Sex Change Female' (transsexual Male -> Female) respectively. In multilingual databases it should be preceded by a language code, e.g., 'DUF' for a Dutch speaking Female. Note: This can change between recordings; it is not speaker specific. The general expression is '[A-Z]+'.
Age [0-9]+ : in years. Use a '-' when indicating months, e.g., 1-4 means 1 year and 4 months. This can change between recordings. It is NOT intended to identify a specific speaker.
Subject [A-Z]+ : just an identification code. Add a unique corpus code (e.g., 'IFA') in front of the speaker ID code (e.g., 'IFAN' or 'IFA-N') for data exchange. This should identify a particular speaker.
Recording [12] : the number of the recording session (1 or 2, more if needed)

Speech Identification Code (see relevant codes)

Chunk/Slide [0-9]+ : the number of the chunk of speech based on the slide number or some paragraph number
Sentence [A-Z]+ : just the position of the sentence in the chunk, 'A', 'B', 'C', etc. Use 'A' if there is no sentence (as in word lists); use 'AA' ... 'ZZ' etc. if needed
Word [0-9]+ : the number of the word in the sentence (or the chunk or list)
Syllable [S-Z]+: the position of the syllable in the word, starting from S.
Position [OKCA]?: The position of a phoneme or cluster in the syllable, Onset, Kernel, Coda, Ambisyllabic (the last is optional).
Phoneme [0-9]+ : the number of the phoneme in the syllable
Channel _[fh]m : The recording microphone channel: fixed ('_fm'), head-mounted ('_hm'), or both ('_bm', but the latter is optional). It is appended after the name. This affix should be appended to every derived file.

Special codes

Inserted_speech [\ -]+ : False starts, corrections, errors are all counted "in situ", but the "count" is preceded by a -, a repeated error etc. is preceded by a double --, higher counts receive their ordinal number between the --. For example, the third restart of the sixth sentence is coded as '-C-F', for words it would be '-3-6' (note that the repeat count is coded in the same way as the major count, alphabetic when alphabetic, numeric when numeric). WARNING: the order of these counts is not necessarily chronological.
Collections [\ +]+ : If some items are combined, use {start}++{end} . For instance, if words 3 to 6 of sentence F28G1VI3C are stored together, use F28G1VI3C3++6. If the first two sentences are combined, use F28G1VI3A++B. To indicate everything to the end of the item, leave out the second ({end}) indicator, e.g., F28G1VI3C2T++ indicates all syllables after the first (S).

Use a single + to combine non-identical multi-channel recordings, e.g., shadowing and multi-logues. For instance, if speaker F27B shadows speaker F28G, we get F27B3SH1+F28G1FY for F27B recording 3 SHadowing session 1 of the recording F28G1FY.
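As an illustration of the '++' collection notation, the following Python sketch expands a collection code into the codes of its first and last items. The function name `expand_collection` is hypothetical, and the sketch assumes that {end} replaces only the last field of {start} and that the code carries no '_description' suffix; this reading is inferred from the examples above and is not a definitive rule.

```python
import re

def expand_collection(code):
    """Return (first_item, last_item) for a '{start}++{end}' code.

    last_item is None when {end} is left out (everything to the end).
    Assumption: {end} replaces only the last field of {start}.
    """
    start, sep, end = code.partition("++")
    if not sep:
        raise ValueError(f"not a collection code: {code!r}")
    if not end:
        return start, None
    # Strip the final alphabetic or numeric field of the start code.
    prefix = re.sub(r"(?:[A-Z]+|[0-9]+)$", "", start)
    return start, prefix + end

print(expand_collection("F28G1VI3C3++6"))  # ('F28G1VI3C3', 'F28G1VI3C6')
print(expand_collection("F28G1VI3A++B"))   # ('F28G1VI3A', 'F28G1VI3B')
print(expand_collection("F28G1VI3C2T++"))  # ('F28G1VI3C2T', None)
```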

Subdivisions [\ .\ ,] : If a major division must be broken down, e.g., sentences into phrases, a period or comma is inserted. For example, the second phrase of the sixth sentence becomes 'F,B' or 'F.B'. Commas are preferred, but not all systems allow them. In syllable parts, these divisions are optional (e.g., 'TC' instead of 'T,C').

Content descriptions of the segment

_Description : A separator and textual description of the underlying sound, e.g., "_fm" for the Fixed Microphone channel of an audio file, "_LB" for the author (labeler) of a Label file. You can use whatever description you want, even pure content descriptions like "_zong" for the word or _4E3SC3 for the corresponding segment in the reference text. You can add as many descriptions as you like, e.g., _fm_WF_zong_4E3SC3.
Note: The description gives information on the context and contents of the sound segment and its relation to the other recordings. The descriptions are optional. However, I suggest the following minimum set: Channel for the recordings and author for any type of annotation.

Example:


M56H1FT3A4SC2_fm_SM

Male, 56 years of age, Subject H, Recording session 1, Fixed Text read aloud, slide 3, sentence A, Word 4, Syllable S (i.e., first syllable), Coda, Phoneme 2, from the fixed microphone channel (_fm), by author SM.
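The decoding of this example can be sketched in Python with one regular expression built from the field definitions above. The names `CODE` and `decode` are hypothetical. The sketch assumes a purely alphabetic style code (so codes such as FPA1ABC or G1N are not handled) and no special markers ('-', '+', ',' or '.') other than the month separator in the age.

```python
import re

# Assumption: one regular expression assembled from the field definitions;
# levels below the speech style are optional and nested.
CODE = re.compile(
    r"(?P<sex>[A-Z]+)"
    r"(?P<age>[0-9]+(?:-[0-9]+)?)"      # e.g. '56', or '1-4' for 1 year 4 months
    r"(?P<subject>[A-Z]+)"
    r"(?P<recording>[0-9]+)"
    r"(?P<style>[A-Z]+)"
    r"(?:(?P<chunk>[0-9]+)"
    r"(?:(?P<sentence>[A-Z]+)"
    r"(?:(?P<word>[0-9]+)"
    r"(?:(?P<syllable>[S-Z]+)(?P<position>[OKCA]?)"
    r"(?P<phoneme>[0-9]+)?"
    r")?)?)?)?"
    r"(?P<descriptions>(?:_[^_]+)*)"    # e.g. '_fm_SM'
)

def decode(code):
    """Return a dict of fields; absent fields are None."""
    match = CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"cannot decode: {code!r}")
    fields = {name: (value or None) for name, value in match.groupdict().items()}
    # Turn '_fm_SM' into ['fm', 'SM'] (an empty list if there is no suffix).
    fields["descriptions"] = match.group("descriptions").split("_")[1:]
    return fields

info = decode("M56H1FT3A4SC2_fm_SM")
print(info["sex"], info["age"], info["subject"], info["recording"], info["style"])
# M 56 H 1 FT
print(info["chunk"], info["sentence"], info["word"],
      info["syllable"], info["position"], info["phoneme"])
# 3 A 4 S C 2
print(info["descriptions"])
# ['fm', 'SM']
```

The syllable letters (S to Z) and the position letters (O, K, C, A) do not overlap, which is what lets the two adjacent alphabetic fields be separated without a delimiter.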

© Rob van Son, October 16th, 2001