Sound: To TextGrid (voice activity)...

A command that creates a TextGrid for the selected Sound in which the silent intervals and the intervals with voice activity are marked. The discrimination between the two is based on a spectral flatness measure.

Voice activity detection (VAD) is a method for separating speech segments from the background in a noisy speech signal. According to Ma & Nishihara (2013), spectral flatness is a measure of the width, uniformity, and noisiness of the power spectrum. A high spectral flatness indicates that the spectrum has a similar amount of power in all spectral bands, so the graph of the spectrum would appear relatively flat and smooth; a low spectral flatness indicates that the spectral power is less uniform, which is more typical of speech-like sounds. In general, speech is a highly non-stationary signal, while background noise can be considered stationary over relatively longer periods of time.
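
As an illustration of the flatness idea, the following minimal sketch computes spectral flatness in its common textbook form, the ratio of the geometric mean to the arithmetic mean of the power values. This is an assumption made for illustration: the long-term measure of Ma & Nishihara (2013) used by this command is not necessarily computed exactly this way, and the function name and the example spectra are hypothetical.

```python
import math

def spectral_flatness(power):
    """Geometric mean divided by arithmetic mean of positive power values.

    Illustrative textbook definition; not necessarily the exact measure
    used by the command described on this page.
    """
    if not power or any(p <= 0.0 for p in power):
        raise ValueError("power values must be positive")
    n = len(power)
    # Compute the geometric mean in the log domain to avoid overflow.
    log_geometric_mean = sum(math.log(p) for p in power) / n
    arithmetic_mean = sum(power) / n
    return math.exp(log_geometric_mean) / arithmetic_mean

# Hypothetical four-band spectra.
flat = spectral_flatness([1.0, 1.0, 1.0, 1.0])     # equal power in all bands
peaky = spectral_flatness([100.0, 1.0, 1.0, 1.0])  # power concentrated in one band
```

With this definition the uniform spectrum yields a flatness of 1, the maximum possible value, while the spectrum dominated by a single band yields a much lower value, as would be expected for speech-like sounds.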

Because the spectral flatness measure is completely independent of the overall intensity of the sound, we have added the possibility to also discriminate on intensity.
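
The independence of intensity can be made explicit with a short derivation. It assumes, for illustration only, the common definition of spectral flatness as the ratio of the geometric to the arithmetic mean of the power values P_1 ... P_N; the symbols SFM, P_k, N and c are introduced here and do not come from the original text. Scaling all power values by a positive constant c scales both means by c, so the ratio does not change:

```latex
\mathrm{SFM}(P) = \frac{\left(\prod_{k=1}^{N} P_k\right)^{1/N}}{\frac{1}{N}\sum_{k=1}^{N} P_k},
\qquad
\mathrm{SFM}(cP) = \frac{c\left(\prod_{k=1}^{N} P_k\right)^{1/N}}{c\,\frac{1}{N}\sum_{k=1}^{N} P_k} = \mathrm{SFM}(P),
\quad c > 0.
```

A very soft noise and a very loud noise with the same spectral shape therefore get the same flatness value, which is why an additional intensity criterion is useful.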

Settings

Time step (s)
determines the time interval between consecutive measurements of the spectral flatness measure.
Long term window (s)
determines the window duration for the calculation of the long term spectral flatness measure. According to Ma & Nishihara (2013) a value of approximately 0.3 s performed best on average for a number of different noise conditions.
Short term window (s)
determines the interval for averaging spectral estimates. According to Ma & Nishihara (2013) a value of approximately 0.1 s performed best on average for a number of different noise conditions.
Frequency range (Hz)
determines the frequency range used in the calculation of the spectral flatness measure. Ma & Nishihara (2013) used a range from 400 to 4000 Hz. Because fricatives tend to have strong components above 4000 Hz, we increased the default upper limit to 6000 Hz. In this way the fricative's intensity, which is calculated from this range, becomes higher, and a fricative is therefore less likely to be skipped by a selection on the silence threshold. We also decreased the lower limit from 400 to 70 Hz. This increases the chance that sounds at start or end positions with mainly low-frequency components, like nasals, are detected.
Flatness threshold
determines whether a frame is considered sounding or not based on a spectral flatness measure. Values of the flatness below the threshold are considered sounding.
Silence threshold (dB)
also determines whether a frame is considered sounding or not, but based on intensity. Intervals with an intensity lower than this value, measured relative to the sound's maximum intensity, are considered silent intervals. The intensity is calculated from the frequency range defined above.
Minimum silence interval duration (s)
determines the minimum duration for an interval to be considered as silent. If you don't want the closure for a plosive to count as silent then use a large enough value.
Minimum sounding interval (s)
determines the minimum duration for an interval not to be considered silent. This offers the possibility to filter out intense bursts of relatively short duration.
Silent / Sounding interval label
determine the labels for the corresponding intervals in the newly created TextGrid.
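
How the thresholds and minimum durations above could interact is sketched below. This is a simplified illustration and not the actual implementation: it assumes that a frame is sounding only if it passes both the flatness and the intensity criterion, that the silence threshold is a negative number of dB relative to the maximum intensity, and that short silent intervals are absorbed before short sounding intervals. All function names, the frame-based durations and the example values are hypothetical.

```python
def classify_frames(flatness, intensity_db, flatness_threshold, silence_threshold_db):
    """Per-frame decision: True means sounding, False means silent."""
    # Assumption: silence_threshold_db is negative, relative to the maximum.
    floor_db = max(intensity_db) + silence_threshold_db
    return [f < flatness_threshold and i >= floor_db
            for f, i in zip(flatness, intensity_db)]

def to_runs(flags):
    """Collapse frame flags into runs [start_frame, end_frame_exclusive, flag]."""
    runs = []
    for k, flag in enumerate(flags):
        if runs and runs[-1][2] == flag:
            runs[-1][1] = k + 1
        else:
            runs.append([k, k + 1, flag])
    return runs

def absorb_short(runs, kind, min_frames):
    """Relabel runs of the given kind shorter than min_frames and merge neighbours."""
    out = []
    for start, end, flag in runs:
        if flag == kind and end - start < min_frames:
            flag = not flag
        if out and out[-1][2] == flag:
            out[-1][1] = end
        else:
            out.append([start, end, flag])
    return out

# Hypothetical measurements for ten frames.
flatness = [-2, -2, -15, -15, -15, -3, -15, -15, -2, -15]
intensity_db = [30, 32, 70, 72, 71, 40, 69, 70, 31, 33]

flags = classify_frames(flatness, intensity_db,
                        flatness_threshold=-10, silence_threshold_db=-35)
runs = to_runs(flags)
runs = absorb_short(runs, kind=False, min_frames=2)  # minimum silent duration
runs = absorb_short(runs, kind=True, min_frames=2)   # minimum sounding duration
intervals = [(start, end, "sounding" if flag else "silent")
             for start, end, flag in runs]
```

In this example the single silent frame in the middle of the sounding stretch is shorter than the minimum silent duration and is therefore absorbed into the surrounding sounding interval. The last frame has a low flatness but also a low intensity, so it stays silent.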

Algorithm

The VAD algorithm is described in Ma & Nishihara (2013).


© djmw, March 17, 2021