Sound: To TextGrid (speech activity)...

A command that creates a TextGrid for the selected Sound in which the non-speech intervals and the intervals with speech activity are marked. The discrimination between the two is based on a spectral flatness measure.

Speech activity detection, often referred to in the technical literature as voice activity detection, is a method to discriminate speech segments from the non-speech segments of a noisy input signal. According to Ma & Nishihara (2013), spectral flatness is a measure of the width, uniformity, and noisiness of the power spectrum. A high spectral flatness indicates that the spectrum has a similar amount of power in all spectral bands, so that the graph of the spectrum appears relatively flat and smooth; a low spectral flatness indicates that the spectral power is less uniform, which is more typical for speech-like sounds. In general, speech is a highly non-stationary signal, while background noise can be considered stationary over relatively longer periods of time.

Because the spectral flatness measure is completely independent of the overall intensity of the sound, we have added the possibility to also discriminate on intensity.
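
The two properties described above (a flat spectrum gives a high flatness, and the measure does not depend on overall intensity) can be checked with a small Python sketch. It is only an illustration of the ratio of geometric to arithmetic mean over the bands of a single power spectrum; the function name spectral_flatness and the example spectra are made up for this purpose and are not part of Praat.

```python
import math

def spectral_flatness(power):
    """Ratio of the geometric to the arithmetic mean of strictly positive power values."""
    n = len(power)
    log_gm = sum(math.log(p) for p in power) / n   # logarithm of the geometric mean
    am = sum(power) / n                            # arithmetic mean
    return math.exp(log_gm) / am

# Hypothetical 8-band power spectra, chosen only for illustration.
flat_spectrum = [1.0] * 8                                      # equal power in every band
peaky_spectrum = [100.0, 1.0, 1.0, 1.0, 50.0, 1.0, 1.0, 1.0]   # power concentrated in two bands
louder_spectrum = [10.0 * p for p in peaky_spectrum]           # same shape, ten times the power

print(spectral_flatness(flat_spectrum))    # the maximum possible value, 1
print(spectral_flatness(peaky_spectrum))   # much smaller than 1
print(spectral_flatness(louder_spectrum))  # same as the previous line, up to rounding
```

Scaling all powers by a constant scales both means by that constant, so the ratio is unchanged; this is why a separate intensity criterion is needed to reject very soft intervals.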

Settings

Time step (s): determines the time interval between consecutive measurements of the spectral flatness measure.
Long term window (s): determines the window duration for the calculation of the long term spectral flatness measure. According to Ma & Nishihara (2013) a value of approximately 0.3 s performed best on average for a number of different noise conditions.
Short term window (s): determines the interval for averaging spectral estimates. According to Ma & Nishihara (2013) a value of approximately 0.1 s performed best on average for a number of different noise conditions.
Frequency range (Hz): determines the frequency range used in the calculation of the spectral flatness measure. Ma & Nishihara (2013) used a range from 400 to 4000 Hz. Because fricatives tend to have strong components above 4000 Hz, we increased the default upper value to 6000 Hz. In this way the intensity of a fricative, which is calculated from this range, becomes higher, so that the fricative is less likely to be rejected by the non-speech threshold. We also decreased the lower value from 400 to 70 Hz. This increases the chance that sounds with mainly low-frequency components, like nasals, are detected at the start or end of an utterance.
Flatness threshold: determines whether a frame is considered speech or not, based on a spectral flatness measure. Values of the flatness below the threshold are considered speech.
Non-speech threshold (dB): also determines whether a frame is considered speech or not, but based on intensity. Intervals with an intensity smaller than this value below the sound's maximum intensity value are considered as non-speech intervals. The intensity is calculated from the frequency range defined above.
Minimum non-speech interval duration (s): determines the minimum duration for an interval to be considered as non-speech. If you don't want the closure for a plosive to count as non-speech then use a large enough value.
Minimum speech interval (s): determines the minimum duration for an interval to be considered as speech. This offers the possibility to filter out short, intense bursts.
Speech / Non-speech interval label: determine the labels for the corresponding intervals in the newly created TextGrid.
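
How the thresholds and minimum durations in the list above could interact is sketched below in Python. This is not Praat's implementation. It assumes that a frame counts as speech only if it passes both the flatness and the intensity test, that the non-speech threshold is a negative offset in dB relative to the maximum intensity, and that short non-speech runs are absorbed before short speech runs. All function names, threshold values, and data are hypothetical.

```python
from itertools import groupby

def label_frames(flatness, intensity_db, flatness_threshold, nonspeech_threshold_db):
    """True (speech) if flatness is below its threshold and intensity is high enough."""
    # Assumption: the non-speech threshold is a negative offset from the maximum intensity.
    floor_db = max(intensity_db) + nonspeech_threshold_db
    return [f < flatness_threshold and i >= floor_db
            for f, i in zip(flatness, intensity_db)]

def absorb_short_runs(labels, target, min_frames):
    """Flip every run of `target` labels that is shorter than min_frames."""
    out = []
    for value, run in groupby(labels):
        n = len(list(run))
        if value == target and n < min_frames:
            value = not target
        out.extend([value] * n)
    return out

def to_intervals(labels, time_step):
    """Collapse frame labels into (start time, end time, is_speech) tuples."""
    intervals, start = [], 0
    for value, run in groupby(labels):
        n = len(list(run))
        intervals.append((start * time_step, (start + n) * time_step, value))
        start += n
    return intervals

# Made-up measurements for 16 frames.
flatness = [-1, -1, -1, -15, -15, -15, -1, -15, -15, -1, -1, -1, -15, -1, -15, -1]
intensity_db = [40, 40, 40, 70, 70, 70, 60, 70, 70, 40, 40, 40, 65, 40, 30, 40]

raw = label_frames(flatness, intensity_db, flatness_threshold=-10, nonspeech_threshold_db=-35)
step1 = absorb_short_runs(raw, target=False, min_frames=2)    # fill short non-speech gaps
final = absorb_short_runs(step1, target=True, min_frames=2)   # drop short speech bursts
intervals = to_intervals(final, time_step=0.01)
for start, end, is_speech in intervals:
    print(round(start, 2), round(end, 2), "speech" if is_speech else "non-speech")
```

In this example the one-frame gap inside the speech stretch is filled, the isolated one-frame burst is discarded, and the frame that passes the flatness test but is too soft never counts as speech, so three intervals remain.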

Algorithm

The speech activity algorithm is described in Ma & Nishihara (2013).

The logarithm of the spectral flatness at frame m is defined as:

L(m) = Σ_k log (GM(m, f_k) / AM(m, f_k)),

where GM(m, f_k) and AM(m, f_k) are the geometric and arithmetic means for spectrum component f_k, respectively. The geometric mean GM(m, f_k) is defined as

GM(m, f_k) = {Π^m_n=m-R+1 S(n, f_k)}^(1/R)

where the number of frames R is determined by the setting of the long term window parameter. AM(m, f_k) is defined as

AM(m, f_k) = {Σ^m_n=m-R+1 S(n, f_k)} / R

The short term window comes into play in the definition of the S(n, f_k), because this is itself the average of M local spectral frames

S(n, f_k) = {Σ^n_p=n-M+1 |X(p, f_k)|²} / M,

where the number of frames M is determined by the setting of the short term window length.

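The ratio between the geometric and arithmetic mean is always smaller than or equal to one, so each term of the sum, and therefore L(m) itself, is never larger than zero. The ratio equals one only when all the averaged numbers are equal, which in the definitions above means that S(n, f_k) is the same in all R frames of the long term window.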