how to choose a pitch analysis method

Pitch analysis is the determination of a pitch curve from a given sound.

In Praat you can choose from at least four methods for pitch analysis:

• pitch analysis by filtered autocorrelation
• pitch analysis by raw cross-correlation
• pitch analysis by raw autocorrelation
• pitch analysis by filtered cross-correlation

These methods have been listed here in order of usefulness for the average speech researcher: for measuring vocal-fold vibration frequency or intonation, you are advised to use filtered autocorrelation; for voice analysis (as practised, for instance, in the area of voice pathology), you are advised to use raw cross-correlation; and for measuring pure periodicity, you are advised to use raw autocorrelation.

Measuring intonation

From 1993 to 2023, Praat’s preferred method for measuring intonation (and for most use cases involving vocal-fold vibration) was raw autocorrelation, and this is still available if you want to measure raw periodicity (see below). From 2023 on, Praat’s preferred method for measuring intonation (and for most use cases involving vocal-fold vibration) has been filtered autocorrelation.

All these methods measure pitch in terms of the self-similarity of the waveform: if the waveform is almost identical when you shift it by 10 milliseconds in time, then 100 Hz will be a good candidate for the F0. This idea has been used in Praat from 1993 on, in the "autocorrelation" and "cross-correlation" methods, which are nowadays called "raw autocorrelation" and "raw cross-correlation". Both methods measure self-similarity as a number between -1.0 and +1.0. From 2023 on, we also have "filtered autocorrelation" and "filtered cross-correlation", which low-pass filter the waveform (with a Gaussian filter) before doing the autocorrelation or cross-correlation.
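
This self-similarity idea can be illustrated with a small script. The sketch below uses the general-purpose Autocorrelate and Get time of maximum commands for Sound objects, only to show that a signal with a period of 10 ms has its strongest self-similarity at a lag of 10 ms; it is not how the pitch analysis itself works internally (that works on short windowed frames), so take it as an illustration of the idea only.

# A short signal with a period of 10 ms (components at 100 and 300 Hz):
Create Sound from formula: "demo", 1, 0, 0.1, 44100, "0.3*sin(2*pi*100*x) + 0.1*sin(2*pi*300*x)"
# Its autocorrelation, with values scaled to lie between -1.0 and +1.0:
Autocorrelate: "normalize", "zero"
# The strongest self-similarity (ignoring the trivial peak at lag 0) lies near 10 ms,
# which corresponds to an F0 candidate of about 100 Hz:
bestLag = Get time of maximum: 0.002, 0.05, "sinc70"
writeInfoLine: "Best lag: ", fixed$ (bestLag, 4), " s, i.e. about ", fixed$ (1 / bestLag, 0), " Hz"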

The filtering comes with several benefits:

• Filtering leads to fewer unwanted octave drops. An unwanted octave drop can occur, for instance, if you set the pitch floor to 75 Hz (the standard value for raw AC) but the F0 is 200 Hz and there is background noise. The waveshape will be self-similar over time shifts of 5 ms, but then necessarily also over time shifts of 10 ms and 15 ms, and so on. Background noise can sometimes randomly make the degree of self-similarity higher over 10 ms than over 5 ms, leading to an octave drop in the measured pitch, from 200 to 100 Hz. In raw AC and raw CC, this problem is alleviated a bit by applying an "octave cost", whose standard setting is 0.01 per octave (on a self-similarity scale from 0.0 to 1.0). However, low-pass filtering reduces the problem more strongly, by taking away the higher-frequency parts of the background noise.
• Filtering leads to fewer unwanted octave rises. If you set the pitch ceiling to 600 Hz (the standard value), and the F0 is 150 Hz, and the F1 is 450 Hz, then the harmonic relation between F0 and F1 (F1 is exactly three times F0) will lead to a very strong F1 in the waveform, sometimes causing the pitch to be analysed as 450 instead of 150 Hz. This problem is exacerbated by the existence of the “octave cost”, which favours higher pitch candidates. Low-pass filtering reduces this problem, because the 450-Hz component of the waveform is weakened with respect to the 150-Hz component by a factor of 0.373 (if the pitch top is 800 Hz, and the “attenuation at top” is set to 0.03; this factor is worked out in the sketch below this list). This effect is so strong that the standard “octave cost” can be much higher for filtered pitch analysis (the standard value is 0.055 per octave) than for raw pitch analysis (where the standard value is 0.01 per octave): the formant problem is still heavily reduced, while the higher octave cost also reduces the unwanted octave drops, by making the analysis more resistant to the lower-frequency parts of the background noise.
• Filtering leads to better tracking of formant transitions. Fast formant movements in transitions between consonants and vowels cause waveforms to change their shape strongly over time. This reduces the self-similarity of the waveform, leading the “raw AC” method to judge some of these transitions as “unvoiced”. With low-pass filtering, the transitions become more sine-like, which leads to “filtered AC” being more likely to judge these transitions as “voiced”.
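
The factor of 0.373 mentioned above can be checked with a small computation. The only assumption in the sketch below is what was just described: the filter is Gaussian in the frequency domain and has the value “attenuation at top” (0.03) at the pitch top (800 Hz), so that the weakening of a 450-Hz component relative to a 150-Hz component equals 0.03 to the power (450^2 - 150^2) / 800^2.

# Gaussian low-pass filter: attenuation at frequency f is attenuationAtTop ^ ((f / pitchTop) ^ 2).
pitchTop = 800
attenuationAtTop = 0.03
# Weakening of the 450-Hz component relative to the 150-Hz component:
relativeWeakening = attenuationAtTop ^ ((450 ^ 2 - 150 ^ 2) / pitchTop ^ 2)
# This prints 0.373:
writeInfoLine: fixed$ (relativeWeakening, 3)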

On a database containing EGG recordings as a gold standard (AtakeDB), the incidence of unwanted big frequency drops falls from 0.54 (of voiced frames) for raw AC to 0.15 for filtered AC, and the incidence of unwanted big frequency rises falls from 0.25 to 0.10. Together, these numbers indicate that the incidence of gross frequency errors is lower for filtered AC than for raw AC by a factor of more than 3 (0.79 versus 0.25).

On the other hand, low-pass filtering also produces potentially unwanted side effects:

• The reduction of higher-frequency components causes voicing decisions to be based on smoother curves. As this influences especially the loud parts (the vowels), this reduction may call for a higher optimal “silence threshold” setting. Indeed, in testing the algorithm on AtakeDB, the best silence threshold for filtered AC seems to be 0.18, whereas for raw AC it seems to be 0.06. To account for recordings with a larger dynamic range than AtakeDB, the current standard silence threshold is half of those values: 0.09 for filtered AC and 0.03 for raw AC. This can make it more difficult to measure plosive voicing (during prevoiced [b] or [d]); if you want to measure that, you are advised to lower the silence threshold from the standard 0.09 to e.g. 0.01 (this also holds, to a lesser extent, for raw AC, where the standard value is 0.03 but one should probably use 0.01 for prevoicing).
• More generally, voicing decisions seem to be less predictable for filtered AC than for raw AC, i.e. it is possible that the incidence of unvoiced-to-voiced errors and voiced-to-unvoiced errors is greater. This remains to be established in practice. In the future, voicing decisions may have to be relegated to a trained neural network rather than to the current simple “costs” and “thresholds”.

All in all, it seems likely that filtered AC works better in practice than raw AC (your feedback is invited, though), which is why filtered AC is the preferred pitch analysis method for vocal-fold vibration and intonation, as is shown in the Periodicity menu in the Objects window, in the Pitch menu of the Sound window, and in many places in this manual. While the standard pitch range for raw AC runs from 75 to 600 Hz, filtered AC has less trouble discarding very low or very high pitch candidates, so the standard pitch range for filtered AC is wider, running from 50 Hz to 800 Hz, and may not have to be adapted to the speaker’s gender.
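
In a script, such an analysis could look as follows. This sketch assumes the scripting form of the “To Pitch (filtered autocorrelation)...” command, with the arguments in the order time step, pitch floor, pitch top, maximum number of candidates, “very accurate”, attenuation at top, silence threshold, voicing threshold, octave cost, octave-jump cost, and voiced/unvoiced cost; since this order (and some of the default values used below) may differ between Praat versions, it is safest to paste the command from your own history or from the command’s settings window rather than to copy it from this page.

# A 220-Hz test tone; in practice you would read a recording from a file instead.
sound = Create Sound as pure tone: "tone", 1, 0, 0.4, 44100, 220, 0.2, 0.01, 0.01
# Filtered autocorrelation with the standard settings mentioned on this page:
# pitch floor 50 Hz, pitch top 800 Hz, attenuation at top 0.03,
# silence threshold 0.09, octave cost 0.055 (the remaining values are assumed defaults).
pitch = To Pitch (filtered autocorrelation): 0.0, 50.0, 800.0, 15, "no",
... 0.03, 0.09, 0.50, 0.055, 0.35, 0.14
meanF0 = Get mean: 0, 0, "Hertz"
writeInfoLine: "Mean F0: ", fixed$ (meanF0, 1), " Hz"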

Voice analysis

For voice analysis, we probably shouldn’t filter away noise, because perturbations and noise are part of what voice analysis measures; so raw cross-correlation is the preferred method.
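
A typical voice-analysis script combines such a Pitch object with a PointProcess and a voice report. The sketch below uses the long-standing “To Pitch (cc)” scripting form (the older scripting name for raw cross-correlation), because its argument order is well established; the test tone only keeps the sketch self-contained, and in practice you would read a recording from a file instead.

# A 150-Hz test tone, standing in for a real recording:
sound = Create Sound as pure tone: "voice", 1, 0, 0.5, 44100, 150, 0.2, 0.01, 0.01
# Raw cross-correlation pitch analysis (time step, pitch floor, max. candidates, very accurate,
# silence threshold, voicing threshold, octave cost, octave-jump cost, voiced/unvoiced cost, ceiling):
pitch = To Pitch (cc): 0.0, 75, 15, "no", 0.03, 0.45, 0.01, 0.35, 0.14, 600
# Glottal pulses, then the standard voice report:
selectObject: sound, pitch
pulses = To PointProcess (cc)
selectObject: sound, pitch, pulses
voiceReport$ = Voice report: 0, 0, 75, 600, 1.3, 1.6, 0.03, 0.45
writeInfoLine: voiceReport$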

Measuring prevoicing

The high standard “silence threshold” of 0.09 for filtered AC (or, to a lesser extent, 0.03 for raw AC) probably leads to trouble measuring voicing during the closure phase of prevoiced plosives such as [b] or [d]. You are still advised to use filtered AC, but to lower the silence threshold to 0.01 or so.
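
In a script, this amounts to changing only the silence-threshold argument of the filtered-AC call shown in the intonation section above (with the same caveat: the argument order is an assumption, so check it against your own Praat version):

# With a Sound selected: filtered AC with the silence threshold lowered from 0.09 to 0.01,
# so that weak prevoicing is less likely to be discarded as silence.
pitch = To Pitch (filtered autocorrelation): 0.0, 50.0, 800.0, 15, "no",
... 0.03, 0.01, 0.50, 0.055, 0.35, 0.14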

Raw periodicity

Mathematically generated periodic signals aren’t necessarily speechlike. For instance:

# The left part: a plain 200-Hz sine wave.
sine = Create Sound from formula: "sine", 1, 0, 0.1, 44100, "0.4*sin(2*pi*200*x)"
# The right part: also periodic at 200 Hz (period 5 ms), but with almost all of its energy
# at 800 and 1000 Hz, and only tiny components at 400 and 600 Hz.
tricky = Create Sound from formula: "tricky", 1, 0, 0.1, 44100, "0.001*sin(2*pi*400*x)
... + 0.001*sin(2*pi*600*x) - 0.2*cos(2*pi*800*x+1.5) - 0.2*cos(2*pi*1000*x+1.5)"
# Chain the two parts together and draw the region around the transition at 0.1 seconds.
selectObject: sine, tricky
Concatenate
Erase all
Draw: 0.08, 0.12, 0, 0, "yes", "curve"

Both the left part of this sound and the right part have a period of 5 ms, and therefore an F0 of 200 Hz. The raw autocorrelation method measures an equally strong pitch of 200 Hz throughout this signal, while filtered autocorrelation considers the right part voiceless. This is because the right part contains strong components only at 800 and 1000 Hz, which are largely removed by the low-pass filter, and only very small components at 400 and 600 Hz (and nothing at 200 Hz), so that hardly any signal survives the filtering. This problem of the missing fundamental was the reason why low-pass filtering was not included in the autocorrelation method in 1993. However, this situation is very rare in speech, so for speech we nowadays recommend filtered autocorrelation, and we recommend raw autocorrelation only if you want to measure raw periodicity.
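
You can check the raw-autocorrelation behaviour by continuing the script above: the Concatenate command leaves the chained Sound selected. The sketch below uses the long-standing “To Pitch (ac)” scripting form with its standard settings; it should report approximately 200 Hz in both halves, whereas the filtered method (with its own standard settings) reports the right half as voiceless.

# The chained Sound created by Concatenate above is still selected.
pitchRaw = To Pitch (ac): 0.0, 75, 15, "no", 0.03, 0.45, 0.01, 0.35, 0.14, 600
# One query in the left (sine) half and one in the right (tricky) half:
left = Get value at time: 0.05, "Hertz", "linear"
right = Get value at time: 0.15, "Hertz", "linear"
writeInfoLine: "Left part: ", fixed$ (left, 0), " Hz; right part: ", fixed$ (right, 0), " Hz"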

© Paul Boersma 2023,2024