Sound: To Sound (blind source separation)...

Analyze the selected multi-channel sound into its independent components by an iterative method.

The blind source separation method to find the independent components tries to simultaneously diagonalize a number of CrossCorrelationTables that are calculated from the multi-channel sound at different lag times.

Settings

Time range (s)
defines the time range over which the CrossCorrelationTables of the sound will be calculated.
Number of cross-correlations
defines the number of CrossCorrelationTables to be calculated.
Lag times
defines the lag time τ0 for the CrossCorrelationTables. These tables are calculated at lag times τk = (k − 1)·τ0, where k runs from 1 to numberOfCrossCorrelations; the first table is therefore calculated at lag 0. For example, with τ0 = 0.0002 s and 20 cross-correlations, the lag times run from 0 to 0.0038 s.
Maximum number of iterations
defines a stopping criterion for the iteration. The iteration stops when this number of iterations has been reached.
Tolerance
defines another stopping criterion that depends on the method used.
Diagonalization method
defines the method to determine the independent components.
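Put together, a minimal call looks as follows. The argument order follows the settings above, and the values are those of the example at the bottom of this page; mixed stands for any selected multi-channel Sound:

    selectObject: mixed
    To Sound (blind source separation): 0.1, 1, 20, 0.0002, 100, 0.001, "ffdiag"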

Algorithm

This method tries to decompose the sound according to the instantaneous mixing model

Y=A·X.

In this model Y is a matrix with the selected multi-channel sound, A is a so-called mixing matrix, and X is a matrix with the independent components. Essentially the model says that each channel in the multi-channel sound is a linear combination of the independent sound components in X. If we knew the mixing matrix A we could easily solve the model above for X by standard means. However, if we know neither A nor X, the decomposition of Y is underdetermined: there are infinitely many combinations of A and X that result in the same Y.
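Written out for two channels and two independent components, with mixing coefficients aij, the model says:

y1(t) = a11·x1(t) + a12·x2(t)
y2(t) = a21·x1(t) + a22·x2(t)

This two-by-two case is exactly the situation of the example at the bottom of this page.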

One approach to solving the equation above is to make assumptions about the statistical properties of the components in the matrix X: it turns out that it suffices to assume that the components in X are statistically independent at each time instant. In many cases this is not an unrealistic assumption, although in practice it need not hold exactly. Another assumption is that the mixing matrix is constant, i.e. that the mixing conditions did not change during the recording of the sound.

Theory says that statistically independent signals are uncorrelated (the reverse is not always true: uncorrelated signals need not be statistically independent). The methods implemented here all exploit this fact, as follows. If we calculate the CrossCorrelationTable for the left-hand and the right-hand sides of the equation above, then for the multi-channel sound Y on the left we obtain a cross-correlation matrix C, while for the right-hand side we obtain A·D·A′, where D is a diagonal matrix because all cross-correlations between different independent components are zero by definition. This results in the following identity:

C(τ)=A·D(τ)·A′, for all values of the lag time τ.

This equation says that, given the model, the cross-correlation matrix can be diagonalized for all values of the lag time by the same transformation matrix A.
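In formula form the reasoning is a one-line derivation: substituting Y = A·X into the definition of the cross-correlation gives, with E[·] denoting averaging over time,

C(τ) = E[Y(t)·Y′(t+τ)] = A·E[X(t)·X′(t+τ)]·A′ = A·D(τ)·A′,

where the middle factor is the diagonal matrix D(τ), because the independent components are mutually uncorrelated at every lag.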

If we calculate the cross-correlation matrices for a number of different lag times, say 20, we have to find the single matrix A that diagonalizes them all. Unfortunately there is no closed-form solution that diagonalizes more than two matrices at the same time, and we have to resort to iterative algorithms for joint diagonalization.

Two of these algorithms are the qdiag method as described in Vollgraf & Obermayer (2006) and the ffdiag method as described in Ziehe et al. (2004).

Unfortunately the convergence criteria of these two algorithms cannot easily be compared: the criterion for the ffdiag algorithm is the relative change of the square root of the sum of the squared off-diagonal elements of the transformed cross-correlation matrices, whereas the criterion for qdiag is the largest change in the norms of the eigenvectors during an iteration.
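It can be instructive to create these cross-correlation tables as separate objects and inspect what the algorithms have to diagonalize. A minimal sketch, assuming your Praat version offers the Sound command To CrossCorrelationTable: with (start time, end time, lag time) arguments as described on the CrossCorrelationTable manual page; stereo stands for any selected multi-channel Sound:

    # one CrossCorrelationTable per lag time: here lag 0 s and lag 0.0002 s
    selectObject: stereo
    cct1 = To CrossCorrelationTable: 0.1, 1, 0.0
    selectObject: stereo
    cct2 = To CrossCorrelationTable: 0.1, 1, 0.0002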

Example

We start by creating a speech synthesizer, which we use to create two sounds. We then mix the two sounds, and finally our blind source separation software will try to undo our mixing by extracting the two original sounds as well as possible from the two mixtures.

    synth = Create SpeechSynthesizer: "English (Great Britain)", "Female1"
    s1 = To Sound: "This is some text", "no"

The first speech sound was created from the text "This is some text" at a speed of 175 words per minute.

    selectObject: synth
    Speech output settings: 44100, 0.01, 1.2, 1.0, 145, "IPA"
    Estimate speech rate from speech: "no"
    s2 = To Sound: "abracadabra, abra", "no"

The second sound, "abracadabra, abra", was synthesized at 145 words per minute and with a somewhat higher pitch than the first sound (pitch multiplier 1.2 instead of the default 1.0).

    plusObject: s1
    stereo = Combine to stereo

We combine the two separate sounds into one stereo sound, because our blind source separation works on multi-channel sounds only.

    mm = Create simple MixingMatrix: "mm", 2, 2, "1.0 2.0 2.0 1.0"

A two by two MixingMatrix is created.
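The four values "1.0 2.0 2.0 1.0" fill the matrix row by row, so the mixing matrix reads (an interpretation that matches the description of the Mix result below):

    1.0  2.0
    2.0  1.0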

    plusObject: stereo
    Mix

The last command, Mix, creates a new two-channel sound where each channel is a linear mixture of the two channels in the stereo sound, i.e. channel 1 is the sum of s1 and s2 with mixture strengths of 1 and 2, respectively. The second channel is also the sum of s1 and s2 but now with mixture strengths 2 and 1, respectively.
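In formula form:

channel1 = 1.0·s1 + 2.0·s2
channel2 = 2.0·s1 + 1.0·s2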

    To Sound (blind source separation): 0.1, 1, 20, 0.0002, 100, 0.001, "ffdiag"

The two channels in the new sound that results from this command contain a reasonable approximation of the two originating sounds.

The figure produced by the complete script below shows in the top panel the two speech sounds, "This is some text" and "abracadabra, abra". The middle panel shows the two mixed sounds, while the lower panel shows the two sounds after unmixing.

The first two panels will not change between different sessions of Praat. The last panel, which shows the result of the blind source separation, i.e. the unmixing, will not always be the same, for two reasons. In the first place, the unmixing always starts with a random initialization of the parameters that have to be determined, so the iteration sequence will never be the same and the final outcomes may differ. In the second place, as was explained in the blind source separation manual, the unmixing is only unique up to a scale factor and a permutation. The channels in the unmixed sound therefore do not necessarily correspond, in order or in amplitude, to the channels in our "original" stereo sound.

The complete script:

    syn = Create SpeechSynthesizer: "English (Great Britain)", "Female1"
    s1 = To Sound: "This is some text", "no"
    selectObject: syn
    Speech output settings: 44100, 0.01, 1.2, 1.0, 145, "IPA"
    Estimate speech rate from speech: "no"
    s2 = To Sound: "abracadabra, abra", "no"
    plusObject: s1
    stereo = Combine to stereo
    Select inner viewport: 1, 6, 0.1, 1.9
    Draw: 0, 0, 0, 0, "no", "Curve"
    Draw inner box
    mm = Create simple MixingMatrix: "mm", 2, 2, "1.0 2.0 2.0 1.0"
    plusObject: stereo
    mixed = Mix
    Select inner viewport: 1, 6, 2.1, 3.9
    Draw: 0, 0, 0, 0, "no", "Curve"
    Draw inner box
    unmixed = To Sound (blind source separation): 0.1, 1, 20, 0.0002, 100, 0.001, "ffdiag"
    Select inner viewport: 1, 6, 4.1, 5.9
    Draw: 0, 0, 0, 0, "no", "Curve"
    Draw inner box
    removeObject: unmixed, syn, stereo, s1, s2, mixed, mm

© djmw 20190811