Principal component analysis

This tutorial describes how you can perform principal component analysis with Praat.

Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.

1. Objectives of principal component analysis

• To discover or to reduce the dimensionality of the data set.

• To identify new meaningful underlying variables.

2. How to start

We assume that the multi-dimensional data have been collected in a TableOfReal data matrix, in which the rows are associated with the cases and the columns with the variables. The TableOfReal is therefore interpreted as numberOfRows data vectors, each with numberOfColumns elements.

Traditionally, principal component analysis is performed on the Covariance matrix or on the Correlation matrix. These matrices can be calculated from the data matrix. The covariance matrix contains scaled sums of squares and cross products. A correlation matrix is like a covariance matrix, except that the variables, i.e. the columns, have first been standardized. We will have to standardize the data first if the variances of the variables differ much, or if the units of measurement of the variables differ. You can standardize the data in the TableOfReal by choosing Standardize columns.
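
The relation between the two matrices can be sketched outside Praat. The Python fragment below is only an illustration and not Praat's implementation: the function names and the data are hypothetical, and it assumes the usual n - 1 divisor for both the covariance and the column standard deviation. Under those assumptions, the covariance matrix of the standardized columns is the correlation matrix of the original data.

```python
from math import sqrt

def column_means(data):
    """Mean of every column of a row-per-case data matrix."""
    n = len(data)
    return [sum(row[j] for row in data) / n for j in range(len(data[0]))]

def covariance_matrix(data):
    """Scaled sums of squares and cross products (assumed divisor: n - 1)."""
    n, p = len(data), len(data[0])
    means = column_means(data)
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def standardize_columns(data):
    """Give every column zero mean and unit standard deviation."""
    n, p = len(data), len(data[0])
    means = column_means(data)
    sds = [sqrt(sum((row[j] - means[j]) ** 2 for row in data) / (n - 1))
           for j in range(p)]
    return [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in data]

# Hypothetical data: 4 cases (rows), 2 variables (columns) on very different scales.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 500.0]]
covariance = covariance_matrix(data)
correlation = covariance_matrix(standardize_columns(data))
```

Without standardization the second variable would dominate the covariance matrix simply because of its larger scale, which is why the choice between the two matrices matters.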

To perform the analysis, we select the TableOfReal data matrix in the list of objects and choose To PCA. This will result in a new PCA object in the list of objects.

We can now make a scree plot of the eigenvalues with Draw eigenvalues... to get an indication of the importance of each eigenvalue. The exact contribution of each eigenvalue (or a range of eigenvalues) to the "explained variance" can also be queried: Get fraction variance accounted for.... You might also check for the equality of a number of eigenvalues: Get equality of eigenvalues....

3. Determining the number of components to keep

There are two methods to help you to choose the number of components to keep. Both methods are based on relations between the eigenvalues.

• Plot the eigenvalues with Draw eigenvalues.... If the points on the graph tend to level out (show an "elbow"), the eigenvalues beyond the elbow are usually close enough to zero that they can be ignored.

• Limit the number of components to that number that accounts for a certain fraction of the total variance. For example, if you are satisfied with 95% of the total variance explained, then use the number you get by the query Get number of components (VAF)... 0.95.
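
The variance-accounted-for criterion amounts to a cumulative sum over the eigenvalues. The following Python sketch illustrates the arithmetic only. The function names and the eigenvalues are hypothetical, and the eigenvalues are assumed to be sorted from largest to smallest.

```python
def fraction_variance_accounted_for(eigenvalues, first, last):
    """Fraction of the total variance carried by components first..last (1-based)."""
    return sum(eigenvalues[first - 1:last]) / sum(eigenvalues)

def number_of_components_vaf(eigenvalues, fraction):
    """Smallest number of leading components whose share reaches `fraction`."""
    total = sum(eigenvalues)
    running = 0.0
    for count, value in enumerate(eigenvalues, start=1):
        running += value
        if running / total >= fraction:
            return count
    return len(eigenvalues)

# Hypothetical eigenvalues, largest first; their sum is 10.
eigenvalues = [6.0, 3.0, 0.8, 0.2]
kept = number_of_components_vaf(eigenvalues, 0.95)   # 3: the first two reach only 90%
share = fraction_variance_accounted_for(eigenvalues, 1, kept)
```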

4. Getting the principal components

Principal components are obtained by projecting the multivariate data vectors on the space spanned by the eigenvectors. This can be done in two ways:

1. Directly from the TableOfReal without first forming a PCA object: To Configuration (pca).... You can then draw the Configuration or display its numbers.

2. Select a PCA and a TableOfReal object together and choose To Configuration.... In this way you project the TableOfReal onto the PCA's eigenspace.
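
Either way, the projection itself is a set of inner products between the data vectors and the eigenvectors. The Python sketch below shows this under stated assumptions: the function name is hypothetical, each eigenvector is given as a row of unit length, and the column means are subtracted before projecting. Whether and how Praat centers the data is not specified here.

```python
from math import sqrt

def project(data, eigenvectors, number_of_components):
    """Project centered data vectors onto the leading eigenvectors."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[sum((row[j] - means[j]) * eigenvectors[k][j] for j in range(p))
             for k in range(number_of_components)]
            for row in data]

# Hypothetical data lying exactly on the line y = x, so that all of the
# variation falls along the first eigenvector (1, 1) / sqrt(2).
s = 1.0 / sqrt(2.0)
data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
eigenvectors = [[s, s], [s, -s]]
scores = project(data, eigenvectors, 2)
```

Each row of scores holds the coordinates of one case in the eigenvector space. In this example the second coordinate is zero for every case, so a single component would suffice.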

5. Mathematical background on principal component analysis

The mathematical technique used in PCA is called eigen analysis: we solve for the eigenvalues and eigenvectors of a square symmetric matrix with sums of squares and cross products. The eigenvector associated with the largest eigenvalue has the same direction as the first principal component. The eigenvector associated with the second largest eigenvalue determines the direction of the second principal component. The sum of the eigenvalues equals the trace of the square matrix and the maximum number of eigenvectors equals the number of rows (or columns) of this matrix.
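
For a 2 x 2 symmetric matrix the eigenvalues and eigenvectors have a closed form, which makes these properties easy to check by hand. The Python sketch below is an illustration with a hypothetical function name and a hypothetical matrix. It is not how Praat computes the decomposition.

```python
from math import hypot

def eigen_2x2_symmetric(a, b, c):
    """Eigenvalues (largest first) and unit eigenvectors of [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = hypot((a - c) / 2.0, b)
    values = [mean + radius, mean - radius]
    if b == 0.0:
        # Already diagonal: the eigenvectors are the coordinate axes.
        vectors = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        vectors = []
        for value in values:
            x, y = b, value - a          # solves (a - value) * x + b * y = 0
            norm = hypot(x, y)
            vectors.append((x / norm, y / norm))
    return values, vectors

# Hypothetical covariance matrix [[5, 2], [2, 2]] with trace 7.
values, vectors = eigen_2x2_symmetric(5.0, 2.0, 2.0)
trace = 5.0 + 2.0
first, second = vectors
inner_product = first[0] * second[0] + first[1] * second[1]
```

Here the eigenvalues sum to the trace, and the two eigenvectors are orthogonal, so the principal components they define are uncorrelated.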

6. Algorithms

If our starting point happens to be a symmetric matrix like the covariance matrix, we solve for the eigenvalues and eigenvectors by first performing a Householder reduction to tridiagonal form, followed by the QL algorithm with implicit shifts.
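
A full Householder and QL implementation is too long to show here. As a stand-in, the Python sketch below uses the cyclic Jacobi rotation method, a different and simpler algorithm that also diagonalizes a symmetric matrix. It is not the Householder/QL procedure described above. The function name, the tolerance and the example matrix are hypothetical.

```python
from math import sqrt

def jacobi_eigen(matrix, tolerance=1e-12, max_sweeps=50):
    """Eigenvalues (largest first) and eigenvectors (as rows) of a symmetric matrix."""
    n = len(matrix)
    a = [row[:] for row in matrix]                       # working copy
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sqrt(sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j))
        if off < tolerance:                              # numerically diagonal
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the element a[p][q].
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                sign = 1.0 if theta >= 0.0 else -1.0
                t = sign / (abs(theta) + sqrt(theta * theta + 1.0))
                c = 1.0 / sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):                       # columns p, q of A
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):                       # rows p, q of A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):                       # accumulate rotations
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    pairs = sorted(((a[i][i], [v[k][i] for k in range(n)]) for i in range(n)),
                   key=lambda pair: pair[0], reverse=True)
    return [pair[0] for pair in pairs], [pair[1] for pair in pairs]

# Hypothetical symmetric matrix with trace 9.
example = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
eigenvalues, eigenvectors = jacobi_eigen(example)
```

The Jacobi method is easy to follow but slower on large matrices, which is one reason tridiagonalization followed by QL is generally preferred in practice.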