
Statistical inference on a database corpus

Test statistical modules

Listing

In the end, statistical inference should be based on a subset of the (compiled) data in the corpus. As every scientific problem requires its own methods, the listing option allows you to extract the relevant information from the database. Using the two Group selections and the parameter selections, you determine which data are listed. However, compiling a listing and entering it into a statistical program (e.g., PSPP) is a lot of work. Therefore, some preliminary statistics can be performed directly on the listings.

Frequency counts

All statistical inference starts with determining the size of the sample. The frequencies of occurrence of records are printed, grouped on the attributes given in Group 1 (rows) and Group 2 (columns).
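
As an illustration, a minimal sketch in Python of such a cross tabulation with marginal totals; the records and attribute values are hypothetical:

    from collections import Counter

    # Hypothetical records: (Group 1 attribute, Group 2 attribute) per record
    records = [("stressed", "noun"), ("stressed", "verb"),
               ("unstressed", "noun"), ("stressed", "noun")]

    # Count the occurrences of each (row, column) combination
    counts = Counter(records)

    rows = sorted({r for r, _ in records})
    cols = sorted({c for _, c in records})

    # Print the frequency table with row and column totals (marginals)
    print("".ljust(12) + "".join(c.ljust(8) for c in cols) + "total")
    for r in rows:
        line = [counts[(r, c)] for c in cols]
        print(r.ljust(12) + "".join(str(n).ljust(8) for n in line) + str(sum(line)))
    col_totals = [sum(counts[(r, c)] for r in rows) for c in cols]
    print("total".ljust(12) + "".join(str(n).ljust(8) for n in col_totals) + str(sum(col_totals)))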

The use of Nuisance factors

Often, it is necessary to account for the effects of certain factors (e.g., speaker, context) that are not the focus of the question. It would be nice if these factors were taken into account without cluttering the output. Such factors are called nuisance factors and can be entered as Group 3. Of course, they will be shown in the header to remind you of what has been calculated.

Mean and Standard deviation (ANOVA)

The mean, standard deviation, and frequency of occurrence of value 1 are printed, grouped on the attributes given in Group 1 (rows) and Group 2 (columns). You can choose whether pooled variances are used for calculating the marginals (row and column totals), or only the within-cell variances (degrees of freedom are discounted accordingly). Nuisance factors are taken into account when using within-cell variances (they are useless when using pooled variances).
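
A minimal sketch in Python of the difference between the two options, on hypothetical cells; the pooled variance lumps all observations of a marginal together, while the within-cell variance averages the cell variances, discounting one degree of freedom per cell:

    import numpy as np

    # Hypothetical cells: {(row, column): array of value-1 observations}
    cells = {("A", "x"): np.array([1.0, 2.0, 3.0]),
             ("A", "y"): np.array([10.0, 11.0]),
             ("B", "x"): np.array([2.0, 2.5, 3.5, 4.0])}

    for key, v in cells.items():
        print(key, "n=%d mean=%.2f sd=%.2f" % (len(v), v.mean(), v.std(ddof=1)))

    # Row marginal for row "A", calculated two ways:
    parts = [v for (r, c), v in cells.items() if r == "A"]
    row = np.concatenate(parts)

    # 1) pooled variance: all observations of the row lumped together
    pooled_var = row.var(ddof=1)

    # 2) within-cell variance: weighted average of the cell variances;
    #    one degree of freedom is discounted per cell
    df = sum(len(v) - 1 for v in parts)
    within_var = sum((len(v) - 1) * v.var(ddof=1) for v in parts) / df

    print("pooled var: %.2f, within-cell var: %.2f" % (pooled_var, within_var))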

Using these means and variances, a simple ANOVA is performed. This is just for fun. If you think it has any authority, please take a remedial course in statistics.

Pearson Product Moment Correlation

The product moment correlations (Pearson R) are calculated on value 1 versus value 2, grouped on the attributes given in Group 1 (rows) and Group 2 (columns). You can choose whether pooled variances are used for calculating the marginals (row and column totals), or only the within-cell variances (1 degree of freedom is discounted for each cell). The effect of using the within-cell variance is to normalize the cell means before calculating the correlations.
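
A minimal sketch in Python, on hypothetical cells, of why using within-cell statistics amounts to normalizing the cell means: each cell's means are subtracted before the values are pooled and correlated:

    import numpy as np

    # Hypothetical cells: {(row, column): (value-1 array, value-2 array)}
    cells = {("A", "x"): (np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.5, 4.0])),
             ("A", "y"): (np.array([5.0, 7.0, 6.0]), np.array([9.0, 12.0, 10.0]))}

    # Within-cell correlation for the marginal of row "A": subtract each
    # cell's means first, so only the within-cell covariation remains
    x = np.concatenate([v1 - v1.mean() for v1, v2 in cells.values()])
    y = np.concatenate([v2 - v2.mean() for v1, v2 in cells.values()])
    r_within = np.corrcoef(x, y)[0, 1]

    # Pooled correlation: raw values, cell means included
    xr = np.concatenate([v1 for v1, v2 in cells.values()])
    yr = np.concatenate([v2 for v1, v2 in cells.values()])
    r_pooled = np.corrcoef(xr, yr)[0, 1]

    print("within-cell r = %.3f, pooled r = %.3f" % (r_within, r_pooled))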

Normalized Pearson Product Moment Correlation

The product moment correlations (Pearson R) are calculated using normalized means (as when using the within-cell variance) and using normalized variances. This is as if all cell values were converted to a standard normal distribution (Z-values). In practice, this means that the marginal values are calculated as the mean correlation coefficients, weighted with the cell degrees of freedom. For each cell, 2 degrees of freedom are discounted.
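
A minimal sketch in Python of the marginal calculation described above, on hypothetical cells; the per-cell correlations (which equal the correlations of the per-cell Z-values) are averaged, weighted with the cell degrees of freedom (n - 2, as 2 degrees of freedom are discounted per cell):

    import numpy as np

    # Hypothetical per-cell paired observations (value 1, value 2)
    cells = [(np.array([1.0, 2.0, 3.0, 4.0]), np.array([1.5, 2.2, 2.9, 4.5])),
             (np.array([5.0, 7.0, 6.0]), np.array([9.0, 12.0, 10.0]))]

    # Per-cell Pearson R; unchanged by converting each cell to Z-values
    rs = [np.corrcoef(x, y)[0, 1] for x, y in cells]

    # Marginal: mean correlation, weighted with the cell degrees of freedom
    dfs = [len(x) - 2 for x, y in cells]
    r_marginal = np.average(rs, weights=dfs)

    print("cell r's:", ["%.3f" % r for r in rs], "weighted mean r = %.3f" % r_marginal)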

Corrected Means Analysis

The largest problem in corpus linguistics is the unequal distribution of "observations". In general, observations (e.g., words, word classes, but also all kinds of other factors) follow Zipf's law: Freq(Ei) ~ 1/i, that is, the frequency of occurrence of the i-th most frequent element is proportional to 1/i. The paradoxical result is that the combined probability mass of all extremely rare elements (e.g., words) is very large. For example, every few pages of a book will contain words that you will not find repeated in a whole shelf of books.
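
A small numerical illustration of this tail mass, assuming an idealized Zipf distribution over a vocabulary of 100,000 elements:

    import numpy as np

    # Zipf's law: probability of the i-th most frequent element ~ 1/i,
    # normalized over a hypothetical vocabulary of N elements
    N = 100000
    p = 1.0 / np.arange(1, N + 1)
    p /= p.sum()

    # Combined probability mass outside the 100 most frequent elements
    print("mass of top 100 elements: %.2f" % p[:100].sum())   # roughly 0.43
    print("mass of the rare tail:    %.2f" % p[100:].sum())   # roughly 0.57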

As a result of this skewed probability distribution, any attempt to get a balanced sample over any reasonable set of parameters (e.g., stress, accent, position in the word/sentence, Part-of-Speech) will falter. Statistical methods that use a fixed-variance approach (e.g., MANOVA/ANOVA) require balanced samples and are therefore inefficient when used on natural speech corpora.

Corrected Means Analysis is a generalization of the ANOVA that can be of some help (see References). The "Corrected Means" are calculated from the mean values of homogeneous subsets of realizations, i.e., sets for which the values of all relevant factors are equal. A table is constructed with the factor values for which the average is to be calculated as the row headings and all combinations of values of the other factors (nuisance factors) as column headings. Each cell contains the mean value of the "homogeneous" set of realizations that conform to the row and column factor values. The result is a table that is, in general, less than half-filled. Due to this extreme sparsity, standard statistical techniques (e.g., Factor Analysis, ANOVA, or MANOVA) will give results of only limited value.
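
A minimal sketch in Python of how such a table is constructed; the factor names and realizations are hypothetical:

    import numpy as np

    # Hypothetical realizations: (row factor, nuisance column factors, value)
    data = [("stressed", ("spk1", "initial"), 0.12),
            ("stressed", ("spk1", "initial"), 0.14),
            ("stressed", ("spk2", "final"), 0.20),
            ("unstressed", ("spk1", "initial"), 0.08),
            ("unstressed", ("spk2", "medial"), 0.10)]

    rows = sorted({r for r, c, v in data})
    cols = sorted({c for r, c, v in data})

    # Cell (row, column) -> mean of its homogeneous set; empty cells stay NaN
    table = np.full((len(rows), len(cols)), np.nan)
    counts = np.zeros((len(rows), len(cols)), dtype=int)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            vals = [v for rr, cc, v in data if rr == r and cc == c]
            if vals:
                table[i, j] = np.mean(vals)
                counts[i, j] = len(vals)

    print(rows, cols)
    print(table)   # note the NaN's: the table is typically less than half filled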

To handle this sparsity, we model segmental values as VAL(all factors) = A(row factors) + B(column factors), i.e., the value as a function of all relevant factors is the sum of the effects of the row factors and the effects of the column (nuisance) factors. That is, the influence of the row factors is independent of the influence of the column factors. Under this assumption, the average pair-wise difference between corresponding cells in any two rows should depend only on the values of the row factors involved, and not on the values of the column factors. This makes it possible to calculate the average pair-wise cell differences between all pairs of rows, using only pairs of cells from the same column for which there are realizations in both rows. The differences are weighted to account for the variation in the number of realizations in each cell, the weight being w = 1/(1/#Cell1 + 1/#Cell2). However, the exact form of the weighting function has little effect on the outcome, as long as the weights are related to the number of realizations in the cells.
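
A minimal sketch in Python of this weighted pair-wise row difference, on a small hypothetical cell-mean table:

    import numpy as np

    # Hypothetical cell-mean table (rows x nuisance columns; NaN = empty cell)
    table = np.array([[0.13, 0.20, np.nan],
                      [0.08, np.nan, 0.10],
                      [np.nan, 0.15, 0.07]])
    # Number of realizations behind each cell mean
    counts = np.array([[2, 1, 0],
                       [1, 0, 3],
                       [0, 2, 2]])

    def row_difference(i, j):
        """Weighted mean difference between rows i and j, using only
        columns for which both rows have realizations."""
        both = ~np.isnan(table[i]) & ~np.isnan(table[j])
        if not both.any():
            return None
        d = table[i, both] - table[j, both]
        # Weight per pair of cells: w = 1/(1/#Cell1 + 1/#Cell2)
        w = 1.0 / (1.0 / counts[i, both] + 1.0 / counts[j, both])
        return np.average(d, weights=w)

    for i in range(3):
        for j in range(i + 1, 3):
            print("rows %d-%d: mean difference %s" % (i, j, row_difference(i, j)))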

The set of average differences between all pairs of rows constitutes a set of linear equations on the mean row values that can be solved using standard techniques (i.e., minimizing the RMS error with a Singular Value Decomposition, SVD). The results are the Corrected Mean values of the rows, relative to the overall mean value. For a fully balanced set of realizations, the result of this procedure would be identical to the raw means. Therefore, the corrected mean values can be interpreted as a least-RMS-error approximation of "balanced" means for an unbalanced data set. The overall mean value of all realizations from which the corrected means are calculated is used to transform the relative durations to absolute durations.
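
A minimal sketch in Python of this step, on hypothetical row differences; numpy's least-squares solver minimizes the RMS error by means of an SVD:

    import numpy as np

    # Hypothetical weighted mean differences d between pairs of rows (i, j, d)
    pairs = [(0, 1, 0.05), (0, 2, 0.05), (1, 2, 0.03)]
    n_rows = 3

    # Each difference gives one linear equation m_i - m_j = d; one extra
    # equation sum(m) = 0 anchors the solution relative to the overall mean
    A = np.zeros((len(pairs) + 1, n_rows))
    b = np.zeros(len(pairs) + 1)
    for k, (i, j, d) in enumerate(pairs):
        A[k, i], A[k, j], b[k] = 1.0, -1.0, d
    A[-1, :] = 1.0

    # Least-squares solution of the (generally inconsistent) equations
    m, *_ = np.linalg.lstsq(A, b, rcond=None)
    print("corrected means (relative):", np.round(m, 4))

    # Adding the overall mean of all realizations gives absolute values
    overall_mean = 0.12   # hypothetical
    print("corrected means (absolute):", np.round(m + overall_mean, 4))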

The original mean row differences are calculated from pair-wise cell differences. The non-parametric Wilcoxon Matched-Pairs Signed-Ranks test (WMPSR) is used to test the statistical significance of the differences. Each pair of table cells is used as a single matched pair in the analysis, i.e., we do not look "inside" the table cells.
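
A minimal sketch in Python, using the WMPSR test from scipy on two hypothetical rows of cell means:

    import numpy as np
    from scipy.stats import wilcoxon

    # Hypothetical cell means of two rows, paired per nuisance column
    # (only columns with realizations in both rows are used)
    row_a = np.array([0.13, 0.20, 0.11, 0.15, 0.18, 0.12])
    row_b = np.array([0.08, 0.17, 0.10, 0.16, 0.12, 0.09])

    # Each pair of corresponding cells is one matched pair; the values
    # inside the cells are not inspected
    stat, p = wilcoxon(row_a, row_b)
    print("WMPSR statistic = %.1f, p = %.3f" % (stat, p))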

References: