Difference of two proportions

This page explains how you compute the significance of a difference between two proportions with a χ² (chi-square) test.

1. Example of normal use

Suppose that you are interested in proving that for a certain experimental participant Task B is easier than Task A. You let the participant perform Task A 110 times, and she turns out to perform this task correctly 71 times. You also let her perform Task B 120 times, and she performs this task correctly 93 times. The following table summarizes the results of your experiment:


            Correct   Incorrect
Task A         71        39
Task B         93        27

The null hypothesis is that both tasks are equally difficult for the participant and that the probability that she performs Task A correctly is equal to the probability that she performs Task B correctly.

To compute the probability that the observed proportions are at least as different as 93/120 and 71/110 if the null hypothesis is true, go to Report difference of two proportions in the Goodies menu and fill in the four values 71, 39, 93, and 27. The resulting two-tailed p is 0.04300, suggesting that the null hypothesis can be rejected and the two tasks are not equally difficult for the participant (if the possibility that Task A is easier for her than Task B can be ruled out a priori, then the resulting one-tailed p is 0.02150).
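
For readers who want to check the computation outside the Goodies menu, the following is a minimal sketch in Python. The function name `two_proportions_p` is hypothetical, and the use of Yates' continuity correction is an assumption: the text above does not say which variant of the χ² test is used, but with this correction the sketch lands on the p values reported on this page.

```python
from math import erfc, sqrt

def two_proportions_p(a, b, c, d):
    """Two-tailed p for the 2x2 table [[a, b], [c, d]] (chi-square, 1 df).

    Assumption: Yates' continuity correction is applied.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # an empty row or column: no evidence of a difference
    # Continuity-corrected numerator, clipped at zero.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * diff ** 2 / denom
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))

p_two = two_proportions_p(71, 39, 93, 27)  # Task A: 71/39, Task B: 93/27
p_one = p_two / 2                          # only if one direction is ruled out a priori
print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
```

With the four values 71, 39, 93, and 27, this gives a two-tailed p of about 0.043, in line with the value above.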

2. Example of incorrect use: areal features

An anonymous linguist once proposed that there was a causal relation between blood groups and the incidence of dental fricatives. He noticed that dental fricatives occurred mainly in languages whose speakers predominantly had blood group O. To prove his point, he tabulated 100 languages:

                  Has /θ/ or /ð/   No dental fricatives
Group O                 24                  11
Group A or B            29                  36

Since p < 0.05, the linguist regarded his hypothesis as being supported by the facts. However, this χ² test assumes that the 100 languages are independent, but they are not. Two adjacent languages tend to correlate in their probability of having dental fricatives, and their speakers tend to correlate in their blood groups. Both are areal features, which undermine the independence assumed by the χ² test. The actual null hypothesis that the test rejected was the combined hypothesis that dental fricatives do not correlate with blood group and that the 100 languages are independent.
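
The effect of areal dependence can be illustrated with a small simulation. The sketch below is hypothetical throughout: the 10 areas of 10 languages each, the 0.9 probability that a language follows its area, and the names `chi2_p` and `false_positive_rate` are all assumptions made for illustration, not data from any real survey. The two features are generated without any relation between them, so every "significant" result is a false positive.

```python
import random
from math import erfc, sqrt

def chi2_p(a, b, c, d):
    """Two-tailed chi-square p (1 df, Yates' correction assumed) for a 2x2 table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    return erfc(sqrt(n * diff ** 2 / denom / 2))

def false_positive_rate(areal, n_sims=2000, n_areas=10, per_area=10, seed=1):
    """Fraction of simulated surveys with p < 0.05 although the features are unrelated."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        table = [[0, 0], [0, 0]]
        for _ in range(n_areas):
            # Area-level tendencies for the two features, drawn independently.
            area_x = rng.random() < 0.5
            area_y = rng.random() < 0.5
            for _ in range(per_area):
                if areal:
                    # Each language follows its area with probability 0.9.
                    x = area_x if rng.random() < 0.9 else not area_x
                    y = area_y if rng.random() < 0.9 else not area_y
                else:
                    # Fully independent languages.
                    x = rng.random() < 0.5
                    y = rng.random() < 0.5
                table[int(x)][int(y)] += 1
        (a, b), (c, d) = table
        if chi2_p(a, b, c, d) < 0.05:
            hits += 1
    return hits / n_sims

print("independent languages:", false_positive_rate(areal=False))
print("areal clustering:     ", false_positive_rate(areal=True))
```

Under these assumptions the rate for independent languages stays near or below the nominal 5 percent, whereas with areal clustering the test rejects far more often, even though the two features are unrelated by construction.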

Another anonymous linguist proposed that those Limburgian dialects that had lost their tone contrast compensated for this by having larger vowel inventories. He drew up a table of the dialects of 100 villages:

                      Has tone   Has no tone
Has over 25 vowels       10           7
Has under 25 vowels      80           3

This result is very significant (p < 10⁻⁴), but only shows that either there is a relation between tone and the number of vowels or that the dialects are not independent. And since adjacent dialects are arguably dependent both with respect to tone and the number of vowels, the statistical significance does not allow us to draw any conclusion about the relationship between tone and the number of vowels.

3. Example of problematic use: pooling participants

An anonymous student decided to do the Task A versus Task B experiment described above, but did not let one participant perform all the 230 tasks. Instead, she let 5 participants perform 46 tasks each (22 times Task A, 24 times Task B). The pooled data were:


            Correct   Incorrect
Task A         71        39
Task B        104        16

The resulting p is 0.00016. So what is the conclusion, if the measurements can clearly be dependent? Well, if the null hypothesis is that all five participants are equally good at Task A as at Task B, then this hypothesis can be rejected. The conclusion must be that these five participants have on average more trouble with Task A than with Task B. The student incorrectly concluded, however, that Task A was more difficult for the average population than Task B. In order to be able to draw such a conclusion, however, a different test would be required, namely one that takes into account that the five participants form a random sample from the total population. The simplest such test would be a sign test over the participants: count those participants who score better on Task A than on Task B and see whether this number is reliably less than 50 percent of all participants. For five participants, such a sign test would never reach significance at a two-tailed 5 percent level (2·0.5⁵ = 0.0625).
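
The arithmetic of the sign test can be sketched as follows. The function name `sign_test_p` is hypothetical, and the sketch assumes that participants who score equally on both tasks have been left out beforehand.

```python
from math import comb

def sign_test_p(k, n):
    """Two-tailed sign-test p when k of n participants score better on Task A.

    Under the null hypothesis each participant is equally likely to score
    better on either task, so k follows a binomial(n, 0.5) distribution.
    """
    tail = min(k, n - k)
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(p, 1.0)

# Most extreme outcome for five participants: none scores better on Task A.
best_case_5 = sign_test_p(0, 5)

# Smallest number of participants for which a unanimous result gives p < 0.05.
smallest_n = next(n for n in range(1, 100) if sign_test_p(0, n) < 0.05)

print(best_case_5)  # 0.0625
print(smallest_n)   # 6
```

Even if all five participants agree, p stays at 0.0625. With six unanimous participants, p would drop to 2·0.5⁶ = 0.03125.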

4. Example of problematic use: pooling participants

Our purpose was to disprove the null hypothesis that listeners' perception does not depend on the language they think they hear. Certain vowel tokens acoustically in between the Dutch /ɑ/ and the Dutch /ɔ/ were perceived 50 percent of the time as /ɑ/ and 50 percent of the time as /ɔ/ when Dutch learners of Spanish thought they were hearing Dutch, but 60 percent of the time as /ɔ/ when they thought they were hearing Spanish. The responses of 40 listeners, all of whom underwent both language modes, are combined in the following table:


                 /ɑ/ responses   /ɔ/ responses
Dutch mode            200             200
Spanish mode          160             240

The result was p = 0.0056, which reliably showed that these 40 listeners on average shifted their category boundary toward /ɑ/ when they thought that the language they were listening to was Spanish. The conclusion is that not all listeners were indifferent to the language mode, so that mode-dependent perception must exist. The explanation in this case was that the Spanish /a/ (which Dutch learners of Spanish identify with their /ɑ/) is auditorily more front than Dutch /ɑ/. To reject the null hypothesis that language modes exist but that their direction is random for each learner (i.e. that the population average of the shift is zero), a separate test was required to show that the observed shift is representative of the population of Dutch learners of Spanish; this is easier to accomplish for 40 participants than for 5.

© ppgb 20090717