
Example: a t-test between two groups

Assume that 10 speakers are randomly drawn from the Dutch population and 10 from the Flemish population. Their scores on a German-language proficiency test are measured:

dutch = c(34, 45, 33, 54, 45, 23, 66, 35, 24, 50)
flemish = c(67, 50, 64, 43, 47, 50, 68, 56, 60, 39)

The Flemish look a bit better than the Dutch. Is this difference statistically significant?

t.test (dutch, flemish, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  dutch and flemish
## t = -2.512, df = 18, p-value = 0.02176
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -24.790596  -2.209404
## sample estimates:
## mean of x mean of y 
##      40.9      54.4
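You can check this t-value by hand. A minimal sketch of the computation behind the equal-variance two-sample t-test, using the pooled variance:

n1 = length (dutch)
n2 = length (flemish)
# the pooled variance: a weighted average of the two sample variances
pooledVariance = ((n1 - 1) * var (dutch) + (n2 - 1) * var (flemish)) / (n1 + n2 - 2)
# the t-value: the difference between the sample means, divided by its standard error
(mean (dutch) - mean (flemish)) / sqrt (pooledVariance * (1 / n1 + 1 / n2))
# this prints -2.512, the same t-value that t.test reported above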

.

.

.

Question: Is this p-value of 0.02176 a probability?

  1. yes
  2. no
  3. don’t know

.

.

.

.

.

.

Yes, it is a probability. But of what?

.

.

.

Question: Is this p-value of 0.02176…

  1. the probability that the Dutch and Flemish population means are the same?
  2. the probability that if the Dutch and Flemish population means are the same, an absolute t-value of 2.512 or greater will be found between random Dutch and Flemish samples of size 10?
  3. don’t know

.

.

.

.

.

.

So yes, the p-value of 0.02176 is a tail probability: the probability that if the Dutch and Flemish population means are the same, an absolute t-value of 2.512 or greater will be found between random Dutch and Flemish samples of size 10.
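You can compute this tail probability directly from Student’s t distribution, without calling t.test:

# twice the left-tail probability below -2.512, for 18 degrees of freedom
2 * pt (-2.512, df = 18)   # approximately 0.0218, the p-value reported above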

.

.

.

Question: What is this condition (“the Dutch and Flemish population means are the same”) usually called?

  1. the null hypothesis
  2. the alternative hypothesis
  3. don’t know

.

.

.

.

.

.

It’s the null hypothesis (one can have null hypotheses that are not about equality, but the one used here is typical).

Let’s check that the probability of finding an absolute t above 2.512 is indeed around 2.2 percent, by drawing a hundred thousand random samples from the Dutch and Flemish populations.

In the simulation, we assume that the null hypothesis is true, so that the Dutch and Flemish population means are the same; for concreteness, we set both population means to 50 and both population standard deviations to 8.

Here is how you obtain the data for one such experiment (not 100,000 yet):

numberOfParticipantsPerGroup = 10
mu.d = mu.f = 50   # equal population means: the null hypothesis is true
sigma.d = sigma.f = 8   # equal population standard deviations
data.d = rnorm (numberOfParticipantsPerGroup, mu.d, sigma.d)
data.f = rnorm (numberOfParticipantsPerGroup, mu.f, sigma.f)
data.d
data.f
##  [1] 46.01451 53.06159 58.37237 55.26379 43.74593 49.95321 55.10986
##  [8] 45.29203 47.70117 48.67666
##  [1] 62.36448 54.69570 59.98880 62.60507 47.95375 51.00754 48.28727
##  [8] 44.34768 58.20126 48.39343

You get the t-value from such data as follows:

numberOfParticipantsPerGroup = 10
mu.d = mu.f = 50
sigma.d = sigma.f = 8
data.d = rnorm (numberOfParticipantsPerGroup, mu.d, sigma.d)
data.f = rnorm (numberOfParticipantsPerGroup, mu.f, sigma.f)
t = t.test (data.d, data.f, var.equal=TRUE) $ statistic
t
##        t 
## 0.286349

Then try this a hundred thousand times, and see how often the absolute t-value reaches 2.512 or more:

numberOfParticipantsPerGroup = 10
numberOfExperiments = 1e5
mu.d = mu.f = 50   # the null hypothesis is true in the simulation
sigma.d = sigma.f = 8
count = 0
for (experiment in 1 : numberOfExperiments) {
  data.d = rnorm (numberOfParticipantsPerGroup, mu.d, sigma.d)
  data.f = rnorm (numberOfParticipantsPerGroup, mu.f, sigma.f)
  t = t.test (data.d, data.f, var.equal=TRUE) $ statistic
  if (abs (t) > 2.512) {
    count = count + 1
  }
}
count
## [1] 2213

That’s a relative frequency of 2213 / 100000 = 0.02213, very close to the 0.02176 probability computed by t.test. So, R’s t.test seems to compute the p-value correctly!
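The same simulation can be written more compactly with replicate; this sketch is equivalent to the loop above:

tValues = replicate (1e5,
  t.test (rnorm (10, 50, 8), rnorm (10, 50, 8), var.equal=TRUE) $ statistic)
mean (abs (tValues) > 2.512)   # again approximately 0.022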

.

.

.

Question: What can we conclude from the p-value of 0.02176, which is less than the common criterion of 0.05?

  1. we reject the null hypothesis, so the Flemish group is better than the Dutch group
  2. we reject the null hypothesis, so Flemish people are better than Dutch people
  3. don’t know

.

.

.

.

.

.

What if one participant had been different (a score of 27 instead of 47)?

dutch = c(34, 45, 33, 54, 45, 23, 66, 35, 24, 50)
flemish = c(67, 50, 64, 43, 27, 50, 68, 56, 60, 39)
t.test (dutch, flemish, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  dutch and flemish
## t = -1.9122, df = 18, p-value = 0.07191
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -24.13526   1.13526
## sample estimates:
## mean of x mean of y 
##      40.9      52.4

The p-value is now greater than 0.05.

.

.

.

Question: What can we conclude?

  1. we cannot reject the null hypothesis, so we cannot conclude anything
  2. we accept the null hypothesis, so Dutch people are as good as Flemish people
  3. don’t know

.

.

.

.

.

.

Neyman–Pearson statistics

I now very briefly describe a way of doing statistics that I will not recommend for scientific fact finding. It is quite usable for binary decisions in real life, though, such as the question whether or not to take a new drug to market.

The “Type I” error rate is

\(\alpha\) = 0.05

qt (p = 0.025, df = 18)   # the critical value that cuts off 2.5 percent in the left tail
## [1] -2.100922
qt (p = 0.975, df = 18)   # the critical value that cuts off 2.5 percent in the right tail
## [1] 2.100922

If the null hypothesis (no difference between the Dutch and Flemish population means) is true, the absolute t-value will exceed 2.101 five percent of the time (for 10 participants per group).
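You can check that these two critical values together cut off exactly 5 percent:

2 * pt (qt (p = 0.025, df = 18), df = 18)   # 0.05, the Type I error rate alpha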

The “Type II” error rate is

\(\beta\) = 0.20

If the true population means differ by e.g. 4 (or any other number considered to be the threshold of importance), the absolute t-value will exceed 2.101 eighty percent of the time (for 10 participants per group).
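In R, the trade-off between \(\alpha\), \(\beta\), the minimal important difference and the group size can be explored with power.t.test. A sketch; the standard deviation of 3 here is a made-up illustration, chosen so that the power comes out near the 80 percent mentioned above:

# power of a two-sample t-test: 10 participants per group,
# a minimal important difference of 4, an assumed population sd of 3
power.t.test (n = 10, delta = 4, sd = 3, sig.level = 0.05)   # power comes out around 0.80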

The recipe for this methodology in e.g. drug treatment is:

  1. Measure the effect of the drug in 10 people, and the effect of placebo in 10 other people.
  2. If the difference between the groups yields a t-value above 2.101 in favour of the drug, reject the null hypothesis and make the drug officially available to future patients.
  3. If the difference between the groups does not yield a t-value above 2.101 in favour of the drug, “accept” the null hypothesis and do not take the drug to market (a sketch of this decision rule follows below).
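As a sketch, here is what this decision rule might look like in R; drug and placebo are hypothetical vectors of effect measurements, and higher scores are assumed to mean better outcomes:

decideAboutDrug = function (drug, placebo, alpha = 0.05) {
  # the t-value for the drug-placebo comparison
  t = t.test (drug, placebo, var.equal=TRUE) $ statistic
  # the critical value that demarcates the critical region
  criticalValue = qt (1 - alpha / 2, df = length (drug) + length (placebo) - 2)
  if (t > criticalValue) "reject H0: take the drug to market" else "accept H0: do not market the drug"
}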

Values for \(\alpha\) and \(\beta\) will depend on:

  • The severity of side effects.
  • The size of the potential cure (life-saving?).
  • The cost of the drug.
  • And so on.

.

.

.

Question: Are there any p-values involved in this procedure?

  • yes
  • no
  • don’t know

.

.

.

.

.

.

No. I didn’t mention any p-values in my very brief discussion of Neyman–Pearson, and they are indeed not used in that way of doing statistics. One either rejects the null hypothesis if t is in the “critical region”, or accepts the null hypothesis if t is smaller. One only reports the “significance level” \(\alpha\) that is used to demarcate the critical region, e.g. by stating that \(\alpha\) is 0.05.

.

.

.

Question: Is this a good procedure for scientific belief finding?

  • yes, because I need a clear criterion for my beliefs
  • no, because the strength of my belief will gradiently depend on the p-value
  • don’t know

.

.

.

.

.

.

I already suggested that it’s not. Surely the difference between a p-value of 0.049 and one of 0.051 is not impressive enough for me to wholly believe that the effect is real in the former case and that the effect is zero in the latter case. You can read about this in the free article Erroneous analyses of interactions in neuroscience: a problem of significance by Sander Nieuwenhuis, Birte Forstmann & Eric-Jan Wagenmakers (2011, Nature Neuroscience).

Fisher quotes

Ronald Fisher also had some things to say about p-values above 0.05, namely that you should ignore them (they are null results, and a null result is the same as no result):

The test of significance […] tells [the researcher] what to ignore, namely all experiments in which significant results are not obtained. (Fisher 1929, p. 191)

The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. (Fisher 1966, p. 16)

Can you mix Fisher’s and Neyman–Pearson’s approaches?

Strict statisticians will tell you that you cannot mix the two approaches, yet you often see mixes. For instance, in a paper that uses \(\alpha\) and \(\beta\) (and therefore did a power analysis), people sometimes report a p-value, especially if it is low. Please don’t. And in papers that generally report p-values, don’t say things like “significant with p < 0.05”, which is \(\alpha\)-terminology; instead report the actual p-value that you obtained.

Sometimes you see things like “effect A was significant with p < 0.05, and effect B was significant with p < 0.01”, or, equivalently, “effect A was significant at the 0.05 level, and effect B was significant at the 0.01 level.” Both are cases of “moving alpha”, a concept that generative linguists may like (in a very different meaning), but that statisticians of all convictions enthusiastically discourage. Practically, you just always report the p-value, and state a conclusion about the population only if this p-value is below 0.05.

Today’s conclusion

In your research papers you should never compare p-values. If you find that your Spanish listeners improve significantly (p = 0.001), and your Portuguese listeners don’t improve significantly (p = 0.63), you can conclude that Spanish listeners (without “the”, i.e. the population on average) improve, but you cannot conclude that Portuguese listeners don’t improve. And you cannot even conclude that Spanish listeners improve more than Portuguese listeners. This is the single most common mistake in psychological and linguistic research papers (see Nieuwenhuis’ paper for counts, and any linguistic conference proceedings for examples).

What should you do instead? The only valid significance test is a direct comparison between the Spanish and the Portuguese listeners. If every participant’s improvement can be measured by a single number, then the above two p-values were probably based on “t-tests against zero”, i.e. the mean improvement was significantly above zero for the Spanish group but not for the Portuguese group. The valid test is instead a two-group t-test, as in the Dutch–Flemish example above. This directly compares the Spanish group with the Portuguese group, and if the result of this test is significant (i.e. p < 0.05), you conclude that Spanish listeners (without “the”) improve more than Portuguese listeners. So the report is never “Spanish listeners improve but Portuguese listeners don’t” but “Spanish listeners improve more than Portuguese listeners”, and it has to be based on a direct comparison between the two groups.
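A sketch of that direct comparison, with made-up placeholder data (one improvement score per listener):

improvement.spanish = rnorm (20, 5, 3)   # made-up placeholder data
improvement.portuguese = rnorm (20, 1, 3)   # made-up placeholder data
# the direct comparison between the two groups
t.test (improvement.spanish, improvement.portuguese, var.equal=TRUE)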

Or consider this one. For Brazilian Portuguese, you discover (by a two-group t-test) that male and female speakers differ significantly in their use of construction X (males use it more than females), whereas for Iberian Portuguese, your two-group t-test does not show a significant difference between male and female speakers. From these two tests, you cannot conclude anything about the difference between Brazilian and Iberian Portuguese. To test whether the male–female difference is different in Brazilian than in Iberian Portuguese, you perform an analysis of variance with two fixed factors, namely dialect and sex (and with the use of construction X as the dependent variable), and you examine the p-value for the interaction between dialect and sex. If this p-value is below 0.05, and the difference of the differences is in the right direction, you report something like “the male advantage (over females) for construction X is greater for Brazilian Portuguese than for Iberian Portuguese” or, equivalently, “the Brazilian advantage (over Iberian) for construction X is greater for males than for females”. Basically, you try to simplify the wording that you use to explain what the interaction is about; this may depend on what your main effects look like.
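A sketch of this interaction test, again with made-up placeholder data; the data frame speakers and its columns are hypothetical:

# made-up placeholder data: 20 speakers per dialect-sex cell
speakers = expand.grid (speaker = 1 : 20,
  dialect = c ("Brazilian", "Iberian"), sex = c ("male", "female"))
speakers $ usage = rnorm (nrow (speakers), 10, 2)   # made-up usage scores for construction X
model = aov (usage ~ dialect * sex, data = speakers)
summary (model)   # the dialect:sex row contains the p-value for the interaction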

Practical advice

  1. Get many participants. If the effect exists and is not small, there’s a good chance you’ll detect it (p < 0.05).
  2. If p > 0.05, you haven’t detected the effect. However, with that many participants you will be able to report a narrow confidence interval, and therefore say that the effect is “small or zero” (see the sketch below).
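A sketch of that situation, with made-up data in which the true effect is zero and there are 1000 participants per group:

big.d = rnorm (1000, 50, 8)   # made-up data; the true group difference is zero
big.f = rnorm (1000, 50, 8)
t.test (big.d, big.f, var.equal=TRUE) $ conf.int   # a narrow interval around zero, roughly -0.7 ... +0.7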

.

.

.

Other things you should not do to lower your p-value

Many tricks are in Simmons et al. (2011). Summary of advice: use the word “only” and don’t lie.

“Test until significant”

This is one of the tricks mentioned by Simmons et al.: keep adding participants, testing after each batch, until the result is significant. The simulation below adds 10 participants per round, for up to 100 rounds, performs a t-test against zero after each round, and counts how often an experiment ever reaches significance, even though the true effect is zero:

num.rounds = 100   # the number of rounds; after each round a hypothesis test is performed
dN = 10   # the number of participants added in each round
num.experiments = 10000   # the number of experiments

value = rep (0, num.rounds * dN)   # room for all participants of one experiment
num.sig = 0   # the number of experiments that turn out significant
for (iexp in 1 : num.experiments) {
  for (iround in 1 : num.rounds) {
    N = dN * iround   # the total number of participants so far
    value [(N - dN + 1) : N] = rnorm (dN, 0, 1)   # add dN new participants
    average = sum (value [1 : N]) / N
    stdev = sqrt (sum ((value [1 : N] - average) ^ 2) / (N - 1))
    stderr = stdev / sqrt (N)
    if (abs (average) > stderr * qt (0.975, N - 1)) {   # a t-test against zero
      num.sig = num.sig + 1
      break   # stop the experiment as soon as it is "significant"
    }
  }
}
cat (num.sig / num.experiments)
## 0.3866

So don’t do that. Instead determine the number of participants in advance, or determine a stopping criterion in advance.
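For contrast, here is a sketch of the honest version, in which the number of participants is fixed in advance; the false-positive rate then stays at the nominal 5 percent:

num.experiments = 10000
N = 100   # the number of participants, fixed in advance
num.sig = 0
for (iexp in 1 : num.experiments) {
  value = rnorm (N, 0, 1)   # the true effect is again zero
  if (t.test (value) $ p.value < 0.05) {   # a one-sample t-test against zero
    num.sig = num.sig + 1
  }
}
cat (num.sig / num.experiments)   # approximately 0.05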