This is one of the tricks mentioned by Simmons et al. (2011).

In the following simulation, the p-value is computed after every batch of 10 participants, and the experiment is broken off as soon as it falls below 0.05.

numberOfRounds = 10   # the number of rounds; after each round a hypothesis test is performed
numberOfParticipantsPerRound = 10   # the number of participants in each round
numberOfExperiments = 1e5   # the number of experiments

count = 0   # the number of experiments that turn out significant
for (experiment in 1 : numberOfExperiments) {
  mu = 0.0   # some arbitrary true effect size
  sigma = 1.0   # some arbitrary true standard deviation
  data = numeric (0)   # empty data to start with
  for (round in 1 : numberOfRounds) {
    newData = rnorm (numberOfParticipantsPerRound, mu, sigma)
    data = c(data, newData)
    p = t.test (data, mu = mu) $ p.value
    if (p < 0.05) {
      count = count + 1   # significance found...
      break   # ...so break off the experiment!
    }
  }
}
count
## [1] 20050
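
Expressed as a rate, this is simply the count above divided by the number of experiments:

count / numberOfExperiments   # 20050 out of 1e5 experiments reached significance at some point
## [1] 0.2005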

So there is a 20 percent chance that your p-value falls below 0.05 at some point, even though the null hypothesis is true: a “Type I error rate” of 20 percent. This is bad. If the null hypothesis is true, then by definition this chance should be 5 percent, not 20 percent.

Practical advice

So don’t add participants after peeking at the p-value. Instead, determine the number of participants in advance, or determine a stopping criterion in advance; a simulation check of the fixed-number approach follows the list below. Good stopping criteria are:

  1. Set the number of participants you invite to a fixed number, e.g. 40. Then if a participant cannot do the task, remove him or her from the group. In your paper, you report inviting 40 participants and removing e.g. 3 of them, thus running the analyses with 37 participants.
  2. Set the number of participants that you want to analyze to a fixed number, e.g. 40. Then if a participant cannot do the task, ignore him or her and invite an extra participant. In your paper, you report wanting to analyze 40 participants, inviting e.g. 44, and removing 4 of them.
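
As a sanity check of this advice, here is a minimal sketch: the same simulation as above, but with the number of participants fixed in advance and only one hypothesis test at the very end. The Type-I error rate should then come out close to the nominal 5 percent.

numberOfParticipants = 100   # fixed in advance; the same total as in the simulation above
numberOfExperiments = 1e5   # the number of experiments

count = 0   # the number of experiments that turn out significant
for (experiment in 1 : numberOfExperiments) {
  mu = 0.0   # some arbitrary true effect size
  sigma = 1.0   # some arbitrary true standard deviation
  data = rnorm (numberOfParticipants, mu, sigma)   # all data are collected at once
  p = t.test (data, mu = mu) $ p.value   # a single hypothesis test
  if (p < 0.05) {
    count = count + 1   # significance found
  }
}
count / numberOfExperiments   # comes out close to 0.05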

An amendment

Analysis-driven stopping criteria do exist, for instance in the medical world, where life and death issues can play a role and it can be irresponsible to withhold treatment from the placebo group if it is clear that the drug is effective.

One of these was mentioned in the blog by Rolf Zwaan that Emiel van Miltenburg sent to all of us last night (that blog mentions both the Simmons paper and the Wagenmakers work!).

The criterion that Rolf Zwaan mentions was presented by Robert Frick (1998: “A better stopping rule for conventional statistical tests”, Behavior Research Methods, Instruments, & Computers). It involves the following procedure:

  1. Run your first batch of e.g. 10 participants.
  2. Compute the p-value over all participants run so far.
  3. If your p-value is below 0.01, break off the procedure and report that the null hypothesis is rejected.
  4. If your p-value is above 0.36 or the number of participants has reached the maximum, e.g. 100, break off the procedure and report that the null hypothesis cannot be rejected.
  5. Run another batch of participants and go back to 2.

With this procedure, the Type-I error rate stays below 5 percent, as Frick proves and as can be illustrated with a simulation:

numberOfRounds = 10   # the number of rounds; after each round a hypothesis test is performed
numberOfParticipantsPerRound = 10   # the number of participants in each round
numberOfExperiments = 1e5   # the number of experiments

count = 0   # the number of experiments that turn out significant
for (experiment in 1 : numberOfExperiments) {
  mu = 0.0   # some arbitrary true effect size
  sigma = 1.0   # some arbitrary true standard deviation
  data = numeric (0)   # empty data to start with
  for (round in 1 : numberOfRounds) {
    newData = rnorm (numberOfParticipantsPerRound, mu, sigma)
    data = c(data, newData)
    p = t.test (data, mu = mu) $ p.value
    if (p < 0.01) {
      count = count + 1   # significance found...
      break   # ...so break off the experiment!
    }
    if (p > 0.36) {
      break   # break off the experiment without raising the success count
    }
  }
}
count
## [1] 3302

The Type-I error rate is 0.033 here. If you allow more than 100 participants, it can come close to 0.05, but it will never exceed it.
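
One way to probe this claim is to wrap the simulation above in a small function whose maximum number of rounds can be varied. The sketch below does just that; the function name frickTypeIRate is only illustrative, and the number of experiments is reduced to 1e4 to keep the run time manageable.

frickTypeIRate = function (numberOfRounds, numberOfParticipantsPerRound = 10,
                           numberOfExperiments = 1e4) {
  count = 0   # the number of experiments that turn out significant
  for (experiment in 1 : numberOfExperiments) {
    data = numeric (0)   # empty data to start with
    for (round in 1 : numberOfRounds) {
      data = c(data, rnorm (numberOfParticipantsPerRound, 0.0, 1.0))
      p = t.test (data, mu = 0.0) $ p.value
      if (p < 0.01) {
        count = count + 1   # significance found...
        break   # ...so break off the experiment
      }
      if (p > 0.36) {
        break   # break off without raising the success count
      }
    }
  }
  count / numberOfExperiments
}
frickTypeIRate (10)   # around 0.033, as in the simulation above
frickTypeIRate (50)   # up to 500 participants: closer to 0.05, but never above it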

Question: If after the first batch p = 0.04, and after the seventh batch finally p = 0.0078, what do you report?

  1. p = 0.04 and the confidence interval
  2. p = 0.0078 and the confidence interval
  3. only that the effect was significant at \(\alpha = 0.05\)
  4. don’t know

.

.

.

.

.

.

You can only report that the effect was significant at \(\alpha = 0.05\). You cannot report a p-value (you have computed seven of them!) or a confidence interval (ditto). This renders the method poor for scientific fact-finding, though perhaps good for decision making about medical treatments.