This is one of the tricks mentioned by Simmons et al. (2011).
In the following simulation, the p-value is measured after every 10 participants.
numberOfRounds = 10   # the number of rounds; after each round a hypothesis test is performed
numberOfParticipantsPerRound = 10   # the number of participants in each round
numberOfExperiments = 1e5   # the number of experiments
count = 0   # the number of experiments that turn out significant
for (experiment in 1 : numberOfExperiments) {
    mu = 0.0   # some arbitrary true effect size
    sigma = 1.0   # some arbitrary true standard deviation
    data = NA   # empty data to start with (NA = "not available"; t.test drops it)
    for (round in 1 : numberOfRounds) {
        newData = rnorm (numberOfParticipantsPerRound, mu, sigma)
        data = c(data, newData)
        p = t.test (data, mu = mu) $ p.value
        if (p < 0.05) {
            count = count + 1   # significance found...
            break   # ...so break off the experiment!
        }
    }
}
count
## [1] 20050
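Divided by the number of experiments (this one-liner is not in the original, but it is just the arithmetic behind the next sentence), this gives the simulated Type-I error rate:

count / numberOfExperiments   # 20050 / 1e5 = 0.2005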
So if the null hypothesis is true, there is a 20 percent chance that your p-value falls below 0.05 at some point: a “Type I error rate” of 20 percent. This is bad. If the null hypothesis is true, then by definition this chance should be 5 percent, not 20 percent.
So don’t add participants after looking at the data. Instead, determine the number of participants in advance, or determine a stopping criterion in advance. Good stopping criteria are fixed in advance and do not depend on how the analysis comes out; a sanity-check simulation follows below.
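For comparison, here is a minimal sketch (my own, not from Simmons et al.) of the honest design: the total number of participants is fixed at 100 in advance (the same 100 as 10 rounds of 10 above), and the t-test is run only once, at the end. The Type-I error rate then lands at its nominal 5 percent:

numberOfParticipants = 100   # fixed in advance, no peeking along the way
numberOfExperiments = 1e5   # the number of experiments
count = 0   # the number of experiments that turn out significant
for (experiment in 1 : numberOfExperiments) {
    data = rnorm (numberOfParticipants, 0.0, 1.0)   # the null hypothesis is true
    p = t.test (data, mu = 0.0) $ p.value   # one single test, at the end
    if (p < 0.05)
        count = count + 1
}
count / numberOfExperiments   # should come out near the nominal 0.05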
Analysis-driven stopping criteria do exist, for instance in the medical world, where life-and-death issues can play a role and it can be irresponsible to withhold treatment from the placebo group once it is clear that the drug is effective.
One of these was mentioned in the blog by Rolf Zwaan that Emiel van Miltenburg sent to all of us last night (that blog mentions both the Simmons paper and the Wagenmakers work!).
The criterion that Rolf Zwaan mentions was presented by Robert Frick (1998: “A better stopping rule for conventional statistical tests”, Behavior Research Methods, Instruments, & Computers). It involves the following procedure: run the hypothesis test after every batch of participants; if p < 0.01, stop and declare significance; if p > 0.36, stop and declare non-significance; otherwise, keep running participants.
With this procedure, the Type-I error rate stays below 5 percent, as Frick proves and as can be illustrated with a simulation:
numberOfRounds = 10   # the number of rounds; after each round a hypothesis test is performed
numberOfParticipantsPerRound = 10   # the number of participants in each round
numberOfExperiments = 1e5   # the number of experiments
count = 0   # the number of experiments that turn out significant
for (experiment in 1 : numberOfExperiments) {
    mu = 0.0   # some arbitrary true effect size
    sigma = 1.0   # some arbitrary true standard deviation
    data = NA   # empty data to start with (NA = "not available"; t.test drops it)
    for (round in 1 : numberOfRounds) {
        newData = rnorm (numberOfParticipantsPerRound, mu, sigma)
        data = c(data, newData)
        p = t.test (data, mu = mu) $ p.value
        if (p < 0.01) {
            count = count + 1   # significance found...
            break   # ...so break off the experiment!
        }
        if (p > 0.36) {
            break   # break off the experiment without raising the success count
        }
    }
}
count
## [1] 3302
The Type-I error rate is 0.033 here. If you allow even more than 100 participants, it can come close to 0.05, but it will never exceed it.
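To see that, you can wrap the simulation above in a function of the number of rounds (a sketch of my own; the function name frickTypeOneRate is not from Frick’s paper) and watch the rate creep upward while staying below 0.05:

frickTypeOneRate = function (numberOfRounds, numberOfExperiments = 1e4) {
    count = 0
    for (experiment in 1 : numberOfExperiments) {
        data = NA   # t.test drops the NA, as above
        for (round in 1 : numberOfRounds) {
            data = c(data, rnorm (10, 0.0, 1.0))   # the null hypothesis is true
            p = t.test (data, mu = 0.0) $ p.value
            if (p < 0.01) {
                count = count + 1   # stop and declare significance
                break
            }
            if (p > 0.36)
                break   # stop and declare non-significance
        }
    }
    count / numberOfExperiments
}
frickTypeOneRate (10)   # around 0.033, as above (noisier: only 1e4 experiments per call)
frickTypeOneRate (50)   # closer to 0.05, but still below it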
You can only report that \(\alpha = 0.05\). You cannot report p-values (you have measured 7 of them!) or confidence intervals (ditto). This makes the method poor for scientific fact-finding, though perhaps good for decision making about medical treatments.