In a new paper in the European Journal for Philosophy of Science, I consider Fisher's criticism that the Neyman-Pearson approach to hypothesis testing relies on the assumption of “repeated sampling from the same population” (Rubin, 2020). This criticism is problematic for the Neyman-Pearson approach because it implies that test users need to know, for sure, what counts as the same or equivalent population as their current population. If they don't know what counts as the same or equivalent population, then they can't specify a procedure that would be able to repeatedly sample from this population, rather than from other non-equivalent populations, and without this specification Neyman-Pearson long run error rates become meaningless.
I argue that, by definition, researchers do not know for sure which features of their current populations are relevant and which are irrelevant. For example, in a psychology study, is the population “1st year undergraduate psychology students” or, more narrowly, “Australian 1st year undergraduate psychology students” or, more broadly, “psychology undergraduate students” or, even more broadly, “young people,” etc.? Researchers can make educated guesses about the relevant and irrelevant aspects of their population. However, they must concede that those guesses may be wrong. Consequently, if a researcher imagines a long run of repeated sampling, then they must imagine that they would make incorrect decisions about their null hypothesis due to not only Type I errors and Type II errors, but also Type III errors: errors caused by accidentally sampling from populations that are substantively different from their underspecified alternative and null populations. As Dennis et al. (2019) recently explained, "the 'Type 3' error of basing inferences on an inadequate model family is widely acknowledged to be a serious (if not fatal) scientific drawback of the Neyman-Pearson framework."
To be clear, the Neyman-Pearson approach does consider Type III errors. However, it considers them outside of each long run of repeated sampling. It does not allow Type III errors to occur inside a long run of repeated sampling, where the sampling must always be from a correctly specified family of "admissible" populations (Neyman, 1977, p. 106; Neyman & Pearson, 1933, p. 294). In my paper, I argue that researchers are unable to imagine a long run of repeated sampling from the same or equivalent populations as their current population because they are unclear about the relevant and irrelevant characteristics of their current population. Consequently, they are unable to rule out Type III errors within their imagined long run.
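As a rough illustration of this point (not taken from the paper), the sketch below simulates a long run of two-sided z-tests at a nominal alpha level of .05. The function name `rejection_rate`, the `contamination` and `shift` parameters, and all numeric values are hypothetical choices made for this example. In the first run, every sample comes from the intended null population. In the second run, a fraction of the samples unknowingly come from a substantively different population, which is one simple way to picture Type III errors occurring inside the long run.

```python
import random
from statistics import NormalDist


def rejection_rate(n_tests, contamination, shift=0.5, n=30, alpha=0.05, seed=1):
    """Share of two-sided z-tests (known sigma = 1) that reject H0: mu = 0.

    Each test is meant to sample from the null population N(0, 1).
    With probability `contamination` (a hypothetical parameter), the
    sample instead comes from a different population, N(shift, 1),
    without the researcher knowing.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(n_tests):
        # Decide which population this particular sample is drawn from.
        mu = shift if rng.random() < contamination else 0.0
        sample_mean = sum(rng.gauss(mu, 1.0) for _ in range(n)) / n
        z = sample_mean * n ** 0.5  # z statistic under H0: mu = 0, sigma = 1
        rejections += abs(z) > z_crit
    return rejections / n_tests


# Long run drawn only from the intended null population.
clean = rejection_rate(10_000, contamination=0.0)
# Long run in which 20% of samples come from a different population.
mixed = rejection_rate(10_000, contamination=0.2)
print(f"rejection rate, same population every time: {clean:.3f}")
print(f"rejection rate, 20% from another population: {mixed:.3f}")
```

Under these assumptions, the first rejection rate should sit close to the nominal .05, whereas the second should be noticeably higher. The nominal alpha level describes the long run only if every sample really does come from the population that the researcher has in mind.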
Following Fisher, I contrast scientific researchers with quality controllers in industrial production settings. Unlike researchers, quality controllers have clear knowledge about the relevant and irrelevant characteristics of their populations. For example, they are given a clear and unequivocal definition of Batch 57 on a production line, and they don't consider re-conceptualizing Batch 57 as including or excluding other features. They also know which aspects of their testing procedure are relevant and irrelevant, and they are provided with precise quality control standards that allow them to know, for sure, their smallest effect size of interest. Consequently, the Neyman-Pearson approach is suitable for quality controllers because quality controllers can imagine a testing process that draws random samples from the same population over and over again. In contrast, the Neyman-Pearson approach is not appropriate in scientific investigations because researchers do not have a clear understanding of the relevant and irrelevant aspects of their populations, their tests, or the smallest effect size that represents their population. Indeed, they are "researchers" because they are "researching" these things. Hence, it is researchers' self-declared ignorance and doubt about the nature of their populations that renders Neyman-Pearson long run error rates scientifically meaningless.
In my article, I also consider Pearson's (1947) and Neyman's (1977) responses to Fisher's "repeated sampling" criticism, focusing in particular on how it affected their conceptualization of the alpha level (nominal Type I error rate).
Pearson (1947) proposed that the alpha level can be considered in relation to a "hypothetical repetition" of the same test. However, as discussed above, this interpretation is only appropriate when test users are sure about all of the equivalent and non-equivalent aspects of their testing method and population. By definition, test users who adopt the role of "researcher" are not 100% sure about these things. As Fisher (1956, p. 78) explained, for researchers, “the population in question is hypothetical,…it could be defined in many ways, and…the first to come to mind may be quite misleading.”
Echoing Neyman and Pearson (1933), Pearson (1947) also suggested that the alpha level can be interpreted as a personal "rule" that guides researchers’ behavior during hypothesis testing across substantively different populations. However, this interpretation fails to acknowledge that a researcher may break their personal rule and use different alpha levels in different testing situations. As Fisher (1956, p. 42) put it, "no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses."
Addressing the rule-breaking problem, Neyman (1977) agreed with Fisher (1956) that the same researcher could use different alpha levels in different circumstances. However, he proposed that a researcher's average alpha level could be viewed as an indicator of their typical Type I error rate. So, for example, if Jane has an average alpha level of .050, and Tina has an average alpha level of .005, then, all else being equal, we can expect Jane to make more Type I errors than Tina over the course of their research careers. Critically, however, these two researchers’ average alpha levels tell us nothing about the Type I error rates of specific tests of specific hypotheses, and in that sense they are scientifically irrelevant. Hence, although Tina may have an average alpha of .005 over the course of her career, her alpha level for Test 1 of Hypothesis A may be .10, .001, or any other value. As scientists, rather than metascientists, we should be more interested in the nominal Type I error rate of Test 1 of Hypothesis A than in the typical Type I error rate of Tina.
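The arithmetic behind this point can be sketched in a few lines. The two "careers" below are hypothetical lists of per-test alpha levels invented for this example. Both average .005, yet they imply very different alpha levels for any particular test.

```python
from statistics import mean

# Two hypothetical careers for Tina, each consisting of ten tests.
# career_a uses the same alpha level for every test.
career_a = [0.005] * 10
# career_b uses a strict alpha for nine tests and a lenient one for the last.
career_b = [0.001] * 9 + [0.041]

for name, alphas in [("career_a", career_a), ("career_b", career_b)]:
    print(
        f"{name}: average alpha = {mean(alphas):.3f}, "
        f"smallest = {min(alphas):.3f}, largest = {max(alphas):.3f}"
    )
```

Knowing only the average of .005 does not reveal whether a given test in the list used an alpha level of .005, .001, or .041.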
I conclude that neither Neyman nor Pearson adequately rebutted Fisher’s “repeated sampling” criticism, and that their alternative interpretations of alpha levels are lacking. I then briefly outline Fisher’s own significance testing approach and consider how it avoids his "repeated sampling" criticism.