Sunday, 27 September 2020

“Repeated sampling from the same population?” A critique of Neyman and Pearson’s responses to Fisher

In a new paper in the European Journal for Philosophy of Science, I consider Fisher's criticism that the Neyman-Pearson approach to hypothesis testing relies on the assumption of “repeated sampling from the same population” (Rubin, 2020). This criticism is problematic for the Neyman-Pearson approach because it implies that test users need to know, for sure, what counts as the same or equivalent population as their current population. If they don't know what counts as the same or equivalent population, then they can't specify a procedure that would be able to repeatedly sample from this population, rather than from other non-equivalent populations, and without this specification Neyman-Pearson long run error rates become meaningless.

I argue that, by definition, researchers do not know for sure what are the relevant and irrelevant features of their current populations. For example, in a psychology study, is the population “1st year undergraduate psychology students” or, more narrowly, “Australian 1st year undergraduate psychology students” or, more broadly, “psychology undergraduate students” or, even more broadly, “young people,” etc.? Researchers can make educated guesses about the relevant and irrelevant aspects of their population. However, they must concede that those guesses may be wrong. Consequently, if a researcher imagines a long run of repeated sampling, then they must imagine that they would make incorrect decisions about their null hypothesis due to not only Type I errors and Type II errors, but also Type III errors - errors caused by accidentally sampling from populations that are substantively different to their underspecified alternative and null populations. As Dennis et al. (2019) recently explained, "the 'Type 3' error of basing inferences on an inadequate model family is widely acknowledged to be a serious (if not fatal) scientific drawback of the Neyman-Pearson framework."

To be clear, the Neyman-Pearson approach does consider Type III errors. However, it considers them outside of each long run of repeated sampling. It does not allow Type III errors to occur inside a long run of repeated sampling, where the sampling must always be from a correctly specified family of "admissible" populations (Neyman, 1977, p. 106; Neyman & Pearson, 1933, p. 294). In my paper, I argue that researchers are unable to imagine a long run of repeated sampling from the same or equivalent populations as their current population because they are unclear about the relevant and irrelevant characteristics of their current population. Consequently, they are unable to rule out Type III errors within their imagined long run.

Following Fisher, I contrast scientific researchers with quality controllers in industrial production settings. Unlike researchers, quality controllers have clear knowledge about the relevant and irrelevant characteristics of their populations. For example, they are given a clear and unequivocal definition of Batch 57 on a production line, and they don't consider re-conceptualizing Batch 57 as including or excluding other features. They also know which aspects of their testing procedure are relevant and irrelevant, and they are provided with precise quality control standards that allow them to know, for sure, their smallest effect size of interest. Consequently, the Neyman-Pearson approach is suitable for quality controllers because quality controllers can imagine a testing process that repeatedly draws random samples from the same population over and over again. In contrast, the Neyman-Pearson approach is not appropriate in scientific investigations because researchers do not have a clear understanding of the relevant and irrelevant aspects of their populations, their tests, or the smallest effect size that represents their population. Indeed, they are "researchers" because they are "researching" these things. Hence, it is researchers' self-declared ignorance and doubt about the nature of their populations that renders Neyman-Pearson long run error rates scientifically meaningless.

Quality controllers sample blocks of cheese from Batch 57. Scientists are not quality controllers.

In my article, I also consider Pearson (1947) and Neyman's (1977) responses to Fisher's "repeated sampling" criticism, focusing in particular on how it affected their conceptualization of the alpha level (nominal Type I error rate).

Pearson (1947) proposed that the alpha level can be considered in relation to a "hypothetical repetition" of the same test. However, as discussed above, this interpretation is only appropriate when test users are sure about all of the equivalent and non-equivalent aspects of their testing method and population. By definition, test users who adopt the role of "researcher" are not 100% sure about these things. As Fisher (1956, p. 78) explained, for researchers, “the population in question is hypothetical,…it could be defined in many ways, and…the first to come to mind may be quite misleading.” 

Echoing Neyman and Pearson (1933), Pearson (1947) also suggested that the alpha level can be interpreted as a personal "rule" that guides researchers’ behavior during hypothesis testing across substantively different populations. However, this interpretation fails to acknowledge that a researcher may break their personal rule and use different alpha levels in different testing situations. As Fisher (1956, p. 42) put it, "no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses."

Addressing the rule-breaking problem, Neyman (1977) agreed with Fisher (1956) that the same researcher could use different alpha levels in different circumstances. However, he proposed that a researcher's average alpha level could be viewed as an indicator of their typical Type I error rate. So, for example, if Jane has an average alpha level of .050, and Tina has an average alpha level of .005, then we know that Jane will make more Type I errors than Tina during the course of her research career. Critically, however, these two researchers’ average alpha levels tell us nothing about the Type I error rates of specific tests of specific hypotheses, and in that sense they are scientifically irrelevant. Hence, although Tina may have an average alpha of .005 over the course of her career, her alpha level for Test 1 of Hypothesis A may be .10, .001, or any other value. As scientists, rather than metascientists, we should be more interested in the nominal Type I error rate of Test 1 of Hypothesis A than in the typical Type I error rate of Tina.
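
As a minimal sketch of this point, the following Python snippet uses a purely hypothetical set of alpha levels for Tina (the values and the name `tina_alphas` are illustrative assumptions, not data from the paper). It shows that a career-average alpha of .005 is compatible with a much larger alpha on any one specific test.

```python
from statistics import mean

# Hypothetical alpha levels that Tina might use across five tests.
# These values are illustrative assumptions only.
tina_alphas = [0.001, 0.001, 0.001, 0.002, 0.02]

# Career-average alpha level (Neyman's "typical" Type I error rate).
average_alpha = mean(tina_alphas)

# The largest alpha used on any single test.
largest_alpha = max(tina_alphas)

print(f"Average alpha: {average_alpha:.3f}")
print(f"Largest single-test alpha: {largest_alpha:.3f}")
```

In this sketch, the average says nothing about which test used which alpha level, which is the information that a reader of a specific test result would need.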

I conclude that neither Neyman nor Pearson adequately rebutted Fisher’s “repeated sampling” criticism, and that their alternative interpretations of alpha levels are lacking. I then briefly outline Fisher’s own significance testing approach and consider how it avoids his "repeated sampling" criticism.

Sunday, 20 September 2020

Does Preregistration Improve the Interpretability and Credibility of Research Findings?

Preregistration entails researchers registering their planned research hypotheses, methods, and analyses in a time-stamped document before they undertake their data collection and analyses. This document is then made available with the published research report in order to allow readers to identify discrepancies between what the researchers originally planned to do and what they actually ended up doing. In a recent article (Rubin, 2020), I question whether this historical transparency facilitates judgments of credibility over and above what I call the contemporary transparency that is provided by (a) clear rationales for current hypotheses and analytical approaches, (b) public access to research data, materials, and code, and (c) demonstrations of the robustness of research conclusions to alternative interpretations and analytical approaches.
Historical Transparency
My article covers issues such as HARKing, multiple testing, p-hacking, forking paths, exploratory analyses, optional stopping, researchers’ biases, selective reporting, test severity, publication bias, and replication rates. I argue for a nuanced approach to these issues. In particular, I argue that only some of these issues are problematic, and only under some conditions. I also argue that, when they are problematic, these issues can be identified via contemporary transparency per se.
Questionable Research Practices
I conclude that preregistration’s historical transparency does not facilitate judgments about the credibility of research findings when researchers provide contemporary transparency. Of course, in many cases, researchers do not provide a sufficient degree of contemporary transparency (e.g., no open research data or materials), and in these cases preregistration’s historical transparency may provide some useful information. 
However, I argue that historical transparency is a relatively narrow, researcher-centric form of transparency because it focuses attention on the predictions made by specific researchers at a specific point in time. In contrast, contemporary transparency allows research data to be considered from multiple, unplanned, theoretical and analytical perspectives while maintaining a high degree of research credibility. Hence, I suggest that the open science movement should push more towards contemporary transparency and less towards historical transparency.
Contemporary Transparency

Friday, 13 March 2020

What Type of Type I Error?


In a recent paper (Rubin, 2019), I consider two types of replication in relation to two types of Type I error probability. First, I consider the distinction between exact and direct replications. Exact replications duplicate all aspects of a study that could potentially affect the original result. In contrast, direct replications duplicate only those aspects of the study that are thought to be theoretically essential to reproduce the original result.

Second, I consider two types of Type I error probability. The Neyman-Pearson Type I error rate refers to the maximum frequency of incorrectly rejecting a null hypothesis if a test were to be repeated on a series of different random samples that are all drawn from the exact same null population. Hence, the Neyman-Pearson Type I error rate refers to a long run of exact replications. In contrast, the Fisherian Type I error probability is the probability of incorrectly rejecting a null hypothesis in relation to a hypothetical population that reflects the relevant characteristics of the particular sample under consideration. Hence, the Fisherian Type I error probability refers to a one-off sample rather than a series of samples that are drawn during a long run of exact replications.
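
The long-run interpretation can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a two-sided z test with a known standard deviation of 1, a true null mean of 0, and hypothetical settings for the sample size and number of replications. Every sample is drawn from the exact same null population, which is precisely the idealization that the Neyman-Pearson error rate requires.

```python
import random
from statistics import NormalDist


def long_run_type_i_rate(n_replications=20000, sample_size=30,
                         alpha=0.05, seed=1):
    """Proportion of exact replications that reject a true null hypothesis.

    Assumes a two-sided z test with known SD = 1 and a null mean of 0.
    """
    rng = random.Random(seed)
    # Critical z value for a two-sided test at the chosen alpha level.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_replications):
        # Each replication samples from the same, unchanging null population.
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        sample_mean = sum(sample) / sample_size
        z = sample_mean * sample_size ** 0.5
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_replications


rate = long_run_type_i_rate()
print(f"Observed long-run rejection rate: {rate:.3f}")
```

Under these assumptions, the observed rejection rate should sit close to the nominal alpha of .05. The simulation only behaves this way because the population is held fixed across replications.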

I argue that social science deals with units of analysis (people, social groups, and social systems) that change over time. As the Greek philosopher Heraclitus put it: “No man ever steps in the same river twice, for it’s not the same river and he’s not the same man.” Rivers and men are what Schmidt (2009, p. 92) called "irreversible units" in that they are complex time-sensitive systems that accumulate history. The scientific investigation of these irreversible units cannot proceed on the assumption that exact replications are possible. Consequently, the Neyman-Pearson Type I error rate is irrelevant in social science, because it relies on a concept of exact replication that cannot take place in the case of people, social groups, and social systems. Why should social scientists be interested in an error rate for an impossible process of resampling from the same fixed and unchanging population?

"No [wo]man ever steps in the same river twice, for it’s not the same river and [s]he’s not the same [wo]man" (Heraclitus)
I argue that the Fisherian Type I error probability is more appropriate in social science because it refers to one-off samples from hypothetical populations. In this case, researchers recognise that every sample comes from a potentially different population. Hence, researchers can apply the Fisherian Type I error probability to each sample-specific provisional decision that they make about rejecting the same substantive null hypothesis in a series of direct replications.

I conclude that the replication crisis may be partly (not wholly) due to researchers’ unrealistic expectations about replicability based on their consideration of the Neyman-Pearson Type I error rate across a long run of exact replications.

For further information, please see:

Rubin, M. (2019). What type of Type I error? Contrasting the Neyman-Pearson and Fisherian approaches in the context of exact and direct replications. Synthese. *Publisher’s version* *Self-archived version*

Thursday, 12 March 2020

Do p Values Lose their Meaning in Exploratory Analyses?

In Rubin (2017), I consider the idea that p values lose their meaning (become invalid) in exploratory analyses (i.e., non-preregistered analyses). I argue that this view is correct if researchers aim to control a familywise error rate that includes all of the hypotheses that they have tested, or could have tested, in their study (i.e., a universal, experimentwise, or studywise error rate). In this case, it is not possible to compute the required familywise error rate because the number of post hoc hypotheses that have been tested, or could have been tested, during exploratory analyses in the study is unknown. However, I argue that researchers are rarely interested in a studywise error rate because they are rarely interested in testing the joint studywise hypothesis to which this error rate refers.
For example, imagine that a researcher conducted a study in which they explored the associations between body weight and (1) gender, (2) age, (3) ethnicity, and (4) social class. This researcher is unlikely to be interested in a studywise null hypothesis that can be rejected following a significant result for any of their four tests, because this joint null hypothesis is unlikely to relate to any meaningful theory. Which theory proposes that gender, age, ethnicity, and social class all predict body weight for the same theoretical reason? And, if the researcher is not interested in making a decision about the studywise null hypothesis, then there is no need for them to lower the alpha level (α; the significance threshold) for each of their four tests (e.g., from α = .050 to α = .050/4 or .0125) in order to maintain the Type I error rate for their decision about the studywise hypothesis at α = .050. Instead, the researcher can test each of the four different associations individually (i.e., each at α = .050) in order to make a separate, independent claim about each of four theoretically independent hypotheses (e.g., "male participants weighed more than female participants, p = .021"). By analogy, a woman who takes a pregnancy test does not need to worry about the familywise error rate that either her pregnancy test, her fire alarm, or her email spam filter will yield a false positive result because the associated joint hypothesis is nonsensical.
Sometimes it doesn't make sense to combine different
hypotheses as part of the same family!
Researchers should only be concerned about the familywise error rate of a set of tests when that set refers to the same theoretically meaningful joint hypothesis. For example, a researcher who undertakes exploratory analyses should be concerned about the familywise error rate for the hypothesis that men weigh more than women if they use four different measures of weight, and they are prepared to accept a single significant difference on any of those four measures as grounds for rejecting the associated joint null hypothesis. In this case, they should reduce their alpha level for each constituent test (e.g., to α/4) in order to maintain their nominal Type I error rate for the joint hypothesis at α. Based on this reasoning, I argue that p values do not lose their meaning in exploratory analyses because (a) researchers are not usually interested in the studywise error rate, and (b) they are able to transparently and verifiably specify and control the familywise error rates for any theoretically meaningful post hoc joint hypotheses about which they make claims.
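
The arithmetic behind this adjustment can be sketched in a few lines of Python. The function names below are hypothetical, and the familywise formula assumes that the constituent tests are independent. The Bonferroni division itself does not need that assumption.

```python
def familywise_error_rate(alpha_per_test, n_tests):
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha_per_test) ** n_tests


def bonferroni_alpha(alpha_family, n_tests):
    """Per-test alpha that keeps the familywise rate at or below alpha_family."""
    return alpha_family / n_tests


# Four measures of weight, each tested at alpha = .05 without adjustment.
unadjusted = familywise_error_rate(0.05, 4)

# Per-test alpha after dividing the nominal alpha by the number of tests.
adjusted_alpha = bonferroni_alpha(0.05, 4)

# Familywise rate for the joint hypothesis after the adjustment.
adjusted = familywise_error_rate(adjusted_alpha, 4)

print(f"Unadjusted familywise rate: {unadjusted:.4f}")
print(f"Adjusted per-test alpha:    {adjusted_alpha:.4f}")
print(f"Adjusted familywise rate:   {adjusted:.4f}")
```

Without adjustment, the chance of at least one false positive across the four measures is roughly .19 rather than .05. Testing each measure at .0125 brings the rate for the joint hypothesis back to just under .05.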
I also recommend that researchers undertake a few basic open science practices during exploratory analyses in order to alleviate concerns about potential p-hacking: (1) List all of the variables in the research study. (2) Undertake a sensitivity analysis to demonstrate that the research results are robust to alternative analytical approaches. (3) Make the research data and materials publicly available to allow readers to check whether the results for any relevant measures have been omitted from the research report.

For further information, please see:
Rubin, M. (2017). Do p values lose their meaning in exploratory analyses? It depends how you define the familywise error rate. Review of General Psychology, 21, 269-275. *Publisher’s version* *Self-archived version*

Wednesday, 11 March 2020

The Costs of HARKing: Does it Matter if Researchers Engage in Undisclosed Hypothesizing After the Results are Known?

While no-one is looking, a Texas sharpshooter fires his gun at a barn wall. He then walks up to his bullet holes and paints targets around them. When his friends arrive, he points at the targets and claims that he’s a good shot (de Groot, 2014; Rubin, 2017b). In 1998, Norbert Kerr discussed an analogous situation in which researchers engage in undisclosed hypothesizing after the results are known, or HARKing. In this case, researchers conduct statistical tests, observe their research results (bullet holes), and then construct post hoc predictions (paint targets) to fit these results. In their research reports, they then pretend that their post hoc hypotheses are actually a priori hypotheses. This questionable research practice is thought to have contributed to the replication crisis in science (e.g., Shrout & Rodgers, 2018), and it provides part of the rationale for researchers to publicly preregister their hypotheses ahead of conducting their research (Wagenmakers et al., 2012). In a recent BJPS article, I discuss the concept of HARKing from a philosophical standpoint and then undertake a critical analysis of Kerr’s 12 potential costs of HARKing.


Source: Dirk-Jan Hoek. https://www.flickr.com/photos/23868780@N00/7374874302
I begin my article by arguing that scientists do not make absolute, dichotomous judgements about theories and hypotheses being “true” or “false.” Instead, they make relative judgements about theories and hypotheses being more or less true than other theories and hypotheses in accounting for certain phenomena. Such judgements can be described as estimates of relative verisimilitude (Cevolani & Festa, 2018).
I then note that HARKers are obliged to provide a theoretical rationale for each of their secretly post hoc hypotheses in the Introduction sections of their research reports. Despite being secretly post hoc, this theoretical rationale provides a result-independent basis for an initial estimate of the relative verisimilitude of a hypothesis. The reported research results can then provide a second, epistemically independent basis for adjusting this initial estimate (for a similar view, see Lewandowsky, 2019; Oberauer & Lewandowsky, 2019). Hence, readers can estimate the relative verisimilitude of a hypothesis (a) without taking the current result into account and (b) after taking the current result into account, even if they have been misled about when researchers constructed the hypothesis. Consequently, readers are able to undertake a valid counterfactual updating of their estimated relative verisimilitude of a hypothesis even though HARKing has occurred. Importantly, there is no “double-counting” (Mayo, 2008) or violation of the use novelty principle here (Worrall, 1985, 2014), because the current result contributes new information to an initial estimate of relative verisimilitude that has been generated in a result-independent manner.
To translate this reasoning to the Texas sharpshooter analogy, it is necessary to distinguish HARKing from p-hacking. If our sharpshooter painted a new target but retained his substantive claim that he is “a good shot,” then he would be similar to a researcher who conducted multiple statistical tests and then selectively reported only those results that supported their original a priori substantive hypothesis. Frequentist researchers would describe this researcher as a “p-hacker” rather than a HARKer (Rubin, 2017b, p. 325; Simmons et al., 2011). To be a HARKer, researchers must also change their original a priori hypothesis or create a totally new one. Hence, a more appropriate analogy is to consider a sharpshooter who changes both their statistical hypothesis (their target's location) and their broader substantive hypothesis (their claim).
For example, another sharpshooter, Jane, might initially believe “I’m a good shot” but, after seeing that she has missed the target that she was aiming for, she secretly paints a target around her stray bullet hole and declares to her friends: “I’m a good shot, but I can’t adjust for windy conditions.” Based on their a priori knowledge about Jane, her friends should be able to form an initial opinion about the verisimilitude of this claim (e.g., "Jane's always trained indoors. So, we can deduce that she hasn't learned to adjust for windy conditions.") To support her claim, Jane provides her friends with accurate procedural information about her shot (i.e., open research data and materials), including (a) the direction in which she was aiming her gun when she took the shot and (b) the speed and direction of the wind at the time of her shot. Her friends are then able to combine this procedural information with a priori theoretical information about the way in which gun shots are affected by the wind in order to calculate the predicted location of Jane’s bullet hole in a result-independent manner. They observe that this predicted location matches the location of Jane's bullet hole and (newly painted) target. Based on this match, they are warranted to increase their belief in the (secretly HARKed) hypothesis that Jane is a good shot but cannot adjust for windy conditions.
We can predict the location of the sharpshooter's bullet hole on the basis of her (secretly HARKed) hypothesis that she is a good shot but cannot adjust for windy conditions. We can then use the location of the bullet hole to increase or decrease our estimated relative verisimilitude for this prediction. Source: https://pixabay.com/photos/woman-rifle-shoot-gun-weapon-2577104/
I should note that my current paper contradicts one of the points that I made in my 2017b article "When does HARKing Hurt?" In that previous article, I argued that a result cannot be used to support a hypothesis if it has already been used to construct that hypothesis. In the current paper, I argue that even a result that has been used to construct a hypothesis can support that hypothesis if the hypothesis can be reconstructed on the basis of a priori theory and evidence that is epistemically independent from that result. The fact that the inspiration for a hypothesis can be, or has been, influenced by a result doesn't mean that it can't also be deduced independently of that result. And, if a hypothesis can be deduced independently of a result, then the result can be used to update an initial estimate of relative verisimilitude that is based on that deduction. HARKing is only problematic in the case of ad hoc accommodation, in which the rationale for the hypothesis or model is induced from the current data per se rather than deduced from a priori theory and evidence. In this case, there is no result-independent basis for establishing an initial estimate of relative verisimilitude.
The second part of my paper provides a critical analysis of Kerr’s (1998) 12 costs of HARKing. For further information, please see:
Rubin, M. (2019). The costs of HARKing. The British Journal for the Philosophy of Science. https://doi.org/10.1093/bjps/axz050 *Publisher’s free access* *Self-archived version*
References
Cevolani, G., & Festa, R. (2018). A partial consequence account of truthlikeness. Synthese. http://dx.doi.org/10.1007/s11229-018-01947-3
de Groot, A. D. (2014). The meaning of “significance” for different types of research (E. J. Wagenmakers, D. Borsboom, J. Verhagen, R. Kievit, M. Bakker, A. Cramer, . . . H. L. J. van der Maas). Acta Psychologica, 148, 188–194. http://dx.doi.org/10.1016/j.actpsy.2014.02.001
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196-217. http://dx.doi.org/10.1207/s15327957pspr0203_4
Lewandowsky, S. (2019). Avoiding Nimitz Hill with more than a little red book: Summing up #PSprereg. https://featuredcontent.psychonomic.org/avoiding-nimitz-hill-with-more-than-a-little-red-book-summing-up-psprereg/
Mayo, D. G. (2008). How to discount double-counting when it counts: Some clarifications. The British Journal for the Philosophy of Science, 59, 857–879.
Oberauer, K., & Lewandowsky, S. (2019). Addressing the theory crisis in psychology. Psychonomic Bulletin & Review. http://dx.doi.org/10.3758/s13423-019-01645-2
Rubin, M. (2017a). An evaluation of four solutions to the forking paths problem: Adjusted alpha, preregistration, sensitivity analyses, and abandoning the Neyman-Pearson approach. Review of General Psychology, 21, 321-329. http://dx.doi.org/10.1037/gpr0000135 *Self-archived version*
Rubin, M. (2017b). When does HARKing hurt? Identifying when different types of undisclosed post hoc hypothesizing harm scientific progress. Review of General Psychology, 21, 308-320. http://dx.doi.org/10.1037/gpr0000128 *Self-archived version*
Shrout, P. E., & Rodgers, J. L. (2018). Psychology, science, and knowledge construction: Broadening perspectives from the replication crisis. Annual Review of Psychology, 69, 487-510. http://dx.doi.org/10.1146/annurev-psych-122216-011845
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. http://dx.doi.org/10.1177/0956797611417632
Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632-638. http://dx.doi.org/10.1177/1745691612463078
Worrall, J. (1985). Scientific discovery and theory-confirmation. In J. C. Pitt (Ed.), Change and progress in modern science: Papers related to and arising from the Fourth International Conference on History and Philosophy of Science (pp. 301–331). Dordrecht, the Netherlands: Reidel. http://dx.doi.org/10.1007/978-94-009-6525-6_11
Worrall, J. (2014). Prediction and accommodation revisited. Studies in History and Philosophy of Science, 45, 54–61. http://dx.doi.org/10.1016/j.shpsa.2013.10.001

Citation: Rubin, M. (2020, March 11). The costs of HARKing: Does it matter if researchers engage in undisclosed hypothesising after the results are known? http://bit.ly/2DMJbmt

Tuesday, 13 June 2017

The Effects of Sexism on Women Miners' Mental Health and Job Satisfaction



In the mining industry, women make up only 19.4% of the workers in Canada, 16.4% in Australia, and 13.3% in the USA (Catalyst, 2015). In the present research, we investigated women’s experiences of sexism in this male-dominated industry and how these experiences related to women’s mental health and job satisfaction.

We surveyed 263 women miners from an Australian-based mining company that has operations in Australia, Africa, South America, and South East Asia. Participants responded to items about sexism, sense of belonging, mental health, and job satisfaction. 

Our research focused on two types of sexism: organizational sexism and interpersonal sexism. Organizational sexism refers to structural inequalities in an organization that are connected with opportunities for promotion and career progression, job stability, training, pay, competence, work-life balance, and performance standards. We found that women miners who felt relatively disadvantaged on these dimensions reported poorer mental health and job satisfaction. Hence, a potential strategy to improve women miners’ mental health and job satisfaction may be to reduce their perceived and actual disadvantage on these dimensions. This might be achieved through a combination of structural changes in the workplace (e.g., more opportunities for women miners’ career progression) and/or greater transparency in the gender-based similarities on these dimensions (e.g., publication of workforce statistics demonstrating equality of pay).

Interpersonal sexism refers to inappropriate images of women in the workplace, sexual harassment, and sexist comments. Like organizational sexism, interpersonal sexism was negatively related to mental health and job satisfaction. Interpersonal sexism is more ingrained in wider intergender relations in society, and addressing interpersonal sexism effectively is likely to require a partnership between employers and (male and female) employees.

A third variable that was associated with women miners' mental health and job satisfaction was sense of belonging in the industry. This variable mediated the effects of organizational sexism on job satisfaction. Hence, an additional approach towards improving women miners’ job satisfaction may be to increase their sense of belonging. An increased sense of belonging may be achieved by promoting community events both within the female group of miners (i.e., as a group of “women miners”) and within the industry as a whole (i.e., women identifying as “miners”).

We also found some interesting cross-country differences. Women who worked at Australian mine sites reported significantly less organizational and interpersonal sexism and fewer mental health problems than did women who worked at African, South American, and South East Asian worksites. These differences may reflect cross-cultural differences, with Australia’s more progressive Western culture prescribing less sexism and better mental health practices in the workplace.

It is important to note that our study’s cross-sectional correlational design prevents clear conclusions regarding the causal direction of the associations between the variables that we studied. Future research may wish to use longitudinal research designs to address this issue.

For further information about this research, please see the following journal article:

Rubin, M., Subasic, E., Giacomini, A., & Paolini, S. (2017). An exploratory study of the relations between women miners’ gender-based workplace issues and their mental health and job satisfaction. Journal of Applied Social Psychology. doi: 10.1111/jasp.12448

For an open access self-archived version, please click here.