Tuesday, 25 April 2023

Questionable Metascience Practices

In this new article, I consider questionable research practices in the field of metascience. A questionable metascience practice (QMP) is a research practice, assumption, or perspective that's been questioned by several commentators as being potentially problematic for metascience and/or the science reform movement. I discuss 10 QMPs that relate to criticism, replication, bias, generalization, and the characterization of science. My aim is not to cast aspersions on the field of metascience but to encourage a deeper consideration of its more questionable research practices, assumptions, and perspectives.

10 QMPs

(1) Rejecting or ignoring self-criticism: Rejecting or ignoring criticisms of metascience and/or science reform

(2) Fast ‘n’ bropen scientific criticism: A quick, superficial, dismissive, and/or mocking style of scientific criticism

(3) Overplaying the role of replication: Assuming that replication is essential to science, and that it indexes “the truth”

(4) Unspecified replication rate targets: Assuming a replication rate is “too low” without specifying an “acceptable” rate

(5) Metabias: An unacknowledged bias towards explaining the replication crisis in terms of researcher bias

(6) A bias reduction assumption: Focusing on selective reporting as the primary form of researcher bias and assuming that it can be reduced without increasing other forms of bias

(7) Devaluing exploratory results: Devaluing an exploratory result as being more “tentative” than a confirmatory result without considering other relevant issues (e.g., quality of associated theory, methods, analyses, transparency)

(8) Presuming QRPs are problematic: Presuming that questionable research practices (e.g., HARKing) are always problematic research practices

(9) A focus on knowledge accumulation: Conceiving knowledge accumulation as the primary objective of science without considering (a) the role of specified ignorance or (b) different objectives in other philosophies of science

(10) Homogenizing science: Focusing on specific approaches (e.g., quantitative methods; replicable effects) as “the scientific method”

A Caveat

In my article, I stress that only some metascientists engage in some QMPs some of the time, and that these QMPs may not always be problematic. Research is required to estimate the prevalence and impact of QMPs. In the meantime, I think that QMPs should be viewed as invitations to ask questions about how we go about doing better metascience.

Further Information

The Article

Rubin, M. (2023). Questionable metascience practices. Journal of Trial and Error. https://doi.org/10.36850/mr4

Journal of Trial and Error Special Issue

My article is part of a special issue in the Journal of Trial and Error: “Consequences of the scientific reform movement: Is the scientific reform movement headed in the right direction?” https://journal.trialanderror.org/pub/callscientificreform/release/4

My Other Work in This Area

For my other work in the area of metascience, please see: https://sites.google.com/site/markrubinsocialpsychresearch/replication-crisis

Saturday, 27 August 2022

Exploratory hypothesis tests can be more compelling than confirmatory hypothesis tests

Researchers often distinguish between:

(1) Exploratory hypothesis tests - unplanned tests of post hoc hypotheses that may be based on the current results, and

(2) Confirmatory hypothesis tests - planned tests of a priori hypotheses that are independent from the current results

This distinction is supposed to be useful because exploratory results are assumed to be more “tentative” and “open to bias” than confirmatory results. In this recent paper, we challenge this assumption and argue that exploratory results can be more compelling than confirmatory results.

Our article has three parts. In the first part, we demonstrate that the same data can be used to generate and test a hypothesis in a transparently valid manner. We agree that circular reasoning can invalidate some exploratory hypothesis tests. However, circular reasoning can be identified by checking the contents of the reasoning without knowing the timing of that reasoning (i.e., a priori or post hoc).

Figure 1. An illustration of two ways in which exploratory data analyses may provide legitimate support for post hoc hypotheses

In the second part of our article, we argue that exploratory hypothesis tests can have several evidential advantages over confirmatory tests and, consequently, they have the potential to deliver more compelling research conclusions. In particular, exploratory hypothesis tests:

✅ avoid researcher commitment and prophecy biases

✅ reduce the motive for data fraud

✅ are more appropriate following unplanned deviations

✅ facilitate inference to the best explanation

✅ allow peer reviewers to contribute to exploratory analyses

Finally, in the third part of our article, we consider several potential *disadvantages* of exploratory hypothesis tests and conclude that these potential disadvantages may not be problematic. In particular, exploratory hypothesis tests are not necessarily disadvantaged due to:

✅ overfitting

✅ bias

✅ HARKing

✅ unacceptable research practices

And they:

✅ are usually necessary

✅ can be falsified

✅ can predict anything but may suffer an evaluative cost in doing so

To be clear, our claim is not that exploratory hypothesis tests are always more compelling than confirmatory tests or even that they are typically more compelling. Our claim is only that exploratory tests can be more compelling in specific research situations. More generally, we encourage researchers to evaluate specific tests and results on a case-by-case basis rather than to follow simplistic heuristics such as “exploratory results are more tentative,” which represents a form of methodolatry.

Our paper builds on some of my previous work on preregistration and HARKing. And please check out Szollosi and Donkin’s (2021) paper on “the misguided distinction between exploratory and confirmatory research.”

For more info, please see our open access article:

Rubin, M., & Donkin, C. (2022). Exploratory hypothesis tests can be more compelling than confirmatory hypothesis tests. Philosophical Psychology. https://doi.org/10.1080/09515089.2022.2113771

Monday, 4 April 2022

Two-Sided Significance Tests

In this paper (Rubin, 2022), I make two related points: (1) researchers should halve two-sided p values if they wish to use them to make directional claims, and (2) researchers should not halve their alpha level if they're using two one-sided tests to test two directional null hypotheses.

(1) Researchers should halve two-sided p values when making directional claims

Researchers sometimes conduct two-sided significance tests and then use the resulting two-sided p values to make directional claims. I argue that this approach is inappropriate because two-sided p values refer to non-directional hypotheses, rather than directional hypotheses.

So, for example, if you conduct a two-sided t test and obtain a significant two-sided p value, then your significant result refers to a non-directional null hypothesis (e.g., "men have the same self-esteem as women"), and you should make a corresponding non-directional claim (e.g., "men and women have significantly different self-esteem"). If you wish to make a directional claim (e.g., "men have significantly higher self-esteem than women"), then you should halve your two-sided p value to obtain a one-sided p value.

This first point is important because, if you use a two-sided p value to make a decision about a directional null hypothesis, then (a) your evidence will be weaker than it should be (i.e., your p value will be too large), and (b) your Type II error rate will be higher than necessary. For the same view, please see Georgi Georgiev’s onesided.org website here.
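
As a minimal sketch of the halving rule, the following Python example compares a two-sided and a one-sided p value for the same test statistic. It assumes a z test (a normal approximation, used here only to keep the example dependency-free, rather than the t test mentioned above), and the statistic of 2.10 and the function names are hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided p value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

def one_sided_p(z):
    """One-sided p value for the directional alternative 'difference > 0'."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical statistic: positive means men scored higher than women.
z = 2.10
p_two = two_sided_p(z)
p_one = one_sided_p(z)

print(f"two-sided p = {p_two:.4f}")
print(f"one-sided p = {p_one:.4f}")
print(f"half of two-sided p = {p_two / 2:.4f}")
```

Note that halving only gives the one-sided p value when the observed difference is in the predicted direction. If the difference is in the opposite direction, the one-sided p value is 1 minus half the two-sided p value.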

(2) Researchers should not halve their alpha level when using two one-sided tests

I also argue that, if you use two one-sided tests to test two directional null hypotheses, then it's not necessary to adjust your alpha level to compensate for multiple testing, because your decision about rejecting each directional hypothesis is based on a single test result, rather than multiple test results.

For example, imagine that you use a one-sided test to test the directional null hypothesis that “men have the same or lower self-esteem than women.” In this case, there's no need to lower your alpha level (e.g., from .050 to .025), because your Type I error rate only refers to a single test of a single null hypothesis. It doesn't refer to either (a) the other directional null hypothesis (i.e., “men have the same or higher self-esteem than women”) or (b) the non-directional null hypothesis (i.e., “men have the same self-esteem as women”). Consequently, no alpha adjustment is required. For similar views, please see Georgi Georgiev's piece here and my paper on multiple testing here.

When to Adjust Alpha During Multiple Testing

In this new paper (Rubin, 2021), I consider when researchers should adjust their alpha level (significance threshold) during multiple testing and multiple comparisons. I consider three types of multiple testing (disjunction, conjunction, and individual), and I argue that an alpha adjustment is only required for one of these three types.

There’s No Need to Adjust Alpha During Individual Testing

I argue that an alpha adjustment is not necessary when researchers undertake a single test of an individual null hypothesis, even when many such tests are conducted within the same study.

For example, in the jelly beans study below, it’s perfectly acceptable to claim that there’s “a link between green jelly beans and acne” using an unadjusted alpha level of .05 given that this claim is based on a single test of the hypothesis that green jelly beans cause acne rather than multiple tests of this hypothesis.

Retrieved from https://xkcd.com/882/

For a list of quotes from others that are consistent with my position on individual testing, please see Appendix B here.

To be clear, I’m not saying that an alpha adjustment is never necessary. It is necessary when at least one significant result would be sufficient to support a joint hypothesis that’s composed of several constituent hypotheses that each undergo testing (i.e., disjunction testing). For example, an alpha adjustment would be necessary to conclude that “jelly beans of one or more colours cause acne” because, in this case, a single significant result for at least one of the 20 colours of jelly beans would be sufficient to support this claim, and so a familywise error rate is relevant.
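
To make the disjunction case concrete, here is a rough calculation of the familywise error rate for the 20 jelly bean tests. It assumes that the 20 tests are independent and that every null hypothesis is true, and it uses a Bonferroni-style adjustment (alpha divided by the number of tests) purely for illustration. The variable names are hypothetical.

```python
alpha = 0.05   # unadjusted significance threshold for each test
n_tests = 20   # one test per jelly bean colour

# Probability of at least one false positive across all 20 tests,
# assuming independent tests and all null hypotheses true.
familywise_unadjusted = 1 - (1 - alpha) ** n_tests

# Bonferroni-style threshold for each constituent test.
alpha_adjusted = alpha / n_tests
familywise_adjusted = 1 - (1 - alpha_adjusted) ** n_tests

print(f"unadjusted familywise error rate: {familywise_unadjusted:.3f}")
print(f"adjusted per-test alpha: {alpha_adjusted:.4f}")
print(f"adjusted familywise error rate: {familywise_adjusted:.3f}")
```

The familywise rate only matters for the joint claim that "jelly beans of one or more colours cause acne." The claim about green jelly beans alone rests on a single test, so its error rate remains the unadjusted alpha.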

Studywise Error Rates are Not Usually Relevant

I also argue against the automatic (mindless) use of what I call studywise error rates – the familywise error rate that is associated with all of the hypotheses that are tested in a study. I argue that researchers should only be interested in studywise error rates if they are interested in testing the associated joint studywise hypotheses, and researchers are not usually interested in testing studywise hypotheses because they rarely have any theoretical relevance. As I explain in my paper, “in many cases, the joint studywise hypothesis has no relevance to researchers’ specific research questions, because its constituent hypotheses refer to comparisons and variables that have no theoretical or practical basis for joint consideration.”

For example, imagine that a researcher conducts a study in which they test gender, age, and nationality differences in alcohol use. Do they need to adjust their alpha level to account for their multiple testing? I argue “no” unless they want to test a studywise hypothesis that, for example: “Either (a) men drink more than women, (b) young people drink more than older people, or (c) the English drink more than Italians.” If the researcher does not want to test this potentially atheoretical joint hypothesis, then they should not be interested in controlling the associated familywise error rate, and instead they should consider each individual hypothesis separately. As I explain in my paper, “researchers should not be concerned about erroneous answers to questions that they are not asking.”

For a list of quotes that support my position on studywise error rates, please see Appendix A here.

My paper is a follow up to my 2017 paper that considers p values in exploratory analyses.

“Repeated sampling from the same population?” A critique of Neyman and Pearson’s responses to Fisher

In a new paper in the European Journal for Philosophy of Science, I consider Fisher's criticism that the Neyman-Pearson approach to hypothesis testing relies on the assumption of “repeated sampling from the same population” (Rubin, 2020). This criticism is problematic for the Neyman-Pearson approach because it implies that test users need to know, for sure, what counts as the same or equivalent population as their current population. If they don't know what counts as the same or equivalent population, then they can't specify a procedure that would be able to repeatedly sample from this population, rather than from other non-equivalent populations, and without this specification Neyman-Pearson long run error rates become meaningless.

I argue that, by definition, researchers do not know for sure what are the relevant and irrelevant features of their current populations. For example, in a psychology study, is the population “1st year undergraduate psychology students” or, more narrowly, “Australian 1st year undergraduate psychology students” or, more broadly, “psychology undergraduate students” or, even more broadly, “young people,” etc.? Researchers can make educated guesses about the relevant and irrelevant aspects of their population. However, they must concede that those guesses may be wrong. Consequently, if a researcher imagines a long run of repeated sampling, then they must imagine that they would make incorrect decisions about their null hypothesis due to not only Type I errors and Type II errors, but also Type III errors - errors caused by accidentally sampling from populations that are substantively different to their underspecified alternative and null populations. As Dennis et al. (2019) recently explained, "the 'Type 3' error of basing inferences on an inadequate model family is widely acknowledged to be a serious (if not fatal) scientific drawback of the Neyman-Pearson framework."

To be clear, the Neyman-Pearson approach does consider Type III errors. However, it considers them outside of each long run of repeated sampling. It does not allow Type III errors to occur inside a long run of repeated sampling, where the sampling must always be from a correctly specified family of "admissible" populations (Neyman, 1977, p. 106; Neyman & Pearson, 1933, p. 294). In my paper, I argue that researchers are unable to imagine a long run of repeated sampling from the same or equivalent populations as their current population because they are unclear about the relevant and irrelevant characteristics of their current population. Consequently, they are unable to rule out Type III errors within their imagined long run.

Following Fisher, I contrast scientific researchers with quality controllers in industrial production settings. Unlike researchers, quality controllers have clear knowledge about the relevant and irrelevant characteristics of their populations. For example, they are given a clear and unequivocal definition of Batch 57 on a production line, and they don't consider re-conceptualizing Batch 57 as including or excluding other features. They also know which aspects of their testing procedure are relevant and irrelevant, and they are provided with precise quality control standards that allow them to know, for sure, their smallest effect size of interest. Consequently, the Neyman-Pearson approach is suitable for quality controllers because quality controllers can imagine a testing process that repeatedly draws random samples from the same population over and over again. In contrast, the Neyman-Pearson approach is not appropriate in scientific investigations because researchers do not have a clear understanding of the relevant and irrelevant aspects of their populations, their tests, or the smallest effect size that represents their population. Indeed, they are "researchers" because they are "researching" these things. Hence, it is researchers' self-declared ignorance and doubt about the nature of their populations that renders Neyman-Pearson long run error rates scientifically meaningless.

Quality controllers sample blocks of cheese from Batch 57. Scientists are not quality controllers.

In my article, I also consider Pearson (1947) and Neyman's (1977) responses to Fisher's "repeated sampling" criticism, focusing in particular on how it affected their conceptualization of the alpha level (nominal Type I error rate).

Pearson (1947) proposed that the alpha level can be considered in relation to a "hypothetical repetition" of the same test. However, as discussed above, this interpretation is only appropriate when test users are sure about all of the equivalent and non-equivalent aspects of their testing method and population. By definition, test users who adopt the role of "researcher" are not 100% sure about these things. As Fisher (1956, p. 78) explained, for researchers, “the population in question is hypothetical,…it could be defined in many ways, and…the first to come to mind may be quite misleading.”

Echoing Neyman and Pearson (1933), Pearson (1947) also suggested that the alpha level can be interpreted as a personal "rule" that guides researchers’ behavior during hypothesis testing across substantively different populations. However, this interpretation fails to acknowledge that a researcher may break their personal rule and use different alpha levels in different testing situations. As Fisher (1956, p. 42) put it, "no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses."

Addressing the rule-breaking problem, Neyman (1977) agreed with Fisher (1956) that the same researcher could use different alpha levels in different circumstances. However, he proposed that a researcher's average alpha level could be viewed as an indicator of their typical Type I error rate. So, for example, if Jane has an average alpha level of .050, and Tina has an average alpha level of .005, then we can expect Jane to make more Type I errors than Tina over the course of their research careers. Critically, however, these two researchers’ average alpha levels tell us nothing about the Type I error rates of specific tests of specific hypotheses, and in that sense they are scientifically irrelevant. Hence, although Tina may have an average alpha of .005 over the course of her career, her alpha level for Test 1 of Hypothesis A may be .10, .001, or any other value. As scientists, rather than metascientists, we should be more interested in the nominal Type I error rate of Test 1 of Hypothesis A than in the typical Type I error rate of Tina.

I conclude that neither Neyman nor Pearson adequately rebutted Fisher’s “repeated sampling” criticism, and that their alternative interpretations of alpha levels are lacking. I then briefly outline Fisher’s own significance testing approach and consider how it avoids his "repeated sampling" criticism.

Does Preregistration Improve the Interpretability and Credibility of Research Findings?

Preregistration entails researchers registering their planned research hypotheses, methods, and analyses in a time-stamped document before they undertake their data collection and analyses. This document is then made available with the published research report in order to allow readers to identify discrepancies between what the researchers originally planned to do and what they actually ended up doing. In a recent article (Rubin, 2020), I question whether this historical transparency facilitates judgments of credibility over and above what I call the contemporary transparency that is provided by (a) clear rationales for current hypotheses and analytical approaches, (b) public access to research data, materials, and code, and (c) demonstrations of the robustness of research conclusions to alternative interpretations and analytical approaches.

My article covers issues such as HARKing, multiple testing, p-hacking, forking paths, exploratory analyses, optional stopping, researchers’ biases, selective reporting, test severity, publication bias, and replication rates. I argue for a nuanced approach to these issues. In particular, I argue that only some of these issues are problematic, and only under some conditions. I also argue that, when they are problematic, these issues can be identified via contemporary transparency per se.

I conclude that preregistration’s historical transparency does not facilitate judgments about the credibility of research findings when researchers provide contemporary transparency. Of course, in many cases, researchers do not provide a sufficient degree of contemporary transparency (e.g., no open research data or materials), and in these cases preregistration’s historical transparency may provide some useful information.

However, I argue that historical transparency is a relatively narrow, researcher-centric form of transparency because it focuses attention on the predictions made by specific researchers at a specific point in time. In contrast, contemporary transparency allows research data to be considered from multiple, unplanned, theoretical and analytical perspectives while maintaining a high degree of research credibility. Hence, I suggest that the open science movement should push more towards contemporary transparency and less towards historical transparency.

What Type of Type I Error?

In a recent paper (Rubin, 2021), I consider two types of replication in relation to two types of Type I error probability. First, I consider the distinction between exact and direct replications. Exact replications duplicate all aspects of a study that could potentially affect the original result. In contrast, direct replications duplicate only those aspects of the study that are thought to be theoretically essential to reproduce the original result.

Second, I consider two types of Type I error probability. The Neyman-Pearson Type I error rate refers to the maximum frequency of incorrectly rejecting a null hypothesis if a test was to be repeatedly reconducted on a series of different random samples that are all drawn from the exact same null population. Hence, the Neyman-Pearson Type I error rate refers to a long run of exact replications. In contrast, the Fisherian Type I error probability is the probability of incorrectly rejecting a null hypothesis in relation to a hypothetical population that reflects the relevant characteristics of the particular sample under consideration. Hence, the Fisherian Type I error rate refers to a one-off sample rather than a series of samples that are drawn during a long run of exact replications.
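
The Neyman-Pearson interpretation can be sketched as a simulation of a long run of exact replications. The example below is only an illustration under stated assumptions: a standard normal null population with a known standard deviation, a two-sided z test, and hypothetical function names, sample sizes, and seed.

```python
import math
import random

def z_test_p(sample, sigma=1.0):
    """Two-sided p value for H0: population mean = 0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

def long_run_rejection_rate(n_replications, n_per_sample, alpha, seed):
    """Share of exact replications that reject a true null hypothesis."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_replications):
        # Every sample is drawn from the exact same null population, N(0, 1).
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_sample)]
        if z_test_p(sample) <= alpha:
            rejections += 1
    return rejections / n_replications

rate = long_run_rejection_rate(
    n_replications=20000, n_per_sample=30, alpha=0.05, seed=1
)
print(f"long-run rejection rate: {rate:.3f}")
```

The rejection rate approaches alpha only because the simulation fixes the null population by construction. That fixed, unchanging population is exactly what a long run of exact replications requires.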

I argue that social science deals with units of analysis (people, social groups, and social systems) that change over time. As the Greek philosopher Heraclitus put it: “No man ever steps in the same river twice, for it’s not the same river and he’s not the same man.” Rivers and men are what Schmidt (2009, p. 92) called "irreversible units" in that they are complex time-sensitive systems that accumulate history. The scientific investigation of these irreversible units cannot proceed on the assumption that exact replications are possible. Consequently, the Neyman-Pearson Type I error rate is irrelevant in social science, because it relies on a concept of exact replication that cannot take place in the case of people, social groups, and social systems. Why should social scientists be interested in an error rate for an impossible process of resampling from the same fixed and unchanging population?

"No [wo]man ever steps in the same river twice, for it’s not the same river and [s]he’s not the same [wo]man" (Heraclitus)

I argue that the Fisherian Type I error probability is more appropriate in social science because it refers to one-off samples from hypothetical populations. In this case, researchers recognise that every sample comes from a potentially different population. Hence, researchers can apply the Fisherian Type I error probability to each sample-specific provisional decision that they make about rejecting the same substantive null hypothesis in a series of direct replications.

I conclude that the replication crisis may be partly (not wholly) due to researchers’ unrealistic expectations about replicability based on their consideration of the Neyman-Pearson Type I error rate across a long run of exact replications.

For further information, please see:

Rubin, M. (2021). What type of Type I error? Contrasting the Neyman-Pearson and Fisherian approaches in the context of exact and direct replications. Synthese, 198, 5809–5834. *Publisher’s version* *Self-archived version*

Thursday, 12 March 2020

Do p Values Lose their Meaning in Exploratory Analyses?

In Rubin (2017), I consider the idea that p values lose their meaning (become invalid) in exploratory analyses (i.e., non-preregistered analyses). I argue that this view is correct if researchers aim to control a familywise error rate that includes all of the hypotheses that they have tested, or could have tested, in their study (i.e., a universal, experimentwise, or studywise error rate). In this case, it is not possible to compute the required familywise error rate because the number of post hoc hypotheses that have been tested, or could have been tested, during exploratory analyses in the study is unknown. However, I argue that researchers are rarely interested in a studywise error rate because they are rarely interested in testing the joint studywise hypothesis to which this error rate refers.

For example, imagine that a researcher conducted a study in which they explored the associations between body weight and (1) gender, (2) age, (3) ethnicity, and (4) social class. This researcher is unlikely to be interested in a studywise null hypothesis that can be rejected following a significant result for any of their four tests, because this joint null hypothesis is unlikely to relate to any meaningful theory. Which theory proposes that gender, age, ethnicity, and social class all predict body weight for the same theoretical reason? And, if the researcher is not interested in making a decision about the studywise null hypothesis, then there is no need for them to lower the alpha level (α; the significance threshold) for each of their four tests (e.g., from α = .050 to α = .050/4 or .0125) in order to maintain the Type I error rate for their decision about the studywise hypothesis at α = .050. Instead, the researcher can test each of the four different associations individually (i.e., each at α = .050) in order to make a separate, independent claim about each of four theoretically independent hypotheses (e.g., "male participants weighed more than female participants, p = .021"). By analogy, a woman who takes a pregnancy test does not need to worry about the familywise error rate that either her pregnancy test, her fire alarm, or her email spam filter will yield a false positive result because the associated joint hypothesis is nonsensical.

Sometimes it doesn't make sense to combine different hypotheses as part of the same family!

Researchers should only be concerned about the familywise error rate of a set of tests when that set refers to the same theoretically meaningful joint hypothesis. For example, a researcher who undertakes exploratory analyses should be concerned about the familywise error rate for the hypothesis that men weigh more than women if they use four different measures of weight, and they are prepared to accept a single significant difference on any of those four measures as grounds for rejecting the associated joint null hypothesis. In this case, they should reduce their alpha level for each constituent test (e.g., to α/4) in order to maintain their nominal Type I error rate for the joint hypothesis at α. Based on this reasoning, I argue that p values do not lose their meaning in exploratory analyses because (a) researchers are not usually interested in the studywise error rate, and (b) they are able to transparently and verifiably specify and control the familywise error rates for any theoretically meaningful post hoc joint hypotheses about which they make claims.
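
The difference between individual and joint error rates can be illustrated with a small simulation. This is a sketch under assumptions that are not in the paper: the four tests are independent, all four null hypotheses are true, and each p value comes from a continuous test statistic (so that it is uniformly distributed under the null). The function name, number of simulated studies, and seed are hypothetical.

```python
import random

def simulate(n_studies, n_tests, alpha, seed):
    """Estimate individual and joint (disjunction) Type I error rates."""
    rng = random.Random(seed)
    first_test_rejections = 0  # errors for one individual hypothesis
    any_test_rejections = 0    # errors for the joint hypothesis
    for _ in range(n_studies):
        # Under a true null, a p value from a continuous test is uniform on [0, 1].
        p_values = [rng.random() for _ in range(n_tests)]
        if p_values[0] <= alpha:
            first_test_rejections += 1
        if min(p_values) <= alpha:
            any_test_rejections += 1
    return first_test_rejections / n_studies, any_test_rejections / n_studies

individual, joint = simulate(n_studies=50000, n_tests=4, alpha=0.05, seed=7)
_, joint_adjusted = simulate(n_studies=50000, n_tests=4, alpha=0.05 / 4, seed=7)

print(f"individual claim, alpha = .050: {individual:.3f}")
print(f"joint claim, alpha = .050 per test: {joint:.3f}")
print(f"joint claim, alpha = .0125 per test: {joint_adjusted:.3f}")
```

In this sketch, the error rate for any single claim stays near the unadjusted alpha regardless of how many other tests are run. Only the "any one of four" joint claim has an inflated error rate, and only that claim needs the adjusted threshold.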

I also recommend that researchers undertake a few basic open science practices during exploratory analyses in order to alleviate concerns about potential p-hacking: (1) List all of the variables in the research study. (2) Undertake a sensitivity analysis to demonstrate that the research results are robust to alternative analytical approaches. (3) Make the research data and materials publicly available to allow readers to check whether the results for any relevant measures have been omitted from the research report.

For further information, please see:

Rubin, M. (2017). Do p values lose their meaning in exploratory analyses? It depends how you define the familywise error rate. Review of General Psychology, 21, 269-275. *Publisher’s version* *Self-archived version*

Wednesday, 11 March 2020

The Costs of HARKing: Does it Matter if Researchers Engage in Undisclosed Hypothesizing After the Results are Known?

While no-one's looking, a Texas sharpshooter fires his gun at a barn wall, walks up to his bullet holes, and paints targets around them. When his friends arrive, he points at the targets and claims he’s a good shot (de Groot, 2014; Rubin, 2017b). In 1998, Norbert Kerr discussed an analogous situation in which researchers engage in undisclosed hypothesizing after the results are known or HARKing. In this case, researchers conduct statistical tests, observe their results (bullet holes), and then construct post hoc hypotheses (paint targets) to fit these results. In their research reports, they then pretend that their post hoc hypotheses are actually a priori hypotheses. This questionable research practice is thought to have contributed to the replication crisis in science (e.g., Shrout & Rodgers, 2018), and it provides part of the rationale for researchers to publicly preregister their hypotheses before they conduct their analyses (Wagenmakers et al., 2012). In a recent BJPS article (Rubin, 2019), I discuss the concept of HARKing from a philosophical standpoint and then undertake a critical analysis of Kerr’s 12 potential costs of HARKing.

Source: Dirk-Jan Hoek. https://www.flickr.com/photos/23868780@N00/7374874302

I begin my article by noting that scientists do not make absolute, dichotomous judgements about theories and hypotheses being “true” or “false.” Instead, they make relative judgements about theories and hypotheses being more or less true than other theories and hypotheses in accounting for certain phenomena. These judgements can be described as estimates of relative verisimilitude (Cevolani & Festa, 2018).

I then note that a HARKer is obliged to provide a theoretical rationale for their secretly post hoc hypothesis in the Introduction section of their research report. Despite being secretly post hoc, this theoretical rationale provides a result-independent basis for an initial estimate of the relative verisimilitude of the HARKed hypothesis. (The rationale is "result-independent" because it doesn't formally refer to the current result. If it did, then the rationale's post hoc status would no longer be a secret!) The current result can then provide a second, epistemically independent basis for adjusting this initial estimate of verisimilitude upwards or downwards (for a similar view, see Lewandowsky, 2019; Oberauer & Lewandowsky, 2019). Hence, readers can estimate the relative verisimilitude of a HARKed hypothesis (a) without taking the current result into account and (b) after taking the current result into account, even if they have been misled about when the researcher deduced the hypothesis. Consequently, readers can undertake a valid updating of the estimated relative verisimilitude of the hypothesis even though, unbeknownst to them, it has been HARKed. Importantly, there's no “double-counting” (Mayo, 2008), “circular reasoning” (Nosek et al., 2018), or violation of the use novelty principle here (Worrall, 1985, 2014), because the current result has not been used in the formal theoretical rationale for the HARKed hypothesis. Consequently, it's legitimate to use the current result to change (increase or decrease) the initial estimate of the relative verisimilitude of that hypothesis.

To translate this reasoning to the Texas sharpshooter analogy, it's necessary to distinguish HARKing from p-hacking. If our sharpshooter painted a new target around his stray bullet hole but retained his substantive claim that he's “a good shot,” then he'd be similar to a researcher who conducted multiple statistical tests and then selectively reported only those results that supported their original a priori substantive hypothesis. Frequentist researchers would call this researcher a “p-hacker” rather than a HARKer (Rubin, 2017b, p. 325; Simmons et al., 2011). To be a HARKer, researchers must also change their original a priori hypothesis or create a totally new one. Hence, a more appropriate analogy is to consider a sharpshooter who changes both their statistical hypothesis (i.e., paints a new target around their stray bullet hole) and their broader substantive hypothesis (their claim). Let's call her Jane!

Jane initially believes “I’m a good shot” (H1). However, after missing the target that she was aiming for (T1), she secretly paints a new target (T2) around her bullet hole and declares to her friends: "I'm a good shot, but I can't adjust for windy conditions. I aimed at T1, but there was a 30 mph easterly cross-wind. So, I knew I'd probably hit T2 instead." In this case, Jane has generated a new, post hoc hypothesis (H2) and passed it off as an a priori hypothesis. Note that, unlike our original Texas sharpshooter, Jane isn't being deceptive about her procedure here (i.e., what she actually did): It's true that she aimed her gun at T1. She's only being deceptive about the a priori status of H2, which she secretly developed after she missed T1 (i.e., she's HARKing). Importantly, however, Jane's deception doesn't prevent her friends from making a valid initial estimate of the verisimilitude of her HARKed hypothesis and then updating this estimate based on the location of her bullet hole:

"We know that Jane's always trained indoors. So, it makes sense that she hasn't learned to adjust for windy conditions. We also know that (a) Jane was aiming at T1, and (b) there was a 30 mph easterly cross-wind. Our calculations show that, if someone was a good shot, and they were aiming at T1, but they didn't adjust for a 30 mph easterly cross-wind, then their bullet would hit T2's location. So, our initial estimated verisimilitude for H2 is relatively high. The evidence shows that Jane's bullet did, in fact, hit T2. Consequently, we can tentatively increase our support for H2: Jane appears to be a good shot who can't adjust for windy conditions. Of course, we'd also want to test H2 again by asking Jane to hit targets on both windy and non-windy days!"
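The friends' comparison can be pictured with a small Python sketch. Everything here is an assumption made for illustration: positions are horizontal offsets in centimetres, the drift value is invented rather than derived from ballistics, and H1 is read as "a good shot who hits where she aims." The names `predicted_hit` and `prediction_error` are hypothetical.

```python
# All numbers are hypothetical; positions are horizontal offsets in cm.
T1_CM = 0.0              # centre of the target Jane aimed at
ASSUMED_DRIFT_CM = 40.0  # assumed uncorrected drift caused by the cross-wind
OBSERVED_HIT_CM = 38.0   # assumed location of the bullet hole (inside T2)


def predicted_hit(aim_cm: float, adjusts_for_wind: bool, drift_cm: float) -> float:
    """Where a good shot's bullet should land under each hypothesis."""
    return aim_cm if adjusts_for_wind else aim_cm + drift_cm


def prediction_error(predicted_cm: float, observed_cm: float) -> float:
    """Distance between a hypothesis's prediction and the observed hit."""
    return abs(predicted_cm - observed_cm)


# H1: good shot who hits where she aims. H2: good shot who can't adjust for wind.
h1_error = prediction_error(predicted_hit(T1_CM, True, ASSUMED_DRIFT_CM), OBSERVED_HIT_CM)
h2_error = prediction_error(predicted_hit(T1_CM, False, ASSUMED_DRIFT_CM), OBSERVED_HIT_CM)

better_supported = "H2" if h2_error < h1_error else "H1"
print(h1_error, h2_error, better_supported)
```

The prediction for H2 is computed from the aim point and the wind alone, before the bullet hole is consulted. The hole is then used only to compare the two predictions, which is the sense in which the support for H2 is relative to H1.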

We can predict the location of the sharpshooter's bullet hole on the basis of her (secretly HARKed) hypothesis that she is a good shot but cannot adjust for windy conditions. We can then use the location of the bullet hole to increase or decrease our estimated relative verisimilitude for this prediction. Source: https://pixabay.com/photos/woman-rifle-shoot-gun-weapon-2577104/

The second part of my paper provides a critical analysis of Kerr’s (1998) 12 costs of HARKing. For further information, please see:

Rubin, M. (2022). The costs of HARKing. The British Journal for the Philosophy of Science, 73. https://doi.org/10.1093/bjps/axz050 *Publisher’s free access* *Self-archived version*

References

Cevolani, G., & Festa, R. (2018). A partial consequence account of truthlikeness. Synthese. http://dx.doi.org/10.1007/s11229-018-01947-3
de Groot, A. D. (2014). The meaning of “significance” for different types of research (E. J. Wagenmakers, D. Borsboom, J. Verhagen, R. Kievit, M. Bakker, A. Cramer, . . . H. L. J. van der Maas). Acta Psychologica, 148, 188–194. http://dx.doi.org/10.1016/j.actpsy.2014.02.001
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196–217. http://dx.doi.org/10.1207/s15327957pspr0203_4
Lewandowsky, S. (2019). Avoiding Nimitz Hill with more than a little red book: Summing up #PSprereg. https://featuredcontent.psychonomic.org/avoiding-nimitz-hill-with-more-than-a-little-red-book-summing-up-psprereg/
Mayo, D. G. (2008). How to discount double-counting when it counts: Some clarifications. The British Journal for the Philosophy of Science, 59, 857–879.
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115, 2600–2606. http://dx.doi.org/10.1073/pnas.1708274114
Oberauer, K., & Lewandowsky, S. (2019). Addressing the theory crisis in psychology. Psychonomic Bulletin & Review. http://dx.doi.org/10.3758/s13423-019-01645-2
Rubin, M. (2017a). An evaluation of four solutions to the forking paths problem: Adjusted alpha, preregistration, sensitivity analyses, and abandoning the Neyman-Pearson approach. Review of General Psychology, 21, 321–329. http://dx.doi.org/10.1037/gpr0000135 *Self-archived version*
Rubin, M. (2017b). When does HARKing hurt? Identifying when different types of undisclosed post hoc hypothesizing harm scientific progress. Review of General Psychology, 21, 308–320. http://dx.doi.org/10.1037/gpr0000128 *Self-archived version*
Shrout, P. E., & Rodgers, J. L. (2018). Psychology, science, and knowledge construction: Broadening perspectives from the replication crisis. Annual Review of Psychology, 69, 487–510. http://dx.doi.org/10.1146/annurev-psych-122216-011845
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. http://dx.doi.org/10.1177/0956797611417632
Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. http://dx.doi.org/10.1177/1745691612463078
Worrall, J. (1985). Scientific discovery and theory-confirmation. In J. C. Pitt (Ed.), Change and progress in modern science: Papers related to and arising from the Fourth International Conference on History and Philosophy of Science (pp. 301–331). Dordrecht, the Netherlands: Reidel. http://dx.doi.org/10.1007/978-94-009-6525-6_11
Worrall, J. (2014). Prediction and accommodation revisited. Studies in History and Philosophy of Science, 45, 54–61. http://dx.doi.org/10.1016/j.shpsa.2013.10.001

Mark Rubin's Research

Saturday, 27 August 2022

Exploratory hypothesis tests can be more compelling than confirmatory hypothesis tests

Monday, 4 April 2022

Two-Sided Significance Tests

(1) Researchers should halve two-sided p values when making directional claims

(2) Researchers should not halve their alpha level when using two one-sided tests

Wednesday, 7 July 2021

When to Adjust Alpha During Multiple Testing

There’s No Need to Adjust Alpha During Individual Testing

Studywise Error Rates are Not Usually Relevant

Sunday, 27 September 2020

“Repeated sampling from the same population?” A critique of Neyman and Pearson’s responses to Fisher

Sunday, 20 September 2020

Does Preregistration Improve the Interpretability and Credibility of Research Findings?

Friday, 13 March 2020

What Type of Type I Error?

Thursday, 12 March 2020

Do p Values Lose their Meaning in Exploratory Analyses?

Wednesday, 11 March 2020

The Costs of HARKing: Does it Matter if Researchers Engage in Undisclosed Hypothesizing After the Results are Known?