Decision-making in research tasks with sequential testing.

Pfeiffer T, Rand DG, Dreber A - PLoS ONE (2009)

Bottom Line: In these scenarios, research tasks are solved sequentially, i.e. subsequent tests can be chosen depending on previous results. We investigate simple sequential testing and scenarios where only a selected subset of results can be published and used for future rounds of test choice. Our results may help optimize existing procedures used in the practice of scientific research and provide guidance for the development of novel forms of scholarly communication.


Affiliation: Program for Evolutionary Dynamics, Harvard University, Cambridge, Massachusetts, United States of America. pfeiffer@fas.harvard.edu

ABSTRACT

Background: In a recent controversial essay published in PLoS Medicine, JPA Ioannidis argued that in some research fields, most of the published findings are false. Theoretical reasoning shows that small effect sizes, error-prone tests, low priors of the tested hypotheses, and biases in the evaluation and publication of research findings all increase the fraction of false positives. These findings raise concerns about the reliability of research. However, they are based on a very simple scenario of scientific research, in which single tests are used to evaluate independent hypotheses.

Methodology/principal findings: In this study, we present computer simulations and experimental approaches for analyzing more realistic scenarios. In these scenarios, research tasks are solved sequentially, i.e. subsequent tests can be chosen depending on previous results. We investigate simple sequential testing and scenarios where only a selected subset of results can be published and used for future rounds of test choice. Results from computer simulations indicate that for the tasks analyzed in this study, the fraction of false among the positive findings declines over several rounds of testing if the most informative tests are performed. Our experiments show that human subjects frequently perform the most informative tests, leading to a decline of false positives as expected from the simulations.
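The sequential logic described here can be illustrated with a toy simulation (a hypothetical model for illustration, not the authors' actual task or code): a true hypothesis is hidden among several candidates; each round, one error-prone test is chosen; the posterior is updated by Bayes' rule; and "informative" test choice picks the test with the largest expected reduction in posterior entropy. The parameters (four candidates, error rates of 0.2, number of rounds) are arbitrary assumptions.

```python
import math
import random

ALPHA, BETA = 0.2, 0.2  # assumed false-positive / false-negative rates (illustrative)

def entropy(p):
    # Shannon entropy (in nats) of a discrete posterior
    return -sum(q * math.log(q) for q in p if q > 0)

def update(p, i, positive):
    # Bayes update of posterior p after observing a test of hypothesis i
    new = []
    for j, pj in enumerate(p):
        if j == i:
            like = (1 - BETA) if positive else BETA
        else:
            like = ALPHA if positive else (1 - ALPHA)
        new.append(pj * like)
    z = sum(new)
    return [q / z for q in new]

def expected_gain(p, i):
    # expected reduction in posterior entropy from testing hypothesis i
    p_pos = p[i] * (1 - BETA) + (1 - p[i]) * ALPHA
    return entropy(p) - (p_pos * entropy(update(p, i, True))
                         + (1 - p_pos) * entropy(update(p, i, False)))

def run_task(rounds, informative, rng, n_hyp=4):
    true_h = rng.randrange(n_hyp)
    p = [1.0 / n_hyp] * n_hyp  # uniform prior over the candidate hypotheses
    for _ in range(rounds):
        if informative:
            i = max(range(n_hyp), key=lambda j: expected_gain(p, j))
        else:
            i = rng.randrange(n_hyp)  # random test choice
        prob_pos = (1 - BETA) if i == true_h else ALPHA
        p = update(p, i, rng.random() < prob_pos)
    return math.log(p[true_h] / (1 - p[true_h]))  # final log odds of the truth

def mean_log_odds(informative, tasks=1000, rounds=10, seed=7):
    rng = random.Random(seed)
    return sum(run_task(rounds, informative, rng) for _ in range(tasks)) / tasks
```

Comparing `mean_log_odds(True)` with `mean_log_odds(False)` contrasts informative against random test choice in the spirit of the paper's SIM-2 vs. SIM-R comparison; in this toy model, informative choice yields higher final log odds for the true hypothesis.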

Conclusions/significance: For the research tasks studied here, findings tend to become more reliable over time. We also find that performance in the experimental settings where not all performed tests could be published was surprisingly poor. Our results may help optimize existing procedures used in the practice of scientific research and provide guidance for the development of novel forms of scholarly communication.

pone-0004607-g004: Experimental results for complex scenarios where in each round, two tests can be performed but only one result can be published. (A) Performance (mean log odds for the true hypothesis after the last round, with standard error of the mean). Performance in the settings with coordinated test choice (EXP-2G and EXP-2G*) falls between that of simulations with random test choice (SIM-R) and with informative test choice (SIM-2). Performance in the setting with independent test choice is worst, and no better than random, suggesting substantial negative effects of independent testing. Analogous to EXP-1S and EXP-1S*, knowing the error rates appears to be a disadvantage: participants in EXP-2G* outperform those in EXP-2G. (B) Fraction of false results among the positive results. Data from all three complex settings are pooled for panels B–D. The dynamics of false positives follow the pattern expected from simulation SIM-2. (C) Fraction of false results among the negative results. The fraction of false negatives in the experiments (EXP-2G, EXP-2G* and EXP-2E) is larger than expected from simulation SIM-2. (D) Frequency with which the chosen tests support the true hypotheses. Over the rounds, participants frequently select the tests that correspond to the correct sequence, leading to a decline of false positives. In contrast to the simulations (SIM-2; dashed grey line), selection against false findings does not work efficiently (solid grey line): while in the simulations false findings are increasingly selected against, the experiment shows no substantial improvement in avoiding their publication. This suggests that human subjects did not use background knowledge efficiently to avoid publishing false findings.

Mentions: For the more complex scenarios, we observe that performance in EXP-2G and EXP-2G* falls between the performance of the simulated scenario with random test choice and the simulated scenario with informative test choice and subsequent selection (Fig. 4A; t = 4.3, p = 4e-5; t = 4.3, p = 8e-5; for EXP-2G, EXP-2G* vs. SIM-R; and t = −7.3, p = 6e-11; t = −2.1, p = 0.04; for EXP-2G, EXP-2G* vs. SIM-2). Surprisingly, performance in EXP-2G is not much better than in EXP-1G (t = −0.23; p = 0.8). Thus, the participants did not take advantage of the additional information gained from performing two tests, although the computer simulations clearly demonstrate that this is possible. Analogous to the differences between EXP-1S* and EXP-1S, performance tends to be better in EXP-2G* than in EXP-2G (t = 1.7, p = 0.09). Again, we observe that informing participants about the error rates seems to have a negative impact on performance. Performance is worst in scenario EXP-2E (t = −2.1, p = 0.04 for EXP-2E vs. EXP-2G). In this scenario, performance is no better than random test choice (t = −0.07; p = 0.94). This outcome is surprising: one could expect performance in EXP-2G to exceed that in EXP-2E, because in EXP-2G the participant can choose tests in a coordinated fashion, while in EXP-2E tests are chosen independently. However, the fact that performance in EXP-2E is no better than random test choice indicates that, in the experiments, there is a substantial negative effect arising from independent test choice.

