Decision-making in research tasks with sequential testing.

Pfeiffer T, Rand DG, Dreber A - PLoS ONE (2009)

Bottom Line: In these scenarios, research tasks are solved sequentially, i.e. subsequent tests can be chosen depending on previous results. We investigate simple sequential testing and scenarios where only a selected subset of results can be published and used for future rounds of test choice. Our results may help optimize existing procedures used in the practice of scientific research and provide guidance for the development of novel forms of scholarly communication.


Affiliation: Program for Evolutionary Dynamics, Harvard University, Cambridge, Massachusetts, United States of America. pfeiffer@fas.harvard.edu

ABSTRACT

Background: In a recent controversial essay, published by JPA Ioannidis in PLoS Medicine, it has been argued that in some research fields, most of the published findings are false. Based on theoretical reasoning it can be shown that small effect sizes, error-prone tests, low priors of the tested hypotheses and biases in the evaluation and publication of research findings increase the fraction of false positives. These findings raise concerns about the reliability of research. However, they are based on a very simple scenario of scientific research, where single tests are used to evaluate independent hypotheses.
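The dependence of false positives on priors and error rates described above can be illustrated with a short calculation. This is a minimal sketch of the single-test scenario only, not code from the study; the function name and the parameter names `prior`, `alpha` and `beta` are assumptions introduced here, and the sketch ignores bias and effect size.

```python
def false_positive_fraction(prior, alpha, beta):
    """Fraction of positive findings that are false (single-test scenario).

    prior: probability that a tested hypothesis is true (assumed name)
    alpha: probability that a test of a false hypothesis comes out positive
    beta:  probability that a test of a true hypothesis comes out negative
    """
    true_positives = prior * (1.0 - beta)      # true hypotheses that test positive
    false_positives = (1.0 - prior) * alpha    # false hypotheses that test positive
    return false_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the study.
    print(f"{false_positive_fraction(0.1, 0.05, 0.2):.2f}")  # prints 0.36
    print(f"{false_positive_fraction(0.5, 0.05, 0.2):.2f}")  # prints 0.06
```

With these illustrative values, lowering the prior from 0.5 to 0.1 raises the share of false findings among the positives from about 6% to 36%, which is the kind of effect the background paragraph refers to.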

Methodology/principal findings: In this study, we present computer simulations and experimental approaches for analyzing more realistic scenarios. In these scenarios, research tasks are solved sequentially, i.e. subsequent tests can be chosen depending on previous results. We investigate simple sequential testing and scenarios where only a selected subset of results can be published and used for future rounds of test choice. Results from computer simulations indicate that for the tasks analyzed in this study, the fraction of false among the positive findings declines over several rounds of testing if the most informative tests are performed. Our experiments show that human subjects frequently perform the most informative tests, leading to a decline of false positives as expected from the simulations.
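To make the idea of sequential testing concrete, the sketch below updates the log odds of a single hypothesis after each test result using Bayes' rule. It is an illustration under simplifying assumptions, not the simulation used in the study: it assumes independent tests with fixed error rates `alpha` and `beta`, it does not model the choice of the most informative test or selective publication, and all function and parameter names are hypothetical.

```python
import math


def update_log_odds(log_odds, positive, alpha, beta):
    """Add the log likelihood ratio of one test result to the current log odds.

    alpha: probability of a positive result if the hypothesis is false
    beta:  probability of a negative result if the hypothesis is true
    """
    if positive:
        return log_odds + math.log((1.0 - beta) / alpha)
    return log_odds + math.log(beta / (1.0 - alpha))


def posterior_after(results, prior, alpha, beta):
    """Posterior probability of the hypothesis after a sequence of test results."""
    log_odds = math.log(prior / (1.0 - prior))
    for positive in results:
        log_odds = update_log_odds(log_odds, positive, alpha, beta)
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the study.
    print(f"{posterior_after([True], 0.1, 0.05, 0.2):.2f}")        # prints 0.64
    print(f"{posterior_after([True, True], 0.1, 0.05, 0.2):.2f}")  # prints 0.97
```

Under these assumptions, a second confirming test raises the posterior from about 0.64 to about 0.97, a simple instance of findings becoming more reliable over successive rounds.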

Conclusions/significance: For the research tasks studied here, findings tend to become more reliable over time. We also find that performance was surprisingly inefficient in those experimental settings where not all performed tests could be published. Our results may help optimize existing procedures used in the practice of scientific research and provide guidance for the development of novel forms of scholarly communication.

Figure 3 (pone-0004607-g003). Experimental results for the simple scenario where all results are published, and results from the corresponding simulations (SIM-R and SIM-1). (A) Performance (mean log odds for the true hypothesis after the last round, and standard error of the mean). Performance falls between that for random test choice (SIM-R) and that for the simulated scenario with informative test choice (SIM-1). This indicates that informative tests tend to be chosen preferentially but not always. The performance in EXP-1S seems to be worse than in EXP-1S* and EXP-1G. This implies that solving a task in the more complex group setting EXP-1G has no negative impact on performance. Moreover, knowing the error rates does not seem to be an advantage for problem solving in the experiment. (B) Fraction of false among the positive results. Data from all three simple settings are pooled for panels B–D. The dynamics of false positives follow the pattern expected from simulation SIM-1, but the pattern is less pronounced because the participants sometimes fail to select the most informative tests. (C) Fraction of false among the negative results. The pattern is as expected from simulation SIM-1, but it is again less pronounced because the participants sometimes fail to select the most informative test. (D) Frequency of those among the chosen tests that support the true hypotheses. Over the rounds, participants more often select those tests that correspond to the correct sequence. Thus false positives decrease while false negatives increase.

Mentions: Figure 3 shows that for the simple scenarios with single tests in each round (EXP-1S, EXP-1S* and EXP-1G), the odds for the true hypotheses lie between the odds from the simulations with random test choice (SIM-R) and the simulations with informed test choice (SIM-1). Performance is better than for random test choice (t = 1.4, p = 0.17; t = 3.3, p = 0.002; t = 2.6, p = 0.01 for EXP-1S, EXP-1G, EXP-1S* vs. SIM-R), but not as good as for informed test choice (t = −4.4, p = 3e-5; t = −2.1, p = 0.04; t = −2.0, p = 0.05 for EXP-1S, EXP-1G, EXP-1S* vs. SIM-1). This implies that the participants preferentially chose informative tests, but sometimes failed to pick the most informative one. As indicated above, the performance in EXP-1G and EXP-1S* tends to be better than the performance in setting EXP-1S (t = 1.4, p = 0.15; t = 1.2, p = 0.23 for EXP-1G, EXP-1S* vs. EXP-1S). Thus, participants had no problems with the somewhat more complicated setting of solving several tasks simultaneously in groups (EXP-1G) rather than individually (EXP-1S). The observation that performance in EXP-1S* is better than performance in EXP-1S suggests that the participants could not take advantage of knowing the error rates. On the contrary, knowing the error rates seems to have a negative effect on the heuristics used for solving the tasks.

