Comparability and reproducibility of biomedical data.

Huang Y, Gottardo R - Brief. Bioinformatics (2012)

Bottom Line: Because of the multiple steps involved in the data generation and analysis and the lack of details provided, it can be difficult for independent researchers to try to reproduce a published study. The availability of high-quality, reproducible data will also lead to more powerful analyses (or meta-analyses) where multiple data sets are combined to generate new knowledge. In this article, we review some of the recent developments regarding biomedical reproducibility and comparability and discuss some of the areas where the overall field could be improved.


Affiliation: Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Mailstop M2-C200, Seattle, WA 98109-1024, USA.

ABSTRACT
With the development of novel assay technologies, biomedical experiments and analyses have gone through substantial evolution. Today, a typical experiment can simultaneously measure hundreds to thousands of individual features (e.g. genes) in dozens of biological conditions, resulting in gigabytes of data that need to be processed and analyzed. Because of the multiple steps involved in the data generation and analysis and the lack of details provided, it can be difficult for independent researchers to try to reproduce a published study. With the recent outrage following the halt of a cancer clinical trial due to the lack of reproducibility of the published study, researchers are now facing heavy pressure to ensure that their results are reproducible. Despite the global demand, too many published studies remain non-reproducible mainly due to the lack of availability of experimental protocol, data and/or computer code. Scientific discovery is an iterative process, where a published study generates new knowledge and data, resulting in new follow-up studies or clinical trials based on these results. As such, it is important for the results of a study to be quickly confirmed or discarded to avoid wasting time and money on novel projects. The availability of high-quality, reproducible data will also lead to more powerful analyses (or meta-analyses) where multiple data sets are combined to generate new knowledge. In this article, we review some of the recent developments regarding biomedical reproducibility and comparability and discuss some of the areas where the overall field could be improved.


Figure 1 (bbs078-F1): Life cycle of scientific discoveries. The overall cycle is broken down into five different steps. After completion of all steps according to the reproducibility guidelines (Table 1), the results would rapidly lead to confirmed (or discarded) discoveries. The confirmed discoveries would then be translated into new knowledge and data supporting novel studies.

Mentions: We examine a prototypical biomedical data generation process to illustrate factors that may negatively impact the C&R of the data throughout different stages of the process. As shown in Figure 1, a data generation process can be roughly broken down into three core stages (Steps 1–3) of information transformation, from signals contained in biological samples to numeric values captured in data sets for analysis.

In Step 1, biological samples are measured and raw instrument data are generated. Several factors may influence the C&R of data at this stage. These include some obvious factors such as the specific type of technology (e.g. hybridization-based or sequence-based gene expression) [4–7] or platform (e.g. Affymetrix, Illumina or Operon) [8–12], the Standard Operating Procedures (SOPs) for biological sample preparation, experimental design, experiment layout and measurement [13, 14], as well as other conditions that are often not specified in the experimental protocol. For example, the level of experience or expertise of the technicians performing the experiment [15, 16], or the origin of the reagents (e.g. batch effects [17, 18]), are also possible sources of differences between independent experimental results. Therefore, in Step 1, to increase the C&R of data, all these factors should be thought out and optimally controlled and standardized whenever possible. When factors such as technicians or reagent batches cannot be standardized across multiple studies or laboratories, a measuring system composed of a specific platform using a specific technology should strive to minimize variations caused by these factors and increase robustness against changes in them. Whenever possible, the SOPs should be shared and made available to the community. Several online platforms are now available for storing and sharing such information, including ‘elabprotocols’ (elabprotocols.com) and ‘figshare’ (figshare.com).

In Step 2, raw information from an instrument is calibrated and quantified into numeric values. This step often involves image analyses for information alignment and/or dimension reduction. Consequently, the specific algorithms used to make such transformations, their implementation in software and the specific data storage structures, including data formats (e.g. databases or flat files) and variable naming conventions, are vital to maintaining data consistency and should be standardized and recorded as completely as possible for effective C&R of the data. We will refer to the data derived from this step as primary data, versus the secondary data generated after Step 3. In some cases, primary data are derived directly from the instrument, but in many cases the extremely large size of the raw data (e.g. raw images) makes sharing them prohibitive, and the lack of true raw data is accepted. In the ‘Standards and Data Sharing’ section, we provide further discussion of data standards and data sharing.

Finally, in Step 3, data from Step 2 are further (pre-)processed before study objective-driven analyses are conducted. This latter step often involves further data alignment, such as background adjustment, or data aggregation, such as per-biomarker summarization from multiple subset measurements. Certain quality assurance and control processing may also occur to remove unreliable data and reduce any systematic variations between data points.
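As a concrete illustration of the Step 3 preprocessing described above, the following Python sketch performs a background adjustment, a simple quality filter and a per-biomarker summarization. The column names, the median-based summarization and the flat data layout are illustrative assumptions for this sketch, not the specific pipeline described in the article; real assay data use platform-specific formats.

```python
import pandas as pd

def preprocess(primary: pd.DataFrame, background: float) -> pd.DataFrame:
    """Illustrative Step 3 preprocessing: background adjustment,
    quality filtering and per-biomarker summarization.

    Assumes `primary` has columns 'biomarker', 'sample' and 'signal'
    (a hypothetical layout chosen for this example).
    """
    df = primary.copy()
    # Background adjustment: subtract an instrument background level,
    # clipping at zero so adjusted signals stay non-negative.
    df["adjusted"] = (df["signal"] - background).clip(lower=0)
    # Quality control: drop unreliable (missing) measurements.
    df = df.dropna(subset=["adjusted"])
    # Per-biomarker summarization: aggregate replicate/subset
    # measurements into one value per biomarker and sample.
    return (df.groupby(["biomarker", "sample"])["adjusted"]
              .median()
              .reset_index(name="summary"))

# Example usage with toy data (values are made up):
raw = pd.DataFrame({
    "biomarker": ["IL2", "IL2", "IFNg"],
    "sample": ["s1", "s1", "s1"],
    "signal": [120.0, 130.0, 95.0],
})
print(preprocess(raw, background=100.0))
```

Every choice in such a pipeline (the background level, the aggregation statistic, the filtering rule) affects the secondary data, which is why the article argues these specifications must be recorded and shared.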
As in Step 2, the specification and implementation of the algorithms and the data storage structures should be tracked in order to maintain the C&R of the data. In the ‘Reproducibility of Assay Results and Derived Data’ section, we discuss some of the tools available for sharing Step 2 data and the associated computer code for data processing and analysis.
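To make this tracking requirement concrete, the sketch below writes a minimal provenance manifest recording the software version, the algorithm parameters and checksums of the input data files. The manifest fields and file names are hypothetical, shown only as one possible way to keep such records; the article itself does not prescribe a format.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

def write_manifest(data_files, params, out_path="manifest.json"):
    """Write a minimal provenance manifest (hypothetical format) so
    that independent researchers can verify the C&R of derived data."""
    manifest = {
        "created": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "parameters": params,  # e.g. {"background": 100.0}
        "inputs": {},
    }
    for path in data_files:
        with open(path, "rb") as fh:
            # A SHA-256 checksum ties the manifest to exact file contents.
            manifest["inputs"][path] = hashlib.sha256(fh.read()).hexdigest()
    with open(out_path, "w") as fh:
        json.dump(manifest, fh, indent=2)

# Hypothetical usage:
# write_manifest(["primary_data.csv"],
#                {"background": 100.0, "summary": "median"})
```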

