A reproducible approach to high-throughput biological data acquisition and integration.

Börnigen D, Moon YS, Rahnavard G, Waldron L, McIver L, Shafquat A, Franzosa EA, Miropolsky L, Sweeney C, Morgan XC, Garrett WS, Huttenhower C - PeerJ (2015)

Bottom Line: Although large systematic meta-analyses are among the most effective approaches both for clinical biomarker discovery and for computational inference of biomolecular mechanisms, identifying, acquiring, and integrating relevant experimental results from multiple sources for a given study can be time-consuming and error-prone. To enable efficient and reproducible integration of diverse experimental results, we developed a novel approach for standardized acquisition and analysis of high-throughput and heterogeneous biological data. Finally, we constructed integrated functional interaction networks to compare connectivity of peptide secretion pathways in the model organisms Escherichia coli, Bacillus subtilis, and Pseudomonas aeruginosa.


Affiliation: Biostatistics Department, Harvard School of Public Health, Boston, MA, USA; The Broad Institute of MIT and Harvard, Cambridge, MA, USA.

ABSTRACT
Modern biological research requires rapid, complex, and reproducible integration of multiple experimental results generated both internally and externally (e.g., from public repositories). Although large systematic meta-analyses are among the most effective approaches both for clinical biomarker discovery and for computational inference of biomolecular mechanisms, identifying, acquiring, and integrating relevant experimental results from multiple sources for a given study can be time-consuming and error-prone. To enable efficient and reproducible integration of diverse experimental results, we developed a novel approach for standardized acquisition and analysis of high-throughput and heterogeneous biological data. This allowed, first, novel biomolecular network reconstruction in human prostate cancer, which correctly recovered and extended the NFκB signaling pathway. Next, we investigated host-microbiome interactions. In less than an hour of analysis time, the system retrieved data and integrated six germ-free murine intestinal gene expression datasets to identify the genes most influenced by the gut microbiota, which comprised a set of immune-response and carbohydrate metabolism processes. Finally, we constructed integrated functional interaction networks to compare connectivity of peptide secretion pathways in the model organisms Escherichia coli, Bacillus subtilis, and Pseudomonas aeruginosa.



fig-5: Analysis and processing steps available for datasets from each data source. The main steps of ARepA are divided into four components: (1) Configuration and data integration: optional user-provided information can be merged with default data/metadata from the repositories. This allows, for example, integration of expert-curated metadata with automatically annotated metadata. (2) Custom data processing, including the default and customizable gene mapping and metadata annotation, as well as processes for file format detection and conversion. (3) Data normalization: gene identifiers are standardized, gene expression levels are normalized (e.g., log-transformed), missing values are imputed using k-nearest neighbors, and duplicate entries are merged. (4) Data export: data file formats are normalized to tab-delimited text, and co-expression networks are constructed in text and binary formats. Gene expression datasets and automatically generated documentation are further compiled into an R data file.
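The normalization component (step 3) lends itself to a short illustration. The following is a minimal sketch, not ARepA's implementation: it uses pandas and scikit-learn, and all probe, gene, and sample names are hypothetical.

```python
# Minimal sketch of the normalization step described above (step 3).
# Not ARepA's code; it illustrates the same operations with pandas and
# scikit-learn. Probe, gene, and sample names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy expression matrix: rows = probes, columns = samples.
expr = pd.DataFrame(
    {"sample1": [120.0, 8.0, np.nan, 45.0],
     "sample2": [95.0, np.nan, 15.0, 50.0]},
    index=["probe_a", "probe_b", "probe_c", "probe_a2"],
)

# (1) Standardize gene identifiers (a stand-in for the GeneMapper step).
probe_to_gene = {"probe_a": "GENE1", "probe_a2": "GENE1",
                 "probe_b": "GENE2", "probe_c": "GENE3"}
expr.index = expr.index.map(probe_to_gene)

# (2) Log-transform raw intensities.
expr = np.log2(expr + 1)

# (3) Impute missing values with k-nearest neighbors (k is arbitrary here).
imputer = KNNImputer(n_neighbors=2)
expr = pd.DataFrame(imputer.fit_transform(expr),
                    index=expr.index, columns=expr.columns)

# (4) Merge duplicate gene entries by averaging.
expr = expr.groupby(level=0).mean()

# (5) Export to tab-delimited text, as in the data-export step.
expr.to_csv("expression_normalized.tsv", sep="\t")
```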

Mentions: As an example of a dedicated module implementation, the GEO module first downloads raw gene expression data (SOFT or SeriesMatrix files) as well as platform annotations from the Gene Expression Omnibus (Fig. 5). The number of platforms used by each dataset is determined, and data for each platform are built recursively. The expression data are parsed, normalized, and imputed (Huttenhower et al., 2008), and gene IDs are standardized using GeneMapper. Finally, metadata are created for each dataset, combining automatically derived information from the SOFT files with manually curated tables, if available. The final outputs are a standardized and consistent gene expression matrix encoded as tab-delimited text, corresponding metadata, a normalized co-expression network, and an R data file containing an expression set. Other dedicated modules, such as those for interaction data, operate similarly, first downloading and parsing the requested data and directly generating tab-delimited text networks for each dataset (typically separated by taxon ID and publication ID). Standardization of interaction data and metadata is performed as described for the GEO module.
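As an illustration of the kind of processing the GEO module performs, the sketch below parses an already-downloaded SeriesMatrix file, separates the metadata lines from the expression table, and writes tab-delimited outputs plus a simple Pearson co-expression network. It is not ARepA's GEO module: the file name is a placeholder, and gene-ID standardization and imputation (see the previous sketch) are omitted for brevity.

```python
# Hedged sketch of a GEO-style processing step: parse a locally downloaded
# SeriesMatrix file, split metadata from the expression table, and build a
# simple co-expression network. Illustration only, not ARepA's GEO module.
import io
import pandas as pd

def parse_series_matrix(path):
    """Split a GEO SeriesMatrix file into metadata and an expression table."""
    metadata, table_lines, in_table = {}, [], False
    with open(path) as handle:
        for line in handle:
            if line.startswith("!series_matrix_table_begin"):
                in_table = True
            elif line.startswith("!series_matrix_table_end"):
                in_table = False
            elif in_table:
                table_lines.append(line)
            elif line.startswith("!"):
                parts = line.rstrip("\n").split("\t")
                key = parts[0].lstrip("!")
                values = [p.strip('"') for p in parts[1:]]
                metadata.setdefault(key, []).extend(values)
    expr = pd.read_csv(io.StringIO("".join(table_lines)), sep="\t", index_col=0)
    return metadata, expr

# Hypothetical, already-downloaded SeriesMatrix file.
metadata, expr = parse_series_matrix("GSEXXXX_series_matrix.txt")

# Export a standardized tab-delimited matrix and its metadata, mirroring the
# module's text outputs.
expr.to_csv("GSEXXXX_expression.tsv", sep="\t")
pd.Series({k: "; ".join(v) for k, v in metadata.items()}).to_csv(
    "GSEXXXX_metadata.tsv", sep="\t", header=False)

# Pearson co-expression network over genes (rows); fine for a sketch, but
# quadratic in the number of genes for real datasets.
coexpr = expr.transpose().corr(method="pearson")
coexpr.to_csv("GSEXXXX_coexpression.tsv", sep="\t")
```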

