A reproducible approach to high-throughput biological data acquisition and integration.

Börnigen D, Moon YS, Rahnavard G, Waldron L, McIver L, Shafquat A, Franzosa EA, Miropolsky L, Sweeney C, Morgan XC, Garrett WS, Huttenhower C - PeerJ (2015)

Bottom Line: Although large systematic meta-analyses are among the most effective approaches both for clinical biomarker discovery and for computational inference of biomolecular mechanisms, identifying, acquiring, and integrating relevant experimental results from multiple sources for a given study can be time-consuming and error-prone. To enable efficient and reproducible integration of diverse experimental results, we developed a novel approach for standardized acquisition and analysis of high-throughput and heterogeneous biological data. Finally, we constructed integrated functional interaction networks to compare connectivity of peptide secretion pathways in the model organisms Escherichia coli, Bacillus subtilis, and Pseudomonas aeruginosa.


Affiliation: Biostatistics Department, Harvard School of Public Health, Boston, MA, USA; The Broad Institute of MIT and Harvard, Cambridge, MA, USA.

ABSTRACT
Modern biological research requires rapid, complex, and reproducible integration of multiple experimental results generated both internally and externally (e.g., from public repositories). Although large systematic meta-analyses are among the most effective approaches both for clinical biomarker discovery and for computational inference of biomolecular mechanisms, identifying, acquiring, and integrating relevant experimental results from multiple sources for a given study can be time-consuming and error-prone. To enable efficient and reproducible integration of diverse experimental results, we developed a novel approach for standardized acquisition and analysis of high-throughput and heterogeneous biological data. This allowed, first, novel biomolecular network reconstruction in human prostate cancer, which correctly recovered and extended the NFκB signaling pathway. Next, we investigated host-microbiome interactions. In less than an hour of analysis time, the system retrieved data and integrated six germ-free murine intestinal gene expression datasets to identify the genes most influenced by the gut microbiota, which comprised a set of immune-response and carbohydrate metabolism processes. Finally, we constructed integrated functional interaction networks to compare connectivity of peptide secretion pathways in the model organisms Escherichia coli, Bacillus subtilis, and Pseudomonas aeruginosa.


Figure 1: ARepA is an extensible, modular Automated Repository Acquisition system for reproducible biological data acquisition and processing. ARepA is a framework for reproducible biological data mining and analysis. It can retrieve heterogeneous data from multiple public repositories in a uniform environment and format, currently allowing configurable data access for any organism(s) to the Gene Expression Omnibus (GEO) (Barrett et al., 2011), IntAct (Kerrien et al., 2012), BioGRID (Stark et al., 2011), RegulonDB (Gama-Castro et al., 2011), STRING (Szklarczyk et al., 2011), Bacteriome (Su et al., 2008), and MPIDB (Goll et al., 2008) databases. Using ARepA involves three steps: (i) user input, (ii) data processing, and (iii) output formatting. The input phase is the only part of the ARepA pipeline that requires direct user oversight. The user's input to ARepA can be as simple as a list of organisms of interest; ARepA then uses this list as a query to recover interactome network and gene expression data specific to those organisms. Advanced users can also provide custom gene mapping files, metadata, and/or normalization schemes, and can fine-tune the list of data sources to be searched. The data processing phase is divided into a series of automated steps in which raw interactome network and gene expression data are downloaded, converted to a common gene-naming scheme, and normalized for between-dataset comparison. During this phase, integrated gene expression data are analyzed for co-expression relationships, contributing an additional co-expression network to the final network output. All network data are provided to the user as text files, while all expression data and associated metadata are saved as individual text files and as an R data file.
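The processing steps described above (conversion to a common gene-naming scheme, normalization, and co-expression analysis) can be illustrated with a minimal sketch. This is not ARepA's own code: the function names, the averaging of probes that map to the same gene, the per-gene z-score normalization, and the correlation cutoff of 0.8 are all assumptions chosen for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev


def map_gene_ids(expression, id_map):
    """Rename rows to a common gene-naming scheme and drop unmapped IDs.

    If several source IDs map to the same gene, their profiles are averaged
    (an assumption made for this sketch).
    """
    grouped = {}
    for source_id, values in expression.items():
        gene = id_map.get(source_id)
        if gene is not None:
            grouped.setdefault(gene, []).append(values)
    return {gene: [mean(col) for col in zip(*rows)]
            for gene, rows in grouped.items()}


def zscore_rows(expression):
    """Standardize each gene's profile to mean 0 and unit variance."""
    normalized = {}
    for gene, values in expression.items():
        mu, sd = mean(values), pstdev(values)
        if sd > 0:  # constant profiles carry no co-expression signal
            normalized[gene] = [(v - mu) / sd for v in values]
    return normalized


def coexpression_edges(zscores, threshold=0.8):
    """Return (gene_a, gene_b, r) for pairs with |Pearson r| >= threshold.

    For profiles standardized with the population standard deviation,
    Pearson r is the mean of the element-wise products.
    """
    edges = []
    for a, b in combinations(sorted(zscores), 2):
        x, y = zscores[a], zscores[b]
        r = sum(p * q for p, q in zip(x, y)) / len(x)
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges


# Hypothetical toy data: probe-level expression across four samples.
raw = {
    "probe_1": [1.0, 2.0, 3.0, 4.0],
    "probe_2": [2.0, 4.0, 6.0, 8.0],
    "probe_3": [4.0, 3.0, 2.0, 1.0],
    "probe_4": [5.0, 5.0, 5.0, 5.0],   # constant, removed by normalization
    "probe_x": [1.0, 3.0, 2.0, 4.0],   # no mapping, dropped
}
id_map = {"probe_1": "geneA", "probe_2": "geneB",
          "probe_3": "geneC", "probe_4": "geneD"}

mapped = map_gene_ids(raw, id_map)
edges = coexpression_edges(zscore_rows(mapped))
for edge in edges:
    print(*edge, sep="\t")
```

Writing the edges as tab-separated text mirrors the caption's statement that network data are provided as text files; the exact file layout ARepA uses is not specified here.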
The bottom panel illustrates how generated data can be integrated by subsequent network (see prostate cancer and bacterial studies) or expression (see murine differential gene expression) meta-analysis. For example, network integration is a convenient way to combine multiple datasets of different types and sources, such as co-expression, physical gene interactions, regulatory interactions, or posttranslational modification, into one functional network.
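As a rough illustration of network integration, the sketch below merges edge lists of different evidence types into a single undirected, weighted network. The combination rule is a noisy-OR over per-source confidences, which is a stand-in chosen for simplicity and not necessarily the integration method used in the study; the function name, the input layout, and the confidence values are hypothetical.

```python
from collections import defaultdict


def integrate_networks(networks):
    """Merge per-source edge lists into one undirected weighted network.

    networks: dict mapping evidence type -> list of (gene_a, gene_b, confidence),
    with each confidence in [0, 1].
    Returns {(gene_a, gene_b): (combined_score, sorted evidence types)}.
    """
    # Probability that no source supports the edge; starts at 1 for every pair.
    miss = defaultdict(lambda: 1.0)
    support = defaultdict(set)
    for evidence_type, edges in networks.items():
        for a, b, conf in edges:
            if a == b:
                continue  # ignore self-interactions in this sketch
            key = tuple(sorted((a, b)))  # treat edges as undirected
            miss[key] *= 1.0 - conf
            support[key].add(evidence_type)
    return {key: (round(1.0 - miss[key], 3), sorted(support[key]))
            for key in miss}


# Hypothetical toy input with three evidence types.
networks = {
    "coexpression": [("geneA", "geneB", 0.5)],
    "physical": [("geneB", "geneA", 0.5), ("geneB", "geneC", 0.9)],
    "regulatory": [("geneC", "geneC", 0.7)],
}
integrated = integrate_networks(networks)
for (a, b), (score, sources) in sorted(integrated.items()):
    print(a, b, score, ",".join(sources), sep="\t")
```

Under this rule an edge seen in two independent sources scores higher than either source alone, and the list of supporting evidence types is kept so that the origin of each edge remains traceable.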

Each of these three studies required rapid, reproducible, and diverse data collection and integration, and thus demonstrates the utility of ARepA for any lab wishing to combine its own data with public data repositories. We provide an implementation of ARepA that can access ’omics data from multiple data sources in a consistent and robust fashion. Furthermore, it can be configured to process data only from certain sources, of specified types, or from particular organisms of interest, depending on the user’s requirements. Currently, it supports seven repositories: the Gene Expression Omnibus (GEO) (Barrett et al., 2011; Barrett et al., 2013), IntAct (Kerrien et al., 2012; Orchard et al., 2014), BioGRID (Stark et al., 2011), RegulonDB (Gama-Castro et al., 2011), STRING (Szklarczyk et al., 2011), Bacteriome (Su et al., 2008), and MPIDB (Goll et al., 2008) (Table 1), from which data are retrieved, processed, and standardized for any organism of interest. At present, this produces more than 2 M interactions from ∼600 K physical interactions (Bacteriome, BioGRID, IntAct, and MPIDB), ∼4 K regulatory interactions (RegulonDB) and 1.5 M functional associations (STRING), as well as more than 2.7 K expression data sets covering more than 800 K samples (GEO). When retrieving data from six model organisms, ARepA currently produces >630 K interactions and >15 K gene expression data sets (>530 K samples), accompanied by corresponding metadata and derived molecular networks (Table 2 and Fig. 1).
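The idea of restricting a run to certain sources, data types, or organisms can be sketched as a simple filter over the seven supported repositories. The grouping of repositories by data type follows the paragraph above; everything else (the `plan_retrieval` function, its parameters, and the task tuples) is hypothetical and does not reflect ARepA's actual configuration mechanism. The sketch also does not model which repositories actually hold data for a given organism.

```python
# Repository -> data type, as grouped in the text above.
REPOSITORIES = {
    "GEO": "expression",
    "IntAct": "physical",
    "BioGRID": "physical",
    "Bacteriome": "physical",
    "MPIDB": "physical",
    "RegulonDB": "regulatory",
    "STRING": "functional",
}


def plan_retrieval(organisms, sources=None, data_types=None):
    """List (repository, data_type, organism) tasks matching the filters.

    A filter left as None means "no restriction" (hypothetical convention).
    """
    unknown = set(sources or ()) - set(REPOSITORIES)
    if unknown:
        raise ValueError(f"unsupported repositories: {sorted(unknown)}")
    tasks = []
    for repo, dtype in REPOSITORIES.items():
        if sources is not None and repo not in sources:
            continue
        if data_types is not None and dtype not in data_types:
            continue
        tasks.extend((repo, dtype, organism) for organism in organisms)
    return tasks


# Example: only regulatory and functional-association data for one organism.
tasks = plan_retrieval(["Escherichia coli"],
                       data_types={"regulatory", "functional"})
for task in tasks:
    print(*task, sep="\t")
```

Validating the requested sources up front means a misspelled repository name fails immediately instead of silently producing an empty retrieval plan.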

