SigWin-detector: a Grid-enabled workflow for discovering enriched windows of genomic features related to DNA sequences.

Inda MA, van Batenburg MF, Roos M, Belloum AS, Vasunin D, Wibisono A, van Kampen AH, Breit TM - BMC Res Notes (2008)

Bottom Line: Here we apply an e-Science approach to develop SigWin-detector, a workflow-based tool that can detect significantly enriched windows of (genomic) features in a (DNA) sequence in a fast and reproducible way. As we show with the results from analyses in the biological use case on RIDGEs, SigWin-detector is an efficient and reusable Grid-based tool for discovering windows enriched for features of a particular type in any sequence of values. Thus, SigWin-detector provides the proof-of-principle for the modular e-Science based concept of integrative bioinformatics experimentation.

Affiliation: Integrative Bioinformatics Unit, Swammerdam Institute for Life Sciences, Faculty of Science, University of Amsterdam, PO Box 94062, 1090 GB Amsterdam, The Netherlands. inda@science.uva.nl

ABSTRACT

Background: Chromosome location is often used as a scaffold to organize genomic information in both the living cell and molecular biological research. Thus, ever-increasing amounts of data about genomic features are stored in public databases and can be readily visualized by genome browsers. To perform in silico experimentation conveniently with these genomics data, biologists need tools to process and compare datasets routinely and to explore the obtained results interactively. The complexity of such experimentation requires these tools to be based on an e-Science approach, and hence to be generic, modular, and reusable. A virtual laboratory environment with workflows, workflow management systems, and Grid computation is therefore essential.

Findings: Here we apply an e-Science approach to develop SigWin-detector, a workflow-based tool that can detect significantly enriched windows of (genomic) features in a (DNA) sequence in a fast and reproducible way. For proof-of-principle, we utilize a biological use case to detect regions of increased and decreased gene expression (RIDGEs and anti-RIDGEs) in human transcriptome maps. We improved the original method for RIDGE detection by replacing the costly step of estimation by random sampling with a faster analytical formula for computing the distribution of the hypothesis being tested and by developing a new algorithm for computing moving medians. SigWin-detector was developed using the WS-VLAM workflow management system and consists of several reusable modules that are linked together in a basic workflow. The configuration of this basic workflow can be adapted to satisfy the requirements of the specific in silico experiment.
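
To illustrate why an analytical formula can stand in for random sampling, the sketch below computes the exact distribution of a window median under the hypothesis that ordering is random. It uses a standard combinatorial result for sampling without replacement, assuming an odd window size and distinct values; whether this matches the authors' formula term for term is an assumption, and the function name `median_rank_pmf` is hypothetical.

```python
from math import comb


def median_rank_pmf(n_total: int, window: int) -> list[float]:
    """Null distribution of a window median, expressed over ranks.

    Entry k-1 is the probability that the median of `window` values drawn
    without replacement from `n_total` distinct values is the k-th smallest.
    Assumes an odd window size (illustrative simplification).
    """
    if window % 2 == 0 or not 0 < window <= n_total:
        raise ValueError("window must be odd and between 1 and n_total")
    half = window // 2
    total = comb(n_total, window)
    # Pick `half` values ranked below k and `half` values ranked above k.
    return [
        comb(k - 1, half) * comb(n_total - k, half) / total
        for k in range(1, n_total + 1)
    ]


pmf = median_rank_pmf(n_total=5, window=3)
print(pmf)
```

The probabilities sum to one and are symmetric around the middle rank, so no sampling runs are needed to obtain the null distribution for a given window size.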

Conclusion: As we show with the results from analyses in the biological use case on RIDGEs, SigWin-detector is an efficient and reusable Grid-based tool for discovering windows enriched for features of a particular type in any sequence of values. Thus, SigWin-detector provides the proof-of-principle for the modular e-Science based concept of integrative bioinformatics experimentation.

Figure 1: Using a mmFDR method to detect RIDGEs in a human transcriptome map. Schematic representation of the moving median false discovery rate (mmFDR) procedure identifying regions of high and low density of gene expression (RIDGEs and anti-RIDGEs, respectively) [4]. (A) Input sequence, a human transcriptome map (HTM), i.e., expression values of genes ordered by their chromosome location (cyan; chromosome 6). (B) mm(w), moving medians of the HTM for a given window size S. (C) Determination of the high and low mmFDR thresholds at a given level α: The high threshold mk is the smallest gene expression value for which the estimated false discovery rate, i.e., the sum of f(m) over m ≥ mk divided by the sum of g(m) over m ≥ mk, is below α; here f(m) is the theoretical probability distribution of mm(w), and g(m) is the observed distribution of mm(w). (In [4], f(m) is estimated by simple sampling.) Similarly, the low threshold mj is the largest gene expression value for which the corresponding ratio of sums over m ≤ mj is below α. (D) Selection of significant windows in chromosome 6: RIDGEs (in red) are all windows for which the median gene expression is higher than or equal to mk; anti-RIDGEs (in blue) are all windows for which the median gene expression is lower than or equal to mj. (E) Output RIDGEOGRAM of chromosome 6. Each row (y-axis) in the RIDGEOGRAM represents a window size, ranging from S = 3 to S = M (the number of genes on the chromosome). Each column (x-axis) represents a sliding window number, ranging from w = S/2 to w = M-S/2 (hence the triangular form). Color is used to mark window medians significantly above (red) or below (blue) the genome-wide median. The scheme shows median expression data for window size S = 69 and FDR threshold level α = 5%.

Mentions: Versteeg et al. [4] identified clusters where the median expression level of the genes involved is significantly higher than expected (RIDGEs), using a moving median false discovery rate (mmFDR) procedure (Figure 1). The mmFDR procedure identifies RIDGEs by testing the input gene expression against the hypothesis that the position of the genes on the chromosomes does not affect their expression levels. The same procedure can be used to identify significant windows related to any genomic feature mapped to DNA sequences, i.e., windows in the input sequence whose median value deviates significantly from what is expected if the ordering of the numbers in the input sequence is random. In an even wider scope, it can also be used to identify significant windows in any sequence of numbers.
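
The procedure described above can be sketched on a toy sequence of numbers. This is a minimal illustration, not the SigWin-detector implementation: the moving median is computed naively rather than with the authors' faster algorithm, the function names are hypothetical, and the FDR criterion used here (null upper tail divided by observed upper tail, compared with α) is an assumption about the exact form of the threshold rule.

```python
from statistics import median


def moving_medians(values, window):
    """Median of every run of `window` consecutive values (naive version)."""
    return [median(values[i:i + window]) for i in range(len(values) - window + 1)]


def high_threshold(null_pmf, observed, alpha):
    """Smallest value m whose estimated FDR is below alpha, or None.

    null_pmf maps each candidate value to its probability under the
    hypothesis that ordering is random; observed holds the moving medians.
    """
    for m in sorted(null_pmf):
        null_tail = sum(p for v, p in null_pmf.items() if v >= m)
        observed_tail = sum(1 for x in observed if x >= m) / len(observed)
        if observed_tail > 0 and null_tail / observed_tail < alpha:
            return m
    return None


def significant_windows(observed, threshold):
    """Indices of windows whose median is at or above the high threshold."""
    if threshold is None:
        return []
    return [i for i, x in enumerate(observed) if x >= threshold]


# Toy input: five distinct values, window size 3.
expression = [1, 4, 5, 3, 2]
# Exact null distribution of the median of 3 values drawn from these 5.
null_pmf = {1: 0.0, 2: 0.3, 3: 0.4, 4: 0.3, 5: 0.0}

medians = moving_medians(expression, window=3)
m_k = high_threshold(null_pmf, medians, alpha=0.5)
print(medians, m_k, significant_windows(medians, m_k))
```

The low threshold for anti-RIDGE-like windows would follow the same pattern with the inequalities reversed. The large α in the toy call is chosen only so that such a short sequence yields a threshold at all.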

