glbase: a framework for combining, analyzing and displaying heterogeneous genomic and high-throughput sequencing data.

Hutchins AP, Jauch R, Dyla M, Miranda-Saavedra D - Cell Regen (Lond) (2014)

Bottom Line: Genomic datasets and the tools to analyze them have proliferated at an astonishing rate. Here we present glbase, a framework that uses a flexible set of descriptors that can quickly parse non-binary data files. glbase includes many functions to intersect two lists of data, including operations on genomic interval data and support for the efficient random access to huge genomic data files. Many glbase functions can produce graphical outputs, including scatter plots, heatmaps, boxplots and other common analytical displays of high-throughput data such as RNA-seq, ChIP-seq and microarray expression data. glbase is designed to rapidly bring biological data into a Python-based analytical environment to facilitate analysis and data processing.


Affiliation: Key Laboratory of Regenerative Biology, South China Institute for Stem Cell Biology and Regenerative Medicine, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, 510530 China.

ABSTRACT
Genomic datasets and the tools to analyze them have proliferated at an astonishing rate. However, such tools are often poorly integrated with each other: each program typically produces its own custom output in a variety of non-standard file formats. Here we present glbase, a framework that uses a flexible set of descriptors that can quickly parse non-binary data files. glbase includes many functions to intersect two lists of data, including operations on genomic interval data and support for the efficient random access to huge genomic data files. Many glbase functions can produce graphical outputs, including scatter plots, heatmaps, boxplots and other common analytical displays of high-throughput data such as RNA-seq, ChIP-seq and microarray expression data. glbase is designed to rapidly bring biological data into a Python-based analytical environment to facilitate analysis and data processing. In summary, glbase is a flexible and multifunctional toolkit that allows the combination and analysis of high-throughput data (especially next-generation sequencing and genome-wide data), and which has been instrumental in the analysis of complex data sets. glbase is freely available at http://bitbucket.org/oaxiom/glbase/.


Figure 2: Example graphical output from glbase. Code and raw data can be found in the glbase directory (glbase/examples/). (A) Frequency of the STAT3 DNA-binding word (‘TTCnnnGAA’) in a list of STAT3 ChIP-seq binding sites, compared to a randomly selected background from the control ChIP-seq sample. (B) Heatmap of the top 20 and bottom 20 up- and down-regulated transcription factors when macrophages are stimulated with IL-10. (C) Scatter plot of RNA-seq data. (D) Genomic distribution of STAT3 binding in IL-10 stimulated macrophages. (E) Average phastCons evolutionary conservation score around a list of Sox2-Oct4 ChIP-seq binding peaks. (F) Heatmap of p300 recruitment in mouse ES cells for a list of Sox2-Oct4 ChIP-seq binding peaks. Raw data come from the GEO accessions GSE31531 [14], the ENCODE project [16], GSE11431 [15] and the phastCons measure of evolutionary conservation [12]. Transcription factor annotation was based on the DNA-binding domain database [17].

Mentions: In addition to acting as a universal file format converter, a second major utility of glbase is to act as the ‘glue’ between upstream and downstream analysis tools, for instance to get from a list of ChIP-seq peaks and gene expression values to heatmaps, gene-peak associations and other informative plots. As an example of usage, glbase includes a tool for finding words in FASTA-formatted DNA sequences: Figure 2A shows an example of the frequency of the STAT3 DNA-binding motif (word) ‘TTCnnnGAA’ in a list of STAT3 ChIP-seq binding data [14]. For any key in a genelist, the frequency of its values can be displayed as a pie chart. glbase can also deal with expression data through the derived genelist-like object ‘expression’, which contains methods for drawing heatmaps (Figure 2B), histograms, boxplots and scatter plots (Figure 2C), as well as for transforming the expression data (fold-change, log-transform, normalize, etc.). Expression data and ChIP-seq data can be combined to produce density maps of ChIP-seq binding against changes in gene expression or to annotate scatter plots. ChIP-seq data can be compared against any set of genomic annotations, for example gene transcription start sites, to produce a breakdown of distances from the binding site to the transcription start site. Figure 2D shows the distribution of STAT3 binding sites in IL-10 stimulated macrophages relative to the nearest transcription start site [14]. Phylogenetic data (e.g. phastCons scores of evolutionary conservation, although any type of numeric data can be used) can be loaded into an SQL database by glbase and then pileups can be visualized (Figure 2E). Similarly, sequence reads can be converted by glbase into an SQL database for efficient retrieval of the reads across arbitrary genomic locations. Figure 2F shows a heatmap of the density of sequence tag reads from a p300 ChIP-seq library centered on a list of Sox2-Oct4 bound regions in embryonic stem cells [15, 16].
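
Two of the analyses described above, counting a DNA word in a set of sequences (as in Figure 2A) and measuring the distance from a binding site to the nearest transcription start site (as in Figure 2D), can be illustrated with a minimal sketch in plain Python. This does not use the glbase API, and it is not how glbase implements these steps: the function names `count_word` and `nearest_tss_distance`, the toy sequences and the coordinates are all hypothetical, and the sketch assumes that ‘n’ matches any of A, C, G or T and that all positions lie on a single chromosome.

```python
import re
from bisect import bisect_left


def count_word(sequences, word):
    """Count occurrences of a DNA word in each sequence.

    Assumption: 'n' in the word matches any one of A, C, G or T.
    Only the given strand is scanned, which is sufficient for a
    reverse-palindromic word such as 'TTCnnnGAA'.
    """
    pattern = "".join("[ACGT]" if base in "nN" else base.upper() for base in word)
    # A zero-width lookahead lets overlapping occurrences be counted too.
    regex = re.compile("(?=" + pattern + ")")
    return [len(regex.findall(seq.upper())) for seq in sequences]


def nearest_tss_distance(peak_pos, sorted_tss):
    """Absolute distance (bp) from a peak position to the nearest TSS.

    Assumption: sorted_tss is a non-empty, ascending list of TSS
    coordinates on the same chromosome as the peak.
    """
    i = bisect_left(sorted_tss, peak_pos)
    # Only the neighbours immediately left and right can be the nearest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(tss - peak_pos) for tss in candidates)


# Hypothetical toy data, for illustration only.
peak_sequences = ["TTCAAAGAA", "ACGT", "ttcccggaattcaaagaa"]
counts = count_word(peak_sequences, "TTCnnnGAA")
fraction_with_word = sum(1 for c in counts if c > 0) / len(counts)

tss_positions = [100, 400, 900]
distances = [nearest_tss_distance(p, tss_positions) for p in (150, 400, 1000)]
```

Binning the resulting distances would give a breakdown of the kind shown in Figure 2D; for real data, the sequences and coordinates would come from the parsed peak and annotation files rather than from literals.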

