ClusTrack: feature extraction and similarity measures for clustering of genome-wide data sets.

Rydbeck H, Sandve GK, Ferkingstad E, Simovski B, Rye M, Hovig E - PLoS ONE (2015)

Bottom Line: We here introduce a general methodology for clustering data sets of coordinates relative to a genome assembly, i.e. genomic tracks. An implementation of the methodology is provided through a tool, ClusTrack, which allows fine-tuned clustering analyses to be specified through a web-based interface. The majority of samples form meaningful subclusters, confirming that the definitions of features and similarity capture biological, rather than technical, variation between the genomic tracks.

Affiliation: Department of Informatics, University of Oslo, Oslo, Norway; Department of Tumour Biology, The Norwegian Radium Hospital, Oslo University Hospital, Oslo, Norway.

ABSTRACT
Clustering is a popular technique for explorative analysis of data, as it can reveal subgroupings and similarities between data in an unsupervised manner. While clustering is routinely applied to gene expression data, there is a lack of appropriate general methodology for clustering of sequence-level genomic and epigenomic data, e.g. ChIP-based data. We here introduce a general methodology for clustering data sets of coordinates relative to a genome assembly, i.e. genomic tracks. By defining appropriate feature extraction approaches and similarity measures, we allow biologically meaningful clustering to be performed for genomic tracks using standard clustering algorithms. An implementation of the methodology is provided through a tool, ClusTrack, which allows fine-tuned clustering analyses to be specified through a web-based interface. We apply our methods to the clustering of occupancy of the H3K4me1 histone modification in samples from a range of different cell types. The majority of samples form meaningful subclusters, confirming that the definitions of features and similarity capture biological, rather than technical, variation between the genomic tracks. Input data and results are available, and can be reproduced, through a Galaxy Pages document at http://hyperbrowser.uio.no/hb/u/hb-superuser/p/clustrack. The clustering functionality is available as a Galaxy tool, under the menu option "Specialized analyzis of tracks", and the submenu option "Cluster tracks based on genome level similarity", at the Genomic HyperBrowser server: http://hyperbrowser.uio.no/hb/.

pone.0123261.g004: Heat map plot for H3K4me1 track clustering, using “Similarity of relations to other sets of genomic features” and 14 Gene Ontology (GO) based reference tracks. The value of a specific cell in the heat map represents the relative aggregated coverage of the corresponding H3K4me1 track (as specified in the row heading) in gene regions associated with a corresponding GO term (as specified in the column heading). The rows of the heat map correspond directly to the feature vectors used to calculate the dendrogram in Fig 3.
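
As a rough illustration of how one row of such a heat map could be derived, the sketch below computes a feature vector for a single track from its overlap with a set of reference tracks. It is not the ClusTrack implementation. It assumes half-open (start, end) intervals on a single chromosome, and it assumes that "relative coverage" means the fraction of each reference track's base pairs covered by the track; the exact normalization is not specified here. All function names and the toy coordinates are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_bp(a, b):
    """Count base pairs shared by two merged, sorted interval lists."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def feature_vector(track, references):
    """One value per reference track, ordered by reference name.

    Assumption: each value is the fraction of the reference track's
    base pairs that the track covers.
    """
    track = merge(track)
    vector = []
    for name in sorted(references):
        ref = merge(references[name])
        ref_bp = sum(end - start for start, end in ref)
        vector.append(overlap_bp(track, ref) / ref_bp if ref_bp else 0.0)
    return vector


# Toy data, made up for illustration only.
h3k4me1_track = [(0, 100), (150, 200)]
go_references = {
    "axonogenesis": [(50, 150)],
    "cytokines": [(180, 280)],
}
print(feature_vector(h3k4me1_track, go_references))
```

Stacking one such vector per H3K4me1 track gives a matrix with tracks as rows and reference tracks as columns, which is the layout the heat map describes.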

Mentions: By using tracks representing genes associated with various gene ontology terms as reference tracks for the “Similarity of relations to other sets of genomic features” case, the H3K4me1 tracks are clustered based on functional enrichment. The dendrogram and heat map obtained when using the 14 selected reference tracks are shown in Figs 3 and 4, respectively. It can be seen that all samples from similar tissues cluster together. Changing the distance measure from Euclidean to Manhattan or maximum did not affect the structure of the dendrogram. The three reference tracks that contributed the most to the separation of the clustered tracks are “axonogenesis”, “caspase activation” and “cytokines”. Brain cells, and also liver cells to a lesser degree, have a higher H3K4me1 proportional occupancy in genes associated with the axonogenesis (nerve cell growth) term than the other cell types. Further, one can observe that genes associated with caspase activation are H3K4me1 enriched in muscle cells. This makes biological sense, since caspases are involved in apoptosis, which can be active in muscles as a response to starvation and inflammation, and can be indicative of muscle differentiation [19]. Finally, genes associated with the term cytokines (immunomodulating agents, such as interleukins and interferons) are enriched for H3K4me1 in immune cells. The dendrogram of the batch script example of this case is shown in Supplementary S3 Fig. Supplementary S2 File lists the subclusters obtained by dividing the dendrogram of Supplementary S3 Fig based on the height of the branches. Fetal and adult brain samples form two subclusters just as they do in S2 Fig, but the relative distance between the two subclusters is smaller. Most of the subclusters found in Supplementary S1 File are also present in the dendrogram generated through this way of clustering.
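
The effect of swapping the distance measure, and of dividing a dendrogram by branch height, can be sketched on toy feature vectors as follows. This is a minimal illustration and not the ClusTrack code: the complete-linkage rule, the function names, the cut height and the sample vectors are all assumptions made for the example.

```python
def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def maximum(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


def cut_clusters(vectors, dist, height):
    """Agglomerative clustering with complete linkage (an assumption),
    stopped once the closest pair of clusters is farther apart than `height`.
    Returns the resulting subclusters as sorted lists of sample names."""
    clusters = [[name] for name in sorted(vectors)]

    def linkage(c1, c2):
        # Complete linkage: distance between the two most distant members.
        return max(dist(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > height:
            break
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)


# Hypothetical feature vectors (axonogenesis, caspase activation, cytokines).
samples = {
    "brain_1": [0.9, 0.1, 0.1],
    "brain_2": [0.8, 0.2, 0.1],
    "muscle_1": [0.1, 0.9, 0.2],
    "immune_1": [0.1, 0.2, 0.9],
}
for dist in (euclidean, manhattan, maximum):
    print(dist.__name__, cut_clusters(samples, dist, height=0.5))
```

On well-separated toy data like this, all three measures yield the same grouping, which is the kind of stability the paragraph above reports. In general the cut height has to be chosen per distance measure, since the three measures live on different scales.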
By comparing the Supplementary S1 and S2 Files, one can see that the fetal brain samples are found in subcluster 2 in both files, the mesenchymal stem cells, breast and fetal lung samples are found in subcluster 3, while the adipose tissue from subcluster 4 and the Mobilized CD34 Primary Cells from subcluster 6 in S1 File are found in subcluster 4 of S2 File. The adult brain samples are found in subclusters 5 and 8 of S1 and S2 Files, respectively. The immune cells are found in subclusters 7 and 9, respectively. In S2 File, liver samples form their own subcluster 7. The “Similarity of relations to other sets of genomic features” approach generates 12 subclusters with many samples, as compared to 9 subclusters for “Similarity of positional distribution along the genome”.
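
A comparison of this kind between two subcluster listings can be made systematic with a cross-tabulation. The sketch below is only an illustration: the sample names are hypothetical placeholders, and the subcluster numbers are limited to the few pairings mentioned in the text, not the contents of the actual supplementary files.

```python
from collections import Counter


def cross_tabulate(assign_a, assign_b):
    """Count samples for each (subcluster in A, subcluster in B) pair.

    Only samples present in both assignments are counted.
    """
    shared = assign_a.keys() & assign_b.keys()
    return Counter((assign_a[s], assign_b[s]) for s in shared)


# Hypothetical sample names; subcluster numbers follow the pairings in the text.
s1_file = {
    "fetal_brain_a": 2, "fetal_brain_b": 2,
    "adipose_a": 4, "cd34_a": 6,
    "adult_brain_a": 5, "immune_a": 7,
}
s2_file = {
    "fetal_brain_a": 2, "fetal_brain_b": 2,
    "adipose_a": 4, "cd34_a": 4,
    "adult_brain_a": 8, "immune_a": 9,
}

table = cross_tabulate(s1_file, s2_file)
for (a, b), n in sorted(table.items()):
    print(f"S1 subcluster {a} -> S2 subcluster {b}: {n} sample(s)")
```

Pairs where several S1 subclusters map to one S2 subcluster, such as 4 and 6 both mapping to 4 here, show where the two clustering approaches merge or split groups of samples.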

