ClusTrack: feature extraction and similarity measures for clustering of genome-wide data sets.

Rydbeck H, Sandve GK, Ferkingstad E, Simovski B, Rye M, Hovig E - PLoS ONE (2015)

Bottom Line: We here introduce a general methodology for clustering data sets of coordinates relative to a genome assembly, i.e. genomic tracks. An implementation of the methodology is provided through a tool, ClusTrack, which allows fine-tuned clustering analyses to be specified through a web-based interface. The majority of samples form meaningful subclusters, confirming that the definitions of features and similarity capture biological, rather than technical, variation between the genomic tracks.


Affiliation: Department of Informatics, University of Oslo, Oslo, Norway; Department of Tumour Biology, The Norwegian Radium Hospital, Oslo University Hospital, Oslo, Norway.

ABSTRACT
Clustering is a popular technique for explorative analysis of data, as it can reveal subgroupings and similarities between data in an unsupervised manner. While clustering is routinely applied to gene expression data, there is a lack of appropriate general methodology for clustering of sequence-level genomic and epigenomic data, e.g. ChIP-based data. We here introduce a general methodology for clustering data sets of coordinates relative to a genome assembly, i.e. genomic tracks. By defining appropriate feature extraction approaches and similarity measures, we allow biologically meaningful clustering to be performed for genomic tracks using standard clustering algorithms. An implementation of the methodology is provided through a tool, ClusTrack, which allows fine-tuned clustering analyses to be specified through a web-based interface. We apply our methods to the clustering of occupancy of the H3K4me1 histone modification in samples from a range of different cell types. The majority of samples form meaningful subclusters, confirming that the definitions of features and similarity capture biological, rather than technical, variation between the genomic tracks. Input data and results are available, and can be reproduced, through a Galaxy Pages document at http://hyperbrowser.uio.no/hb/u/hb-superuser/p/clustrack. The clustering functionality is available as a Galaxy tool, under the menu option "Specialized analyzis of tracks", and the submenu option "Cluster tracks based on genome level similarity", at the Genomic HyperBrowser server: http://hyperbrowser.uio.no/hb/.


Fig 3 (pone.0123261.g003): Dendrogram of H3K4me1 track clustering using “Similarity of relations to other sets of genomic features” and 14 gene ontology reference tracks. All sample pairs were placed into separate subclusters per tissue. The separation of sample pairs into their own subclusters was stronger in this case than in Figs 1 and 2. The clustering was here performed according to the canonical case “Similarity of relations to other sets of genomic features”. Features were defined based on relative aggregated coverage (as in Fig 2) across non-consecutive sets of genome positions. Each feature was aggregated across a set of positions defined by a particular reference track. The reference tracks were again based on gene regions associated with 14 randomly selected Gene Ontology (GO) terms. Standard Euclidean distance was used as the distance between two feature vectors. Clustering was performed using standard hierarchical clustering with the average linkage criterion.
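The feature definition in the caption above can be illustrated with a minimal sketch. It assumes that "relative aggregated coverage" means the fraction of a reference track's base pairs that are covered by the sample track, and that tracks are lists of half-open intervals on a single chromosome. Both are simplifying assumptions made for illustration. The function names (`merge`, `overlap_bp`, `feature_vector`, `euclidean`) are hypothetical and are not part of ClusTrack or the Genomic HyperBrowser.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open [start, end), single chromosome (assumption)


def merge(intervals: Sequence[Interval]) -> List[Interval]:
    """Sort intervals and merge overlapping or touching ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_bp(a: Sequence[Interval], b: Sequence[Interval]) -> int:
    """Number of base pairs covered by both interval lists."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def feature_vector(track: Sequence[Interval],
                   reference_tracks: Sequence[Sequence[Interval]]) -> List[float]:
    """One feature per reference track: covered fraction of that reference."""
    features = []
    for ref in reference_tracks:
        ref_bp = sum(end - start for start, end in merge(ref))
        features.append(overlap_bp(track, ref) / ref_bp if ref_bp else 0.0)
    return features


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Standard Euclidean distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Toy coordinates, invented purely for illustration.
reference_tracks = [[(0, 100)], [(200, 300), (400, 500)]]
track_a = [(0, 50), (250, 300)]
track_b = [(90, 100), (200, 300), (400, 480)]
distance_ab = euclidean(feature_vector(track_a, reference_tracks),
                        feature_vector(track_b, reference_tracks))
```

Each reference track contributes exactly one dimension, so 14 reference tracks would give a 14-dimensional feature vector per sample under this reading.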

Mentions: By using tracks representing genes associated with various gene ontology terms as reference tracks for the “Similarity of relations to other sets of genomic features” case, the H3K4me1 tracks are clustered based on functional enrichment. The dendrogram and heat map obtained when using the 14 selected reference tracks are shown in Figs 3 and 4, respectively. It can be seen that all samples from similar tissues cluster together. Changing the distance measure from Euclidean to Manhattan or maximum did not affect the structure of the dendrogram. The three reference tracks that contributed the most to the separation of the clustered tracks are “axonogenesis”, “caspase activation” and “cytokines”. Brain cells, and to a lesser degree liver cells, have a higher proportional H3K4me1 occupancy in genes associated with the axonogenesis (nerve cell growth) term than the other cell types. Further, one can observe that genes associated with caspase activation are H3K4me1-enriched in muscle cells. This makes biological sense, since caspases are involved in apoptosis, which can be active in muscles in response to starvation and inflammation, and can be indicative of muscle differentiation [19]. Finally, genes associated with the term cytokines (immunomodulating agents, such as interleukins and interferons) are enriched for H3K4me1 in immune cells. The dendrogram of the batch script example of this case is shown in Supplementary S3 Fig. Supplementary S2 File lists the subclusters obtained by dividing the dendrogram of Supplementary S3 Fig based on the height of the branches. Fetal and adult brain samples form two subclusters just as they do in S2 Fig, but the relative distance between the two subclusters is smaller. Most of the subclusters found in Supplementary S1 File are also present in the dendrogram generated through this way of clustering.
By comparing the Supplementary S1 and S2 Files, one can see that the fetal brain samples are found in subcluster 2 in both files, the mesenchymal stem cells, breast and fetal lung samples are found in subcluster 3, while the adipose tissue from subcluster 4 and the Mobilized CD34 Primary Cells from subcluster 6 in S1 File are found in subcluster 4 of S2 File. The adult brain samples are found in subclusters 5 and 8, respectively. The immune cells are found in subclusters 7 and 9, respectively. In S2 File, liver samples form their own subcluster 7. The “Similarity of relations to other sets of genomic features” case generates 12 subclusters with many samples, as compared to 9 subclusters for “Similarity of positional distribution along the genome”.
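The clustering step described above (average linkage, a choice of Euclidean, Manhattan or maximum distance, and subclusters obtained by dividing the dendrogram by branch height) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the ClusTrack implementation. The function names, the sample labels, the feature values and the cut height of 0.4 are all hypothetical and do not come from the paper's data.

```python
from itertools import combinations
from math import sqrt
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Distance = Callable[[Vector, Vector], float]


def euclidean(u: Vector, v: Vector) -> float:
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def manhattan(u: Vector, v: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(u, v))


def maximum(u: Vector, v: Vector) -> float:
    return max(abs(x - y) for x, y in zip(u, v))


def average_linkage(vectors: Dict[str, Vector], dist: Distance,
                    cut_height: float) -> List[List[str]]:
    """Agglomerative clustering with the average linkage criterion.

    Merging stops once the two closest clusters are farther apart than
    cut_height. Average-linkage merge heights never decrease, so this
    matches cutting the full dendrogram at that height.
    """
    clusters: List[List[str]] = [[name] for name in vectors]

    def linkage(c1: List[str], c2: List[str]) -> float:
        # Mean pairwise distance between members of the two clusters.
        total = sum(dist(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) > cut_height:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Invented feature vectors (three hypothetical reference tracks per sample).
samples = {
    "brain_1": [0.9, 0.1, 0.1], "brain_2": [0.8, 0.2, 0.1],
    "muscle_1": [0.1, 0.9, 0.2], "muscle_2": [0.2, 0.8, 0.1],
    "immune_1": [0.1, 0.1, 0.9], "immune_2": [0.2, 0.1, 0.8],
}
partitions = {d.__name__: average_linkage(samples, d, cut_height=0.4)
              for d in (euclidean, manhattan, maximum)}
```

On these toy vectors the three distance measures yield the same partition. That only shows how such a comparison can be set up and says nothing about the real H3K4me1 tracks.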

