ClusTrack: feature extraction and similarity measures for clustering of genome-wide data sets.

Rydbeck H, Sandve GK, Ferkingstad E, Simovski B, Rye M, Hovig E - PLoS ONE (2015)

Bottom Line: We here introduce a general methodology for clustering data sets of coordinates relative to a genome assembly, i.e. genomic tracks. An implementation of the methodology is provided through a tool, ClusTrack, which allows fine-tuned clustering analyses to be specified through a web-based interface. The majority of samples form meaningful subclusters, confirming that the definitions of features and similarity capture biological, rather than technical, variation between the genomic tracks.


Affiliation: Department of Informatics, University of Oslo, Oslo, Norway; Department of Tumour Biology, The Norwegian Radium Hospital, Oslo University Hospital, Oslo, Norway.

ABSTRACT
Clustering is a popular technique for explorative analysis of data, as it can reveal subgroupings and similarities between data in an unsupervised manner. While clustering is routinely applied to gene expression data, there is a lack of appropriate general methodology for clustering of sequence-level genomic and epigenomic data, e.g. ChIP-based data. We here introduce a general methodology for clustering data sets of coordinates relative to a genome assembly, i.e. genomic tracks. By defining appropriate feature extraction approaches and similarity measures, we allow biologically meaningful clustering to be performed for genomic tracks using standard clustering algorithms. An implementation of the methodology is provided through a tool, ClusTrack, which allows fine-tuned clustering analyses to be specified through a web-based interface. We apply our methods to the clustering of occupancy of the H3K4me1 histone modification in samples from a range of different cell types. The majority of samples form meaningful subclusters, confirming that the definitions of features and similarity capture biological, rather than technical, variation between the genomic tracks. Input data and results are available, and can be reproduced, through a Galaxy Pages document at http://hyperbrowser.uio.no/hb/u/hb-superuser/p/clustrack. The clustering functionality is available as a Galaxy tool, under the menu option "Specialized analyzis of tracks", and the submenu option "Cluster tracks based on genome level similarity", at the Genomic HyperBrowser server: http://hyperbrowser.uio.no/hb/.

Fig 2. Dendrogram of H3K4me1 track clustering using “Similarity of positional distribution along the genome” for the genomic region chr1-22. As in Fig 1, all sample pairs, except the one from brain, are placed into separate subclusters per tissue. The only difference from the dendrogram of Fig 1 is the length of the branches. The clustering was performed here according to the canonical case “Similarity of positional distribution along the genome”. Features were defined as the relative aggregated coverage in bins (of consecutive positions) along the genome: the raw base pair coverage in a bin was divided by the total base pair coverage of the track across the genome (to normalize for varying overall coverage densities between tracks). Bins were defined as consecutive 1 Mbp regions along all autosomes (set in the “region&scale” section of the interface). Standard Euclidean distance was used as the distance between two feature vectors. Clustering was performed using standard hierarchical clustering with the average linkage criterion.
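
The feature definition in this caption can be illustrated with a small sketch. It is not ClusTrack's implementation: the function names, the single toy "chromosome" of 300 bp, the 100 bp bins and the three tracks are all assumptions chosen to keep the arithmetic easy to follow. Each track is reduced to a vector of per-bin coverage divided by the track's total coverage, and two tracks are then compared by Euclidean distance.

```python
from math import sqrt


def relative_bin_coverage(segments, genome_length, bin_size):
    """Per-bin base-pair coverage divided by the track's total coverage.

    `segments` is a list of non-overlapping, half-open (start, end)
    intervals that lie within a toy genome of `genome_length` positions.
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    coverage = [0] * n_bins
    for start, end in segments:
        pos = start
        while pos < end:
            b = pos // bin_size
            # Count only the part of the segment that falls inside bin b.
            bin_end = min((b + 1) * bin_size, end)
            coverage[b] += bin_end - pos
            pos = bin_end
    total = sum(coverage)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in coverage]


def euclidean(u, v):
    """Standard Euclidean distance between two feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical tracks: A and B have similar positional distributions,
# C concentrates all of its coverage in the last bin.
tracks = {
    "A": [(0, 50), (120, 180), (250, 300)],
    "B": [(10, 60), (130, 170), (260, 300)],
    "C": [(200, 300)],
}
features = {name: relative_bin_coverage(segs, 300, 100)
            for name, segs in tracks.items()}

dist_ab = euclidean(features["A"], features["B"])
dist_ac = euclidean(features["A"], features["C"])
print(features)
print(dist_ab, dist_ac)
```

Because every vector sums to 1, a track with many covered base pairs and a track with few can still end up close together if their coverage is distributed similarly along the genome, which is the purpose of the normalization described above.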

Mentions: The dendrogram obtained with the “Similarity of positional distribution along the genome” case, using the relative coverage per 1 Mb bin, is shown in Fig 2. Except for having shorter branches, the dendrogram looks the same as the one resulting from “Direct sequence-level similarity”. Using “manhattan distance” did not change the structure of the dendrogram, while using “maximum distance” positioned the brain samples closer together. The dendrogram resulting from the larger number of tracks, as specified in the batch script, is shown in Supplementary S2 Fig. To better appreciate the degree of co-clustering of samples of similar kind, the dendrogram of Supplementary S2 Fig was divided into subclusters based on the height of the branches. Supplementary S1 File lists these subclusters. Notably, the added fetal and adult samples distribute to their respective subclusters (2 and 5). Subcluster 1 contains only blood samples. Subcluster 2 contains samples from fetal or embryonic brain, including “UCSF-UBC.Fetal_Brain.HuFNSC01” from the GUI example, and also neural stem cells and two embryonic stem cell elements. Subcluster 3 is large, and is dominated by mesenchymal stem cells. Subcluster 4 contains samples from kidney, liver, muscle and adipose tissues. Subcluster 5 contains samples exclusively from adult brain. Subclusters 6 and 7 contain samples exclusively from immune cells. Subcluster 14 contains stem cells and subcluster 16 contains samples from breast.
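
The two operations mentioned in this passage, swapping the distance measure and dividing a dendrogram into subclusters by branch height, can be sketched on toy data. The code below is a naive illustration, not the tool's code: the four feature vectors, the cut height of 0.2 and all function names are hypothetical, and average linkage is written out by hand so that the block depends only on the standard library.

```python
from math import dist as euclidean  # Python 3.8+


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def maximum(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


def average_linkage(vectors, dist):
    """Naive agglomerative clustering with the average linkage criterion.

    Returns the merges in order as (members_a, members_b, height).
    """
    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist(vectors[i], vectors[j])
                   for i in a for j in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        height, a, b = min(
            ((linkage(a, b), a, b)
             for k, a in enumerate(clusters) for b in clusters[k + 1:]),
            key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, height))
    return merges


def cut_by_height(n, merges, height):
    """Subclusters that remain when only merges at or below `height` are kept."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for a, b, h in merges:
        if h <= height:
            parent[find(b[0])] = find(a[0])
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical relative-coverage vectors: tracks 0 and 1 are similar,
# as are tracks 2 and 3.
vectors = [[0.50, 0.30, 0.20], [0.48, 0.32, 0.20],
           [0.10, 0.20, 0.70], [0.15, 0.20, 0.65]]

results = {}
for name, metric in [("euclidean", euclidean),
                     ("manhattan", manhattan),
                     ("maximum", maximum)]:
    merges = average_linkage(vectors, metric)
    results[name] = cut_by_height(len(vectors), merges, 0.2)
    print(name, results[name])
```

On this toy input all three measures give the same two subclusters; with real tracks the choice of measure can move individual samples, as reported above for the brain samples under “maximum distance”.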

