Linear normalised hash function for clustering gene sequences and identifying reference sequences from multiple sequence alignments.

Helal M, Kong F, Chen SC, Zhou F, Dwyer DE, Potter J, Sintchenko V - Microb Inform Exp (2012)

Bottom Line: However, defining the optimal number of clusters, cluster density and boundaries for sets of potentially related sequences of genes with variable degrees of polymorphism remains a significant challenge. The method was evaluated using sets of closely related (16S rRNA gene sequences of Nocardia species) and highly variable (VP1 genomic region of Enterovirus 71) sequences and outperformed existing unsupervised machine learning clustering methods and dimensionality reduction methods. This method does not require prior knowledge of the number of clusters or the distance between clusters, handles clusters of different sizes and shapes, and scales linearly with the dataset.


Affiliation: Sydney Emerging Infections and Biosecurity Institute, Sydney Medical School - Westmead, University of Sydney, Sydney, New South Wales, Australia. vitali.sintchenko@swahs.health.nsw.gov.au.

ABSTRACT

Background: Comparative genomics has put additional demands on the assessment of similarity between sequences and on their clustering as a means of classification. However, defining the optimal number of clusters, cluster density and boundaries for sets of potentially related sequences of genes with variable degrees of polymorphism remains a significant challenge. The aim of this study was to develop a method that would identify the cluster centroids and the optimal number of clusters for a given sensitivity level and could work equally well across different sequence datasets.

Results: A novel method that combines the linear mapping hash function and multiple sequence alignment (MSA) was developed. This method takes advantage of the fact that sequences in the MSA output are already sorted by similarity, and identifies the optimal number of clusters, the cluster cut-offs, and the cluster centroids that can represent reference gene vouchers for the different species. The linear mapping hash function maps a distance matrix that is already ordered by similarity to indices, revealing gaps in the values around which the optimal cut-offs between clusters can be identified. The method was evaluated using sets of closely related (16S rRNA gene sequences of Nocardia species) and highly variable (VP1 genomic region of Enterovirus 71) sequences and outperformed existing unsupervised machine learning clustering methods and dimensionality reduction methods. This method does not require prior knowledge of the number of clusters or the distance between clusters, handles clusters of different sizes and shapes, and scales linearly with the dataset.
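The gap-finding idea described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the bucket formula, and the use of `n_buckets` as a stand-in for the sensitivity level are all assumptions made for illustration. Values that are already ordered are mapped linearly to integer indices, and a new cluster is started wherever two consecutive values land in non-adjacent indices.

```python
def linear_bucket(value, lo, hi, n_buckets):
    """Map a value in [lo, hi] linearly to an integer index in [0, n_buckets - 1].

    Hypothetical helper: the exact hash used in the paper is not given here.
    """
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_buckets)
    return min(idx, n_buckets - 1)  # the maximum value falls in the last bucket


def split_at_gaps(ordered_values, n_buckets):
    """Split an already ordered list into clusters wherever consecutive
    values are separated by at least one empty bucket."""
    if not ordered_values:
        return []
    lo, hi = min(ordered_values), max(ordered_values)
    clusters = [[ordered_values[0]]]
    prev = linear_bucket(ordered_values[0], lo, hi, n_buckets)
    for v in ordered_values[1:]:
        cur = linear_bucket(v, lo, hi, n_buckets)
        if abs(cur - prev) > 1:  # a gap in the indices marks a cut-off
            clusters.append([])
        clusters[-1].append(v)
        prev = cur
    return clusters


# Made-up distances, already sorted as they would be after an MSA.
distances = [0.00, 0.01, 0.02, 0.20, 0.21, 0.50, 0.52, 0.53]
clusters = split_at_gaps(distances, n_buckets=10)
```

In this sketch a single pass over the ordered values is enough, which is consistent with the linear scaling claimed above. A larger `n_buckets` would make the split more sensitive to small gaps.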

Conclusions: The combination of MSA with the linear mapping hash function is a computationally efficient way of gene sequence clustering and can be a valuable tool for the assessment of similarity, clustering of different microbial genomes, identifying reference sequences, and for the study of evolution of bacteria and viruses.


Figure 4: The linear mapping clustering for the 109 EV71 VP1 sequences shown on the first and second PCA coordinates. Eleven clusters corresponding to the genogroups/subgenogroups are presented. The legend indicates EV71 VP1 genogroups/subgenogroups, years of isolation of the sequenced viruses or of their deposition in GenBank, and country of origin. Relevant cluster centroids are highlighted in red. Isolate AF 119795 (&), belonging to C/B genogroups, is a result of intergenotypic recombination [41]. Abbreviations: AUS, Australia; CHN, the Chinese mainland; JAP, Japan; MAL, Malaysia; NOR, Norway; SK, South Korea; SA, South Africa; SIN, Singapore; TW, Taiwan; UK, United Kingdom; USA, United States of America.

Mentions: The implementation of our method produces mean, median, and standard variation distance measures for each cluster. If the training dataset is already composed of a known number of classes (bacterial species or viral genotypes [genogroups/subgenogroups] in our case), the number of clusters closest to that known number should represent the optimal one. Otherwise, the optimal sensitivity parameters can be identified by running the method several times and selecting the parameters that minimize the standard variation of intra-cluster similarity. The PCA plot for the first dataset of 110 EV71 sequences is presented in Figure 4 and illustrates clusters of different sizes and shapes, demonstrating that the method is not likely to be sensitive to outliers. Centroids do not necessarily lie at the geometric centre of their cluster in the plot because they are calculated from the distance sub-matrix of the cluster and not from the principal component transformation that is used for the two-dimensional plot.
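The per-cluster measures and the centroid calculation can be sketched in Python under stated assumptions. The text does not say how the centroid is derived from the distance sub-matrix, so this sketch assumes a medoid (the member with the smallest summed distance to the other members), and it reads "standard variation" as the population standard deviation. The function names and the toy matrix are hypothetical.

```python
from itertools import combinations
from statistics import mean, median, pstdev


def cluster_summary(dist, members):
    """Summarise the pairwise distances inside one cluster.

    dist is a full symmetric distance matrix (list of lists);
    members are the row/column indices that belong to the cluster.
    """
    pairs = [dist[i][j] for i, j in combinations(members, 2)]
    if not pairs:  # singleton cluster: no pairwise distances to summarise
        return {"mean": 0.0, "median": 0.0, "std": 0.0}
    return {"mean": mean(pairs), "median": median(pairs), "std": pstdev(pairs)}


def medoid(dist, members):
    """Assumed centroid rule: the member with the smallest summed
    distance to the other members of the same cluster."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


# Made-up 4 x 4 distance matrix; indices 0 to 2 form one cluster.
dist = [
    [0.0, 0.1, 0.3, 0.9],
    [0.1, 0.0, 0.1, 0.8],
    [0.3, 0.1, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
]
summary = cluster_summary(dist, [0, 1, 2])
centre = medoid(dist, [0, 1, 2])
```

Because the centroid in this sketch is chosen from the distances alone, nothing forces it to sit in the middle of the cluster once the sequences are projected onto two principal components. Comparing the `std` values across runs with different sensitivity settings would be one way to apply the selection rule described above.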

