The threshold bootstrap clustering: a new approach to find families or transmission clusters within molecular quasispecies.

Prosperi MC, De Luca A, Di Giambenedetto S, Bracciale L, Fabbiani M, Cauda R, Salemi M - PLoS ONE (2010)

Bottom Line: We introduce the threshold bootstrap clustering (TBC), a new methodology for partitioning molecular sequences, that does not require a phylogenetic tree estimation. TBC was successfully tested for the identification of type-1 human immunodeficiency and hepatitis C virus subtypes, and compared with previously established methodologies. TBC has been shown to be effective for the subtyping of HIV and HCV, and for identifying intra-patient quasispecies.


Affiliation: Infectious Diseases Clinic, Catholic University of the Sacred Heart, Rome, Italy.

ABSTRACT

Background: Phylogenetic methods produce hierarchies of molecular species, inferring knowledge about taxonomy and evolution. However, there is not yet a consensus methodology that provides a crisp partition of taxa, desirable when considering the problem of intra/inter-patient quasispecies classification or infection transmission event identification. We introduce the threshold bootstrap clustering (TBC), a new methodology for partitioning molecular sequences, that does not require a phylogenetic tree estimation.

Methodology/principal findings: The TBC is an incremental partition algorithm, inspired by the stochastic Chinese restaurant process, and takes advantage of resampling techniques and models of sequence evolution. TBC takes as input a multiple alignment of molecular sequences and outputs a crisp partition of the taxa into an automatically determined number of clusters. By varying initial conditions, the algorithm can produce different partitions. We describe a procedure that selects a prime partition from a set of candidate partitions and calculates a measure of cluster reliability. TBC was successfully tested for the identification of human immunodeficiency virus type 1 (HIV-1) and hepatitis C virus (HCV) subtypes, and compared with previously established methodologies. It was also evaluated on the problem of HIV-1 intra-patient quasispecies clustering, and for transmission cluster identification, using a set of sequences from patients with known transmission event histories.
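
The incremental, threshold-driven partitioning described above can be illustrated with a small Python sketch. It is not the authors' implementation, and the abstract does not give the details it would need to be one. Several things here are assumptions made only for illustration: the plain p-distance stands in for a model of sequence evolution, the cutoff is taken as the t-th percentile of all pairwise distances, a sequence joins the first cluster whose mean distance to it is within the cutoff, and the random visiting order plays the role of the "initial conditions". The bootstrap resampling and the selection of a prime partition are left out entirely, and the names `p_distance`, `percentile` and `threshold_partition` are hypothetical.

```python
import math
import random
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ.

    Assumption: a simple stand-in for a model-based evolutionary distance.
    """
    return sum(x != y for x, y in zip(a, b)) / len(a)


def percentile(values, t):
    """Nearest-rank t-th percentile (t between 0 and 100) of a list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * t / 100))
    return ordered[rank - 1]


def threshold_partition(seqs, t, seed=0):
    """Partition aligned sequences incrementally using a distance cutoff.

    seqs: dict mapping a sequence name to an aligned sequence string.
    t:    percentile of the pairwise-distance distribution used as cutoff.
    seed: controls the visiting order (the assumed "initial condition").
    """
    names = list(seqs)
    # All pairwise distances: n * (n - 1) / 2 values for n sequences.
    dists = {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(names, 2)
    }
    cutoff = percentile(list(dists.values()), t)

    random.Random(seed).shuffle(names)
    clusters = []
    for name in names:
        for cluster in clusters:
            mean_d = sum(dists[frozenset((name, m))] for m in cluster) / len(cluster)
            if mean_d <= cutoff:
                cluster.append(name)  # close enough: join this cluster
                break
        else:
            clusters.append([name])  # no cluster is close enough: open a new one
    return clusters


# Toy alignment: two well-separated pairs of sequences.
demo = {
    "a1": "AAAAAAAA",
    "a2": "AAAAAAAT",
    "b1": "CCCCCCCC",
    "b2": "CCCCCCCG",
}
print(threshold_partition(demo, t=30))
```

In this sketch a larger t gives a larger cutoff, so more sequences merge and fewer clusters result. Changing `seed` changes the visiting order, which on less clear-cut data can change the partition.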

Conclusion: TBC has been shown to be effective for the subtyping of HIV and HCV, and for identifying intra-patient quasispecies. To some extent, the algorithm was also able to infer clusters corresponding to events of infection transmission. The computational complexity of TBC is quadratic in the number of taxa, lower than that of other established methods; in addition, TBC has been enhanced with a measure of cluster reliability. The TBC can be useful to characterise molecular quasispecies in a broad context.

pone-0013619-g003: Phylogeny and TBC of HIV/HCV subtypes. Phylogenetic trees constructed for the HCV genotype (panel A) and group M HIV-1 subtype (panel B) reference sets of Los Alamos repositories (neighbour-joining, LogDet distance). Coloured branches represent clusters retrieved by the CTree algorithm, whilst circles represent clusters retrieved by the TBC algorithm.

Mentions: The HIV subtype reference multiple alignment (i) was composed of 38 representative sequences of 11 pure subtypes (on average 3.4 sequences per subtype). TBC was run either on the whole genome (≈9,000 bases) or restricted to the polymerase (pol) gene (≈2,700 bases), which is the gene routinely sequenced in clinical practice for drug-resistance testing. On the full-length genome set, TBC tuned at t = 10 yielded perfect concordance with the HIV subtypes, producing exactly 11 clusters, with an adjusted Rand index (ARI) between clusters and real subtypes of 1.0 and a median (IQR) cluster support of 95% (72%–100%). Figure 2, panel A, depicts the distribution of pairwise distances, using the LogDet estimator, with a median (IQR) distance of 0.050 (0.048–0.051). The CTree also produced a perfect partition, whilst the PAM with silhouette maximisation identified all subtypes exactly except for B and D, which were pooled together. When TBC was run on the pol gene alone, at t = 10 all sequences but one were partitioned correctly (Figure 3, panel B): ARI was 0.967 and median (IQR) cluster support was 92% (63%–98%). The misplaced sequence was a subtype D sequence that clustered alone (support was 42%). Across the whole set of partitions, subtype D often divided into two distinct clusters (formed by either one or two sequences out of four). The CTree algorithm yielded an ARI of 0.955; in this case two subtype D sequences formed a separate cluster, similarly to the TBC. The PAM, instead, pooled subtypes B and D together. As expected, the number of clusters decreased as the percentile threshold increased: for the whole-genome case, at t = 5 the ARI was 0.511 with 26 clusters, whilst at t = 25 the ARI was 0.375 with 6 clusters.
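
The ARI values quoted above compare a clustering against the known subtypes. Assuming ARI here denotes the standard adjusted Rand index, the sketch below computes it from two label lists using the usual pair-counting formula. The function name `adjusted_rand_index` and the toy labels are illustrative only; the example does not reproduce the paper's data or figures.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    # Pairs of items placed together in both labelings (contingency cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together within each labeling on its own.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2          # agreement of a perfect match
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in its own cluster
    return (sum_cells - expected) / (maximum - expected)


# Toy example: three subtypes with two sequences each.
subtypes = ["B", "B", "D", "D", "C", "C"]
perfect = [0, 0, 1, 1, 2, 2]  # one cluster per subtype
pooled = [0, 0, 0, 0, 1, 1]   # subtypes B and D merged into one cluster

print(adjusted_rand_index(subtypes, perfect))
print(adjusted_rand_index(subtypes, pooled))
```

A partition that matches the subtypes exactly scores 1.0, while pooling two subtypes into one cluster lowers the score even though no sequence is separated from its own subtype.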

