A model-based information sharing protocol for profile Hidden Markov Models used for HIV-1 recombination detection.

Bulla I, Schultz AK, Chesneau C, Mark T, Serea F - BMC Bioinformatics (2014)

Bottom Line: In order to implement the proposed protocol, we make use of an existing HMM architecture and its associated inference engine. Thereby, we demonstrate that the performance of pHMMs can be significantly improved by the proposed technique. Moreover, we show that our algorithm performs significantly better than Simplot and Bootscanning.


Affiliation: Institut für Mathematik und Informatik, Universität Greifswald, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany. ingobulla@gmail.com.

ABSTRACT

Background: In many applications, a family of nucleotide or protein sequences classified into several subfamilies has to be modeled. Profile Hidden Markov Models (pHMMs) are widely used for this task, modeling each subfamily separately by one pHMM. However, a major drawback of this approach is the difficulty of dealing with subfamilies composed of very few sequences. One of the most crucial bioinformatics tasks affected by the problem of small-size subfamilies is the subtyping of human immunodeficiency virus type 1 (HIV-1) sequences, since for some HIV-1 subtypes only a small number of sequences is known.

Results: To deal with small samples for particular subfamilies of HIV-1, we introduce a novel model-based information sharing protocol. It estimates the emission probabilities of the pHMM modeling a particular subfamily based not only on the nucleotide frequencies of the respective subfamily but also on the nucleotide frequencies of all available subfamilies. To this end, the underlying probabilistic model mimics the pattern of commonality and variation between the subtypes with regard to the biological characteristics of HI viruses. In order to implement the proposed protocol, we make use of an existing HMM architecture and its associated inference engine.

Conclusions: We apply the modified algorithm to classify HIV-1 sequence data in the form of partial HIV-1 sequences and semi-artificial recombinants. Thereby, we demonstrate that the performance of pHMMs can be significantly improved by the proposed technique. Moreover, we show that our algorithm performs significantly better than Simplot and Bootscanning.
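The information-sharing idea described under Results can be illustrated with a small sketch. It is not the probabilistic model of the paper, which is not specified in this abstract; it is a generic shrinkage estimator, assumed here purely for illustration, that blends the nucleotide counts of one subtype at a single alignment column with the pooled frequencies of all subtypes. The function names and the `prior_weight` parameter are hypothetical.

```python
from typing import Dict

NUCLEOTIDES = "ACGT"

Counts = Dict[str, int]


def pooled_frequencies(counts_by_subtype: Dict[str, Counts]) -> Dict[str, float]:
    """Nucleotide frequencies over all subtypes at one column.

    Add-one smoothing keeps every frequency strictly positive.
    """
    totals = {n: 1 for n in NUCLEOTIDES}
    for counts in counts_by_subtype.values():
        for n in NUCLEOTIDES:
            totals[n] += counts.get(n, 0)
    grand_total = sum(totals.values())
    return {n: totals[n] / grand_total for n in NUCLEOTIDES}


def shared_emissions(
    counts_by_subtype: Dict[str, Counts],
    subtype: str,
    prior_weight: float = 4.0,  # hypothetical strength of the shared information
) -> Dict[str, float]:
    """Emission estimate for one subtype, shrunk towards the pooled frequencies.

    Illustrative assumption only: the pooled frequencies act as pseudocounts
    worth `prior_weight` observations, so subtypes with few sequences borrow
    more from the other subtypes than subtypes with many sequences.
    """
    pooled = pooled_frequencies(counts_by_subtype)
    own = counts_by_subtype[subtype]
    n_own = sum(own.get(n, 0) for n in NUCLEOTIDES)
    return {
        n: (own.get(n, 0) + prior_weight * pooled[n]) / (n_own + prior_weight)
        for n in NUCLEOTIDES
    }


# Toy column: subtype B is well sampled, subtype H has only three sequences.
column = {"B": {"A": 95, "G": 5}, "H": {"A": 2, "G": 1}}
emissions_h = shared_emissions(column, "H")
emissions_b = shared_emissions(column, "B")
```

In this toy column the estimate for the small subtype H moves noticeably towards the pooled frequencies, while the estimate for the well-sampled subtype B stays close to its own observed frequencies.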



Figure 12: The classification of CRF04. The classifications as given on the LANL website (first) and provided by jpHMM semi (second), jpHMM ml (third), and jpHMM prob (last), respectively. Here, Subtype U stands for an unknown subtype.

Mentions: To verify that jpHMM prob can handle full-length sequences exhibiting a recombination pattern observed in a real recombinant, we apply jpHMM semi, jpHMM prob, and jpHMM ml to the sequence AF119819, which is classified as a recombinant of type CRF04 [30,31] in the LANL database. CRF04 is one of the CRFs whose genome allegedly stems in part from small-size subtypes (subtypes H and K for CRF04). The segmentation of CRF04 provided in the LANL database was obtained by Bootscanning. Since in this kind of application one normally does not want the subtype-wise segmentation to become too fragmented, we use p_jump = 10^-8 for jpHMM semi, p_jump = 10^-6 for jpHMM prob, and p_jump = 10^-7 for jpHMM ml. We have to use a considerably smaller jump probability for jpHMM semi than for jpHMM prob in order to smooth out the misclassifications that jpHMM semi makes because of its lower-performing estimator of the emission probabilities. The results, together with the classification given on the LANL website, are shown in Figure 12.
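The effect of the jump probability on fragmentation can be illustrated with a toy Viterbi decoder. This is a minimal sketch under stated assumptions, not the jpHMM implementation: each hidden state stands for one subtype, the per-position log-likelihoods are made-up numbers, and a jump to any other subtype is assumed to have total probability `p_jump`, split evenly. All function and variable names are hypothetical.

```python
import math
from typing import List, Sequence


def viterbi_segmentation(log_emis: Sequence[Sequence[float]], p_jump: float) -> List[int]:
    """Most likely subtype per position.

    log_emis[t][k] is the log-likelihood of position t under subtype k.
    Assumes at least two subtypes and a uniform start distribution.
    """
    n_states = len(log_emis[0])
    log_stay = math.log(1.0 - p_jump)
    log_jump = math.log(p_jump / (n_states - 1))

    def step(j: int, k: int) -> float:
        return log_stay if j == k else log_jump

    score = list(log_emis[0])
    back: List[List[int]] = []
    for t in range(1, len(log_emis)):
        new_score, pointers = [], []
        for k in range(n_states):
            best_prev = max(range(n_states), key=lambda j: score[j] + step(j, k))
            new_score.append(score[best_prev] + step(best_prev, k) + log_emis[t][k])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def count_segments(path: Sequence[int]) -> int:
    """Number of maximal runs of identical subtype labels."""
    return 1 + sum(1 for a, b in zip(path, path[1:]) if a != b)


# Toy data: 30 positions favour subtype 0, except a two-position blip
# (positions 10 and 11) that favours subtype 1.
favours_0 = [math.log(0.9), math.log(0.1)]
favours_1 = [math.log(0.1), math.log(0.9)]
toy = [favours_1 if t in (10, 11) else favours_0 for t in range(30)]

segments_large_jump = count_segments(viterbi_segmentation(toy, 0.2))   # 3
segments_small_jump = count_segments(viterbi_segmentation(toy, 1e-6))  # 1
```

With the larger jump probability the short blip is reported as a separate segment; with the small one it is absorbed into the surrounding subtype. This is the smoothing behaviour the passage relies on when it lowers the jump probability for the method with the weaker emission estimator.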

