A model-based information sharing protocol for profile Hidden Markov Models used for HIV-1 recombination detection.

Bulla I, Schultz AK, Chesneau C, Mark T, Serea F - BMC Bioinformatics (2014)

Bottom Line: In order to implement the proposed protocol, we make use of an existing HMM architecture and its associated inference engine. Thereby, we demonstrate that the performance of pHMMs can be significantly improved by the proposed technique. Moreover, we show that our algorithm performs significantly better than Simplot and Bootscanning.


Affiliation: Institut für Mathematik und Informatik, Universität Greifswald, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany. ingobulla@gmail.com.

ABSTRACT

Background: In many applications, a family of nucleotide or protein sequences classified into several subfamilies has to be modeled. Profile Hidden Markov Models (pHMMs) are widely used for this task, with each subfamily modeled separately by one pHMM. A major drawback of this approach, however, is the difficulty of dealing with subfamilies composed of very few sequences. One of the most crucial bioinformatics tasks affected by the problem of small subfamilies is the subtyping of human immunodeficiency virus type 1 (HIV-1) sequences, since for some HIV-1 subtypes only a small number of sequences is known.

Results: To deal with small samples for particular subfamilies of HIV-1, we introduce a novel model-based information sharing protocol. It estimates the emission probabilities of the pHMM modeling a particular subfamily based not only on the nucleotide frequencies of that subfamily but also on the nucleotide frequencies of all available subfamilies. To this end, the underlying probabilistic model mimics the pattern of commonality and variation between the subtypes with regard to the biological characteristics of HIV. To implement the proposed protocol, we make use of an existing HMM architecture and its associated inference engine.

Conclusions: We apply the modified algorithm to classify HIV-1 sequence data in the form of partial HIV-1 sequences and semi-artificial recombinants. Thereby, we demonstrate that the performance of pHMMs can be significantly improved by the proposed technique. Moreover, we show that our algorithm performs significantly better than Simplot and Bootscanning.

Figure 5: Detailed calculation of the emission probabilities. A simplified example of how the emission probabilities are derived from the nucleotide frequencies. The nucleotide frequencies for each subtype are given at the top, and the emission probabilities derived from them at the bottom. In between, for each source combination, the probability of that source combination given the nucleotide frequencies is shown on the left, and the emission probabilities of each subtype under that source combination are given on the right. The coloring indicates the source to which each subtype belongs in the respective source combination. For example, the third column in the middle block represents the source combination in which Subtypes A and C belong to one source (green) and Subtype B to another (red), yielding emission probabilities of 0.844 and 0.156, respectively, for the green source and of 0.082 and 0.918, respectively, for the red one (some details of this table are elaborated in the supplements). The example is simplified by restricting it to 3 subtypes and 2 nucleotides.

Mentions: With these two formulas, is calculable using only known quantities. An example illustrating the details on the calculation of the emission probabilities is given in Figure 5. We provide further details on the calculations presented in this subsection in the Additional file 1: Supplements.
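The structure of the calculation described in the Figure 5 caption (enumerate the source combinations, weight each by its probability given the nucleotide frequencies, and combine the per-source emission probabilities) can be illustrated with a minimal Python sketch. This is not the authors' formula, which is not reproduced on this page. The sketch assumes a uniform prior over source combinations, a symmetric Dirichlet pseudocount `alpha`, and a weighted average as the final combination step. The function names and the example counts are hypothetical and are not the values shown in Figure 5.

```python
from math import exp, lgamma


def set_partitions(items):
    """Yield every way to split `items` into non-empty groups.

    Each partition plays the role of one "source combination":
    subtypes in the same group share one source.
    """
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing group in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a group of its own.
        yield [[first]] + partition


def log_marginal(counts, alpha):
    """Log marginal likelihood of pooled counts under a symmetric
    Dirichlet(alpha) prior (assumed here, not taken from the paper)."""
    k, n = len(counts), sum(counts)
    return (lgamma(k * alpha) - lgamma(k * alpha + n)
            + sum(lgamma(alpha + c) - lgamma(alpha) for c in counts))


def shared_emissions(counts, alpha=1.0):
    """Average per-subtype emission probabilities over all source combinations."""
    subtypes = list(counts)
    n_sym = len(counts[subtypes[0]])
    scored = []
    for partition in set_partitions(subtypes):
        log_w, emis = 0.0, {}
        for group in partition:
            # Subtypes sharing a source pool their nucleotide counts.
            pooled = [sum(counts[s][j] for s in group) for j in range(n_sym)]
            log_w += log_marginal(pooled, alpha)
            total = sum(pooled) + n_sym * alpha
            probs = [(c + alpha) / total for c in pooled]
            for s in group:
                emis[s] = probs
        scored.append((log_w, emis))
    # Normalise the weights; the uniform prior over partitions is an assumption.
    top = max(lw for lw, _ in scored)
    weights = [exp(lw - top) for lw, _ in scored]
    z = sum(weights)
    result = {s: [0.0] * n_sym for s in subtypes}
    for w, (_, emis) in zip(weights, scored):
        for s in subtypes:
            for j in range(n_sym):
                result[s][j] += (w / z) * emis[s][j]
    return result


# Hypothetical counts for 3 subtypes and 2 nucleotides at one alignment column.
example = shared_emissions({"A": [27, 5], "B": [1, 11], "C": [3, 0]})
for subtype, probs in example.items():
    print(subtype, [round(p, 3) for p in probs])
```

In this sketch a subtype with few sequences (here "C") borrows strength from the other subtypes, because source combinations that pool it with a better-sampled subtype contribute to its estimate in proportion to their weight. Enumerating all partitions is only feasible for a small number of subtypes, since the number of partitions grows very quickly.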

