A model-based information sharing protocol for profile Hidden Markov Models used for HIV-1 recombination detection.

Bulla I, Schultz AK, Chesneau C, Mark T, Serea F - BMC Bioinformatics (2014)

Bottom Line: In order to implement the proposed protocol, we make use of an existing HMM architecture and its associated inference engine. Thereby, we demonstrate that the performance of pHMMs can be significantly improved by the proposed technique. Moreover, we show that our algorithm performs significantly better than Simplot and Bootscanning.


Affiliation: Institut für Mathematik und Informatik, Universität Greifswald, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany. ingobulla@gmail.com.

ABSTRACT

Background: In many applications, a family of nucleotide or protein sequences classified into several subfamilies has to be modeled. Profile Hidden Markov Models (pHMMs) are widely used for this task, modeling each subfamily separately by one pHMM. However, a major drawback of this approach is the difficulty of dealing with subfamilies composed of very few sequences. One of the most crucial bioinformatical tasks affected by the problem of small-size subfamilies is the subtyping of human immunodeficiency virus type 1 (HIV-1) sequences, in particular for HIV-1 subtypes for which only a small number of sequences is known.

Results: To deal with small samples for particular subfamilies of HIV-1, we introduce a novel model-based information sharing protocol. It estimates the emission probabilities of the pHMM modeling a particular subfamily based not only on the nucleotide frequencies of the respective subfamily but also on the nucleotide frequencies of all available subfamilies. To this end, the underlying probabilistic model mimics the pattern of commonality and variation between the subtypes with regard to the biological characteristics of HI viruses. To implement the proposed protocol, we make use of an existing HMM architecture and its associated inference engine.

Conclusions: We apply the modified algorithm to classify HIV-1 sequence data in the form of partial HIV-1 sequences and semi-artificial recombinants. Thereby, we demonstrate that the performance of pHMMs can be significantly improved by the proposed technique. Moreover, we show that our algorithm performs significantly better than Simplot and Bootscanning.


Figure 4: Restricted search of the space of source combinations. The heuristic described in Section ‘Details on the information sharing protocol - Restricting the search space’ is illustrated. In this example, we assume 4 subtypes A, B, C, and D. Each line gives one source combination, and the source combinations belonging to the same $S_r$ are grouped together. The most likely source combination in each $S_r$ is colored in red. The arrows indicate how the spaces are traversed during the search process.

Mentions: In [27], we restricted our site-wise search in $S$ to $(S_r)_{r \le 3}$, where $S_r$ is the space of source combinations composed of $r$ sources. Since $|S_r| = S(F, r)$ (with $S(\cdot,\cdot)$ the Stirling numbers of the second kind) and $|S| = B(F)$ (with $B(\cdot)$ the Bell numbers), a brute-force search in the entire space $S$ would imply a considerable computational burden: increasing $R_{\max}$ from 3 to 6 would increase the computational effort by about a factor of 6. Hence, we restrict the search in $S$ by the following procedure, illustrated in Figure 4. We search $(S_r)_{r \le 6}$ successively, starting with $S_1$, which contains only one source combination. Before searching $S_r$, $r \ge 2$, we determine the most likely element of $S_{r-1}$. Then we restrict our search in $S_r$ to those source combinations which can be obtained from this element by dividing one of its sources into two. The search is conducted only in this subset of $S_r$.
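The counting argument and the restricted search can be illustrated with a minimal Python sketch. It is not the authors' implementation: a source combination is modeled here as a partition of the subtype set (a list of frozensets), and the names `stirling2`, `bell`, `split_candidates`, `restricted_search` and the `score` callback (a stand-in for whatever likelihood the actual inference engine provides) are all assumptions made for illustration.

```python
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: ways to split n items into k non-empty groups."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def bell(n):
    """Bell number: total number of partitions of n items."""
    return sum(stirling2(n, k) for k in range(n + 1))


def split_candidates(partition):
    """All partitions obtained by dividing exactly one block of `partition` into two."""
    result = []
    for i, block in enumerate(partition):
        items = sorted(block)
        if len(items) < 2:
            continue  # a single-subtype source cannot be divided further
        first, rest = items[0], items[1:]
        others = partition[:i] + partition[i + 1:]
        # Keep `first` in the left part so that each split is generated only once;
        # size < len(rest) guarantees that the right part is non-empty.
        for size in range(len(rest)):
            for extra in combinations(rest, size):
                left = frozenset((first,) + extra)
                right = block - left
                result.append(others + [left, right])
    return result


def restricted_search(subtypes, score, r_max):
    """Greedy traversal: the best element found for r-1 sources seeds the search for r sources.

    `score` is a hypothetical callback returning the likelihood of a partition.
    Returns the best partition found for each r = 1, ..., r_max.
    """
    best = [frozenset(subtypes)]  # the only source combination with one source
    path = [best]
    for _ in range(2, r_max + 1):
        candidates = split_candidates(best)
        if not candidates:
            break  # every source already holds a single subtype
        best = max(candidates, key=score)
        path.append(best)
    return path
```

As a rough plausibility check under an assumed F = 9 (a value not stated in this excerpt), `sum(stirling2(9, r) for r in range(1, 7)) / sum(stirling2(9, r) for r in range(1, 4))` evaluates to about 6.3, which is of the same order as the factor quoted above. In the restricted search, by contrast, a block of m subtypes contributes only 2^(m-1) - 1 candidate splits at each step.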

