A model-based information sharing protocol for profile Hidden Markov Models used for HIV-1 recombination detection.

Bulla I, Schultz AK, Chesneau C, Mark T, Serea F - BMC Bioinformatics (2014)

Bottom Line: In order to implement the proposed protocol, we make use of an existing HMM architecture and its associated inference engine. Thereby, we demonstrate that the performance of pHMMs can be significantly improved by the proposed technique. Moreover, we show that our algorithm performs significantly better than Simplot and Bootscanning.


Affiliation: Institut für Mathematik und Informatik, Universität Greifswald, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany. ingobulla@gmail.com.

ABSTRACT

Background: In many applications, a family of nucleotide or protein sequences classified into several subfamilies has to be modeled. Profile Hidden Markov Models (pHMMs) are widely used for this task, with each subfamily modeled separately by one pHMM. However, a major drawback of this approach is the difficulty of dealing with subfamilies composed of very few sequences. One of the most important bioinformatics tasks affected by small subfamilies is the subtyping of human immunodeficiency virus type 1 (HIV-1) sequences, since only a small number of sequences is known for some HIV-1 subtypes.

Results: To deal with small samples for particular subfamilies of HIV-1, we introduce a novel model-based information sharing protocol. It estimates the emission probabilities of the pHMM modeling a particular subfamily based not only on the nucleotide frequencies of that subfamily but also on the nucleotide frequencies of all available subfamilies. To this end, the underlying probabilistic model mimics the pattern of commonality and variation between the subtypes with regard to the biological characteristics of HIV. In order to implement the proposed protocol, we make use of an existing HMM architecture and its associated inference engine.

Conclusions: We apply the modified algorithm to classify HIV-1 sequence data in the form of partial HIV-1 sequences and semi-artificial recombinants. Thereby, we demonstrate that the performance of pHMMs can be significantly improved by the proposed technique. Moreover, we show that our algorithm performs significantly better than Simplot and Bootscanning.


Figure 1: Examples of the calculation of the emission probabilities. Simplified example of position- and subtype-wise nucleotide frequencies of HIV and the emission probabilities derived from them using the presented information sharing protocol. For three sites, the subtype-wise nucleotide frequencies for the four subtypes A-D are given on the left side of the table. Below them, the emission probabilities estimated based only on the frequencies of the respective subtype are shown, using pseudocounts. The colors indicate which subtypes should be jointly modeled in order to get the most likely source combination. The nucleotide frequencies of the sources (i.e., the aggregated frequencies of the subtypes belonging to each source) as well as the emission probabilities estimated based on these frequencies are given on the right side of the table (using the same pseudocounts). For the sake of simplicity, we assume that only the nucleotides G and T occur. Apart from this simplification and the restriction to four subtypes, this example is taken from actual HIV-1 sequences.
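
The calculation described in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the counts are made up, the pseudocount value of 1.0 is an assumption (the value used in the figure is not recoverable from this page), the source combination is fixed by hand rather than chosen as the most likely one, and the function names emission_probs and pool_counts are hypothetical.

```python
from typing import Dict, List

NUCLEOTIDES = ("G", "T")  # two-letter alphabet, as in the simplified figure


def emission_probs(counts: Dict[str, int], pseudocount: float = 1.0) -> Dict[str, float]:
    """Additive smoothing: (count + pseudocount) / (total + K * pseudocount)."""
    total = sum(counts.get(n, 0) for n in NUCLEOTIDES) + pseudocount * len(NUCLEOTIDES)
    return {n: (counts.get(n, 0) + pseudocount) / total for n in NUCLEOTIDES}


def pool_counts(subtype_counts: Dict[str, Dict[str, int]], members: List[str]) -> Dict[str, int]:
    """Aggregate the nucleotide counts of all subtypes that share one source."""
    pooled = {n: 0 for n in NUCLEOTIDES}
    for subtype in members:
        for n in NUCLEOTIDES:
            pooled[n] += subtype_counts[subtype].get(n, 0)
    return pooled


# Hypothetical counts at a single alignment site (not taken from the paper).
site_counts = {
    "A": {"G": 40, "T": 0},
    "B": {"G": 1, "T": 30},
    "C": {"G": 2, "T": 0},   # small subtype: only two sequences
    "D": {"G": 0, "T": 3},   # small subtype: only three sequences
}

# Assumed source combination for this site: {A, C} -> 1, {B, D} -> 2.
source_combination = [["A", "C"], ["B", "D"]]

# Emission probabilities estimated from each subtype alone.
per_subtype = {s: emission_probs(c) for s, c in site_counts.items()}

# Emission probabilities estimated from the pooled counts of each source;
# every subtype in a source receives the same estimate.
per_source: Dict[str, Dict[str, float]] = {}
for members in source_combination:
    probs = emission_probs(pool_counts(site_counts, members))
    for subtype in members:
        per_source[subtype] = probs

for s in sorted(site_counts):
    print(s, per_subtype[s], per_source[s])
```

With these made-up counts, the estimate for the small subtype C moves from 3/4 for G (its own two sequences plus pseudocounts) to 43/44 once it shares a source with the well-sampled subtype A, which is the effect the information sharing protocol aims for.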

Mentions: For the reasons just explained, we model the emission frequencies of the subtypes jointly (see Figure 1 for examples). In the following, we refer to subtypes being modeled jointly as sharing the same source. Moreover, an assignment of each subtype to its respective source is called a “source combination”. That is, if at a given site Subtypes A, C, G, and H are modeled by one source, B and K by another, and D, F, and J by a third one, the assignment {A,C,G,H}→1, {B,K}→2, {D,F,J}→3 constitutes the source combination of the respective site. In a more general context, source combinations are known as set partitions (which play a role, e.g., in determining the number of state-context trees of parsimonious higher-order HMMs [25]). Since the number of source combinations grows rapidly with the number of subtypes, we have to restrict the search space when determining the optimal source combination (see Subsection ‘Methods - Details on the information sharing protocol’).
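
To make the growth of the search space concrete, the following Python sketch enumerates all set partitions of a list of subtypes and counts them with the Bell triangle. It only illustrates the combinatorics; the helper names set_partitions and bell_number are hypothetical, and the sketch says nothing about how the paper actually restricts the search.

```python
from typing import Iterator, List, Sequence


def set_partitions(items: Sequence[str]) -> Iterator[List[List[str]]]:
    """Yield every partition of `items` into non-empty blocks (sources)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or let it open a new block of its own.
        yield [[first]] + partition


def bell_number(n: int) -> int:
    """Number of partitions of an n-element set, computed via the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


# The four subtypes of the simplified figure and the nine named in the text.
four = ["A", "B", "C", "D"]
nine = ["A", "B", "C", "D", "F", "G", "H", "J", "K"]

print(len(list(set_partitions(four))))  # number of source combinations for 4 subtypes
print(bell_number(len(nine)))           # number of source combinations for 9 subtypes
```

Four subtypes admit 15 source combinations, whereas the nine subtypes named above already admit 21147 per site, so an exhaustive search at every alignment position becomes expensive quickly.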

