A model-based information sharing protocol for profile Hidden Markov Models used for HIV-1 recombination detection.

Bulla I, Schultz AK, Chesneau C, Mark T, Serea F - BMC Bioinformatics (2014)

Bottom Line: In order to implement the proposed protocol, we make use of an existing HMM architecture and its associated inference engine. Thereby, we demonstrate that the performance of pHMMs can be significantly improved by the proposed technique. Moreover, we show that our algorithm performs significantly better than Simplot and Bootscanning.


Affiliation: Institut für Mathematik und Informatik, Universität Greifswald, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany. ingobulla@gmail.com.

ABSTRACT

Background: In many applications, a family of nucleotide or protein sequences classified into several subfamilies has to be modeled. Profile Hidden Markov Models (pHMMs) are widely used for this task, with each subfamily modeled separately by one pHMM. However, a major drawback of this approach is the difficulty of dealing with subfamilies composed of very few sequences. One of the most crucial bioinformatics tasks affected by the problem of small subfamilies is the subtyping of human immunodeficiency virus type 1 (HIV-1) sequences, since for some HIV-1 subtypes only a small number of sequences is known.

Results: To deal with small samples for particular subfamilies of HIV-1, we introduce a novel model-based information sharing protocol. It estimates the emission probabilities of the pHMM modeling a particular subfamily not only from the nucleotide frequencies of that subfamily but also from the nucleotide frequencies of all available subfamilies. To this end, the underlying probabilistic model mimics the pattern of commonality and variation between the subtypes with regard to the biological characteristics of HIV. To implement the proposed protocol, we make use of an existing HMM architecture and its associated inference engine.
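The idea of sharing information across subfamilies can be illustrated with a generic shrinkage estimate. The sketch below is not the probabilistic model of the paper, which is not specified in this summary; it only shows, under simple assumptions, how the emission probabilities of a small subtype can borrow strength from the pooled nucleotide frequencies of all subtypes at the same alignment column. The function names and the `strength` parameter are hypothetical.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def column_counts(column):
    """Count A, C, G, T in one alignment column; gaps and other symbols are ignored."""
    counts = Counter(ch for ch in column.upper() if ch in NUCLEOTIDES)
    return [counts[n] for n in NUCLEOTIDES]


def shared_emissions(subtype_columns, target, strength=4.0):
    """Shrinkage estimate of the emission probabilities of `target` at one column.

    subtype_columns maps a subtype name to the string of residues its
    sequences show in this column. `strength` (an illustrative assumption)
    is the weight of the pooled distribution, expressed as an equivalent
    number of pseudo-observations.
    """
    # Pool the counts of all subtypes, then apply add-one smoothing so
    # that no nucleotide ever receives probability zero.
    pooled = [0, 0, 0, 0]
    for column in subtype_columns.values():
        for i, c in enumerate(column_counts(column)):
            pooled[i] += c
    pooled_total = sum(pooled) + len(NUCLEOTIDES)
    prior = [(c + 1) / pooled_total for c in pooled]

    # Blend the subtype's own counts with the pooled prior: the fewer
    # sequences the subtype has, the more the prior dominates.
    own = column_counts(subtype_columns[target])
    n = sum(own)
    return [(own[i] + strength * prior[i]) / (n + strength) for i in range(4)]


columns = {"A": "AAAAAAAAAA", "B": "AAAAAAAAAA", "J": "G"}
print(dict(zip(NUCLEOTIDES, shared_emissions(columns, "J"))))
print(dict(zip(NUCLEOTIDES, shared_emissions(columns, "A"))))
```

With a single observed sequence, the estimate for the hypothetical subtype "J" stays close to the pooled frequencies, whereas the estimate for the well-sampled subtype "A" is driven mostly by its own counts.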

Conclusions: We apply the modified algorithm to classify HIV-1 sequence data in the form of partial HIV-1 sequences and semi-artificial recombinants. Thereby, we demonstrate that the performance of pHMMs can be significantly improved by the proposed technique. Moreover, we show that our algorithm performs significantly better than Simplot and Bootscanning.



Figure 3: The underlying model of jpHMM, illustrated using a toy example. The model is built from a DNA MSA composed of two subtypes, the first with three sequences and the second with two. For each match and insert state, a vector of emission probabilities for the nucleotides is given. For the sake of clarity, the majority of transitions between the two subtypes are omitted. Moreover, the delete state directly to the right of the begin state B (from which one can go to each match state) and the delete state directly to the left of the end state E (to which one can go from each match state) are left out. High transition probabilities are represented by thick lines, low probabilities by thin lines, and the jumps between the subtypes by dashed lines. The Viterbi path for the query sequence is colored in blue: the first two positions of the query are assigned to Subtype 1 and the last two to Subtype 2.

Mentions: The recombination and breakpoint detection tool jpHMM requires a pre-calculated MSA of the HIV-1 subtypes as input (see Figure 2). Each subtype in the MSA is modeled by a separate pHMM (see Figure 3). In addition to the usual transitions within these pHMMs, the model allows so-called jumps between different pHMMs at nearly any position in the MSA. That is, the model can jump between states associated with different subtypes, depending on the local similarity of the query sequence to the different subtypes. The complete model, including the setting of the hyper-parameters, is detailed in [4].
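The jump mechanism described above can be sketched with a heavily simplified Viterbi decoder. This is not the jpHMM implementation: it assumes match states only (no insert or delete states), a query that is already aligned column by column to the MSA, a uniform start distribution, and a single hypothetical `jump_prob` parameter shared by all positions. The toy emission values are invented for illustration.

```python
import math


def viterbi_jump(query, emissions, jump_prob=0.05):
    """Assign each query position to a subtype, allowing jumps between profiles.

    emissions maps a subtype name to a list with one dict per alignment
    column, each dict mapping a nucleotide to a non-zero probability.
    The query must have exactly one nucleotide per column.
    """
    subtypes = list(emissions)
    k = len(subtypes)
    stay = math.log(1.0 - jump_prob)
    jump = math.log(jump_prob / (k - 1)) if k > 1 else float("-inf")

    # Initialisation: uniform start over subtypes plus the first emission.
    score = {s: -math.log(k) + math.log(emissions[s][0][query[0]]) for s in subtypes}
    backpointers = []

    # Recursion: for each column, pick the best predecessor subtype.
    for i in range(1, len(query)):
        new_score, pointers = {}, {}
        for s in subtypes:
            best = max(subtypes, key=lambda p: score[p] + (stay if p == s else jump))
            transition = stay if best == s else jump
            new_score[s] = score[best] + transition + math.log(emissions[s][i][query[i]])
            pointers[s] = best
        score = new_score
        backpointers.append(pointers)

    # Traceback from the best final subtype.
    path = [max(subtypes, key=score.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def profile(consensus, major=0.85):
    """Toy profile: the consensus nucleotide gets `major`, the rest share the remainder."""
    minor = (1.0 - major) / 3
    return [{n: (major if n == c else minor) for n in "ACGT"} for c in consensus]


toy = {"Subtype 1": profile("ACGT"), "Subtype 2": profile("TGCA")}
print(viterbi_jump("ACCA", toy))
```

For this toy input the first two query positions match the consensus of Subtype 1 and the last two match that of Subtype 2, so the decoded path contains one jump, mirroring the situation shown in Figure 3. Lowering `jump_prob` makes jumps more expensive, so a larger run of locally better matches is needed before the path switches subtype.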

