Transposon identification using profile HMMs.

Edlefsen PT, Liu JS - BMC Genomics (2010)

Bottom Line: Transposons are known to affect transcriptional regulation in several different ways, and are implicated in many human diseases. Modeling transposon-derived genomic content is difficult because it is poorly conserved. The DNA domain is especially problematic for the Baum-Welch algorithm, since it has only four letters as opposed to the twenty residues of the amino acid alphabet.


Affiliation: Department of Statistics, Harvard University, Cambridge, MA, USA. edlefsen@stat.harvard.edu

ABSTRACT

Background: Transposons are "jumping genes" that account for large quantities of repetitive content in genomes. They are known to affect transcriptional regulation in several different ways, and are implicated in many human diseases. Transposons are related to microRNAs and viruses, and many genes, pseudogenes, and gene promoters are derived from transposons or have origins in transposon-induced duplication. Modeling transposon-derived genomic content is difficult because it is poorly conserved. Profile hidden Markov models (profile HMMs), widely used for protein sequence family modeling, are rarely used for modeling DNA sequence families. The algorithm commonly used to estimate the parameters of profile HMMs, Baum-Welch, is prone to converging prematurely to local optima. The DNA domain is especially problematic for the Baum-Welch algorithm, since it has only four letters as opposed to the twenty residues of the amino acid alphabet.

Results: We demonstrate with a simulation study and with an application to modeling the MIR family of transposons that two recently introduced methods, Conditional Baum-Welch and Dynamic Model Surgery, achieve better estimates of the parameters of profile HMMs across a range of conditions.

Conclusions: We argue that these new algorithms expand the range of potential applications of profile HMMs to many important DNA sequence family modeling problems, including that of searching for and modeling the virus-like transposons that are found in all known genomes.



Figure 3: DNA training data simulation results. Each row in (a) and each set of bars in (b) corresponds to a different conservation level in the 4 "true profiles". The first column and the top bar (white) depict the average log-probability of the training sequences using the 16 starting profiles. The second (amber) depicts the average log-probability of the training sequences using the true profiles. The third (red) depicts the average log-probability using the Baum-Welch (BW) algorithm. The fourth (blue) depicts the same quantity using the Conditional Baum-Welch (CBW) algorithm. The fifth (yellow) depicts it using the Dynamic Model Surgery (DMS) algorithm with the BW algorithm, and the sixth (green) depicts it using the CBW and DMS algorithms together. All bars in (b) have been shifted so that the lowest bar is at 0. The new algorithms all outperform BW at conservation levels above .6. At .5 and below, CBW does not outperform BW. At the highest levels, both algorithms employing DMS perform equally well.
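The "average log-probability of the training sequences" reported in Figure 3 can be illustrated with a minimal Python sketch. It assumes a plain two-state HMM over the four-letter DNA alphabet, not the profile HMM architecture used in the paper, and it uses the standard forward algorithm in log space. The function names `forward_log_prob` and `average_log_prob`, the parameter values, and the example sequences are all hypothetical and chosen only for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_prob(seq, log_init, log_trans, log_emit, alphabet="ACGT"):
    """Log-probability of seq under a simple HMM (forward algorithm)."""
    idx = {c: i for i, c in enumerate(alphabet)}
    n_states = len(log_init)
    # alpha[k] = log P(symbols seen so far, current state = k)
    alpha = [log_init[k] + log_emit[k][idx[seq[0]]] for k in range(n_states)]
    for ch in seq[1:]:
        o = idx[ch]
        alpha = [
            logsumexp([alpha[j] + log_trans[j][k] for j in range(n_states)])
            + log_emit[k][o]
            for k in range(n_states)
        ]
    return logsumexp(alpha)

def average_log_prob(seqs, log_init, log_trans, log_emit):
    """Mean per-sequence log-probability over a set of sequences."""
    total = sum(forward_log_prob(s, log_init, log_trans, log_emit) for s in seqs)
    return total / len(seqs)

def to_log(rows):
    """Convert rows of strictly positive probabilities to log-probabilities."""
    return [[math.log(p) for p in row] for row in rows]

# Hypothetical toy parameters (assumptions, not taken from the paper).
log_init = [math.log(0.5), math.log(0.5)]
log_trans = to_log([[0.9, 0.1], [0.2, 0.8]])
log_emit = to_log([[0.25, 0.25, 0.25, 0.25],   # state 0: uniform over A, C, G, T
                   [0.40, 0.10, 0.10, 0.40]])  # state 1: A/T-rich

toy_training_set = ["ACGTAC", "TTAATA", "GGCATT"]
print(average_log_prob(toy_training_set, log_init, log_trans, log_emit))
```

With only four symbols, emission distributions in different states can differ by relatively little, which gives one intuition for why the abstract describes the DNA domain as harder for Baum-Welch than the twenty-letter amino acid alphabet.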

Mentions: Figure 3 depicts DNA training-data results averaged over 16 runs (four starting parameter values for each of four training sets) at each conservation level. Figure 4 depicts DNA test-data results averaged over the 16 runs at each conservation level. In both the training results and the test results, the new algorithms all outperform BW at conservation levels above .5, with DMS providing a greater improvement than CBW. These results indicate that the combination of the CBW and DMS algorithms provides the greatest improvement over the other algorithm combinations at the intermediate conservation levels of .6 and .7 (about the level of the oldest identifiable transposons in the human genome). At higher conservation levels the CBW and DMS algorithms together do not outperform DMS alone, though both do better than BW or CBW alone. In an analogous simulation study for protein sequences (data not shown), the new algorithms all outperform BW at every conservation level, with DMS providing a greater improvement than CBW, and the new algorithms together demonstrating the greatest improvement.
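The aggregation described here (averaging over 16 runs per conservation level, then shifting the bars so that the lowest sits at 0) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `average_over_runs` and `shift_to_zero`, the generic method labels, and every number are placeholders, not values or code from the paper.

```python
from statistics import mean

def average_over_runs(run_scores):
    """Map each method to the mean of its per-run average log-probabilities."""
    return {method: mean(scores) for method, scores in run_scores.items()}

def shift_to_zero(bars):
    """Shift all bars by a common offset so that the lowest bar is at 0."""
    lowest = min(bars.values())
    return {method: value - lowest for method, value in bars.items()}

# Placeholder scores only (NOT results from the paper): 16 runs per method,
# i.e. 4 starting parameter values x 4 training sets at one conservation level.
# In practice the keys would be the compared methods (BW, CBW, and so on).
toy_runs = {
    "method_A": [-105.0] * 16,
    "method_B": [-104.0] * 16,
    "method_C": [-102.0] * 16,
}

bars = shift_to_zero(average_over_runs(toy_runs))
print(bars)
```

Because the shift subtracts the same offset from every bar, differences between methods are preserved while absolute log-probability values are not shown.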

