A new repeat-masking method enables specific detection of homologous sequences.

Frith MC - Nucleic Acids Res. (2010)

Bottom Line: Homology search is confounded by simple repeats, which give rise to strong similarities that are not homologies. The TANTAN repeat-masking method thoroughly eliminates spurious homology predictions for DNA-DNA, protein-protein and DNA-protein comparisons. Moreover, it enables accurate homology search for non-coding DNA with extreme A + T composition.


Affiliation: Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Sequence Analysis Team, 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan. martin@cbrc.jp

ABSTRACT
Biological sequences are often analyzed by detecting homologous regions between them. Homology search is confounded by simple repeats, which give rise to strong similarities that are not homologies. Standard repeat-masking methods fail to eliminate this problem, and they are especially ill-suited to AT-rich DNA such as malaria and slime-mould genomes. We present a new repeat-masking method, TANTAN, which is motivated by the mechanisms that create simple repeats. This method thoroughly eliminates spurious homology predictions for DNA-DNA, protein-protein and DNA-protein comparisons. Moreover, it enables accurate homology search for non-coding DNA with extreme A + T composition.
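
To make the idea of repeat masking concrete, the following is a minimal Python sketch of soft-masking, in which masked letters are lowercased so that a search tool can ignore them. It is not the TANTAN algorithm, whose details are not given here: it only catches exact tandem repeats with a regular expression, and the function name, unit-length limit and copy threshold are illustrative assumptions.

```python
import re

# Assumption for illustration: a "simple repeat" is a unit of 1-6 bases
# occurring at least three times in a row, with no mismatches allowed.
# This is far cruder than TANTAN, trf, DustMasker or SegMasker.
_TANDEM = re.compile(r"([ACGT]{1,6}?)\1{2,}")


def soft_mask_tandem_repeats(seq: str) -> str:
    """Lowercase every exact tandem repeat found in an uppercase DNA string."""
    return _TANDEM.sub(lambda m: m.group(0).lower(), seq.upper())


if __name__ == "__main__":
    print(soft_mask_tandem_repeats("ACGTATATATATGGC"))  # ACGtatatataTGGC
```

An exact-match rule like this misses diverged or compound repeats, which is one reason a masker may leave spurious similarities behind.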

© Copyright Policy - creative-commons

Figure 4: Alignments of reversed sequences after repeat masking. The dashed line is the observed number of alignments, and the solid line is the expected number for random sequences. Alignments between: (A) the C. elegans genome and the reversed P. pacificus genome, after masking both with DustMasker; (B) vertebrate proteins and reversed plant proteins, after masking both with SegMasker; (C) the human genome and the reversed opossum genome, after masking both with trf; (D) the P. falciparum genome and the reversed D. discoideum genome, after masking both with DustMasker.

Mentions: We previously found that trf with newly tuned parameters eliminates spurious DNA alignments quite effectively (3). It is not perfect, however. If we compare the human genome to the reversed opossum genome after masking both with trf, we find significantly more and stronger similarities than expected for random sequences (Figure 4C). This is partly because trf is not designed to find compound repeats (Figure 2C): it looks for repeats where each unit is similar to a consensus sequence (11).
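
The reversed-sequence control described above can be sketched in a few lines of Python. The sketch assumes the reversal is a plain left-to-right flip without complementing, so that the reversed sequence should share no true homology with the other one and any alignments found are spurious. The helper names and the " reversed" suffix added to record names are hypothetical, and the masking and alignment steps themselves are not shown.

```python
from typing import Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA-formatted string."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def reverse_records(text: str) -> str:
    """Reverse each sequence (no complementing) to build a negative control."""
    lines = []
    for name, seq in read_fasta(text):
        lines.append(f">{name} reversed")
        lines.append(seq[::-1])  # plain reversal, not reverse complement
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(reverse_records(">chr_demo\nACGT\nTT\n"), end="")
```

The reversed file would then be masked and aligned against the other genome exactly as a real query would be, and the alignment counts compared with the expectation for random sequences.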

