Improving prokaryotic transposable elements identification using a combination of de novo and profile HMM methods.

Kamoun C, Payen T, Hua-Van A, Filée J - BMC Genomics (2013)

Bottom Line: Many of the new elements correspond to ISs belonging to unknown or highly divergent families. The total number of MITEs even doubled with the discovery of elements displaying very limited sequence similarity to their respective autonomous partners (mainly in the Inverted Repeats of the elements). For metagenomic data, profile HMM searches should be favored; a BLAST-based step is only useful for the final annotation into groups and families.


Affiliation: Laboratoire Evolution, Génomes, Spéciation, CNRS UPR9034/Université Paris-Sud, Gif-sur-Yvette, France. jonathan.filee@legs.cnrs-gif.fr.

ABSTRACT

Background: Insertion Sequences (ISs) and their non-autonomous derivatives (MITEs) are important components of prokaryotic genomes that induce duplications, deletions, rearrangements and lateral gene transfers. Although ISs and MITEs are relatively simple genetic elements, their detection remains difficult because of their remarkable sequence diversity. With the advent of high-throughput genome and metagenome sequencing technologies, the development of fast, reliable and sensitive methods for detecting ISs and MITEs has become an important challenge. So far, almost all studies dealing with prokaryotic transposons have used classical BLAST-based detection methods against reference libraries. Here we introduce alternative detection methods that either take advantage of the structural properties of the elements (de novo methods) or add a further library-based method using profile HMM searches.

Results: In this study, we developed three different workflows dedicated to the detection of ISs and MITEs: the first two use de novo methods that detect either repeated sequences or the presence of Inverted Repeats; the third uses 28 in-house transposase alignment profiles with HMM search methods. We compared the performance of each method using a reference dataset of 30 archaeal and 30 bacterial genomes in addition to simulated and real metagenomes. Compared to a BLAST-based method using ISFinder as the library, de novo methods significantly improve the detection of ISs and MITEs. For example, in the 30 archaeal genomes, we discovered 30 new elements (+20%) in addition to the 141 multi-copy elements already detected by the BLAST approach. Many of the new elements correspond to ISs belonging to unknown or highly divergent families. The total number of MITEs even doubled with the discovery of elements displaying very limited sequence similarity to their respective autonomous partners (mainly in the Inverted Repeats of the elements). For metagenomes, with the exception of short-read data (<300 bp), for which both techniques seem equally limited, profile HMM searches considerably improve the detection of transposase-encoding genes (up to +50%) while generating a low level of false positives compared to BLAST-based methods.

Conclusion: Compared to classical BLAST-based methods, the sensitivity of the de novo and profile HMM methods developed in this study allows a better and more reliable detection of transposons in prokaryotic genomes and metagenomes. We believe that future studies involving the identification of ISs and MITEs in genomic data should combine at least one de novo and one library-based method, with optimal results obtained by running the two de novo methods in addition to a library-based search. For metagenomic data, profile HMM searches should be favored; a BLAST-based step is only useful for the final annotation into groups and families.


Figure 2: Simplified workflow of the profile HMM search. The software run at each step is indicated in blue; rectangles represent the output files at each step.

Mentions: Our de novo pipelines used two different approaches to detect ISs and MITEs (Figures 1 and 2). The first one, called “Repeats search” (Figure 1), used the RepeatScout algorithm [15]. RepeatScout detects repeated sequences in a given genome and generates a consensus of these sequences (with default parameters and word size l = 9). The second path, called “IRs search” (Figure 2), used the Palindrome software of the EMBOSS package [16], which identifies the IRs in the input sequences (with the following parameters: minpallen = 10, maxpallen = 50, gaplimit = 2000, nummismatches = 2). All sequences delimited by IR pairs were then extracted. To avoid doubletons, sequences bordered by IRs but present fewer than 2 times were removed. The consensus sequences generated by RepeatScout and the IR-delimited sequences obtained from Palindrome were then clustered using UCLUST with default parameters [17]. At this stage, we had a list of clusters of putative MITEs and ISs. For the putative ISs (sequences larger than 500 bp), we exhaustively compiled all the related sequences (including truncated copies) using BLASTN (E value = 10e-5) [18] against the genome. For the MITE candidates (sequences smaller than 500 bp), we needed to identify the autonomous partners to be sure that these sequences are true transposable elements. We took the terminal 23 bp at each end of the sequence (which correspond roughly to the IRs of the elements) and used them as BLASTN queries (E value = 10e-1) against the genome. This yielded a list of sequences located between the two matching terminal 23 bp. These sequences were filtered to be smaller than 3 kb, and doubletons were removed. Finally, the nested elements (elements inserted in each other like Russian dolls) were separated and reconstructed individually. These potential partners were then searched with BLASTX (E value = 10e-5) against the ISFinder database (04/2011 update) to confirm that these sequences are homologous to bona fide ISs.
At the end of the process, we obtained a list of files with all the putative ISs and a file containing all the MITEs with their corresponding autonomous partners.
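
The triage steps described above (copy-number filtering, the 500 bp split between putative ISs and MITE candidates, and extraction of the terminal 23 bp used as BLASTN queries) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the dictionary-based inputs, and the choice to treat a sequence of exactly 500 bp as a MITE candidate are assumptions made for illustration, and the RepeatScout, Palindrome, UCLUST and BLAST steps themselves are not reproduced here.

```python
# Illustrative sketch only; names and data layout are hypothetical.
IS_MIN_LENGTH = 500    # bp; the text separates ISs (larger) from MITE candidates (smaller)
TERMINAL_LENGTH = 23   # bp taken at each end, roughly matching the IRs
MIN_COPIES = 2         # sequences present fewer than 2 times are discarded


def drop_low_copy(copy_counts, min_copies=MIN_COPIES):
    """Keep only the sequence names observed at least `min_copies` times."""
    return {name for name, count in copy_counts.items() if count >= min_copies}


def triage(clusters):
    """Split cluster representatives into putative ISs and MITE candidates by length.

    Assumption: a sequence of exactly 500 bp is treated as a MITE candidate,
    because the text does not say how that boundary case is handled.
    """
    putative_is, mite_candidates = {}, {}
    for name, seq in clusters.items():
        if len(seq) > IS_MIN_LENGTH:
            putative_is[name] = seq
        else:
            mite_candidates[name] = seq
    return putative_is, mite_candidates


def terminal_probes(seq, n=TERMINAL_LENGTH):
    """Return the first and last `n` bases, to be used as search queries."""
    return seq[:n], seq[-n:]


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


if __name__ == "__main__":
    clusters = {"cluster_1": "A" * 600, "cluster_2": "ACGT" * 20}
    putative_is, mite_candidates = triage(clusters)
    print(sorted(putative_is), sorted(mite_candidates))
    left, right = terminal_probes(clusters["cluster_2"])
    print(left, right, reverse_complement(right))
```

In a real run, the two probes returned by `terminal_probes` would be written to a FASTA file and searched against the genome with BLASTN; `reverse_complement` is included only because the two ends of an element with inverted repeats are expected to resemble each other in reverse-complement orientation.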

