RAPSearch: a fast protein similarity search tool for short reads.

Ye Y, Choi JH, Tang H - BMC Bioinformatics (2011)

Bottom Line: For short reads (translated in 6 frames) we tested, RAPSearch achieved ~20-90 times speedup as compared to BLASTX. RAPSearch missed only a small fraction (~1.3-3.2%) of BLASTX similarity hits, but it also discovered additional homologous proteins (~0.3-2.1%) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, had significant loss of sensitivity as compared to RAPSearch and BLAST.


Affiliation: School of Informatics and Computing, Indiana University, Bloomington, IN 47408, USA. yye@indiana.edu

ABSTRACT

Background: Next Generation Sequencing (NGS) is producing enormous corpora of short DNA reads, affecting emerging fields like metagenomics. Protein similarity search, a key step in annotating protein-coding genes in these short reads and identifying their biological functions, faces daunting challenges because of the sheer size of the short read datasets.

Results: We developed a fast protein similarity search tool RAPSearch that utilizes a reduced amino acid alphabet and suffix array to detect seeds of flexible length. For short reads (translated in 6 frames) we tested, RAPSearch achieved ~20-90 times speedup as compared to BLASTX. RAPSearch missed only a small fraction (~1.3-3.2%) of BLASTX similarity hits, but it also discovered additional homologous proteins (~0.3-2.1%) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, had significant loss of sensitivity as compared to RAPSearch and BLAST.
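The seeding idea described above can be illustrated with a minimal sketch. It is not RAPSearch's implementation: the 10-group amino acid alphabet in `GROUPS` is a hypothetical grouping chosen for illustration, the suffix array is built naively, and the names `reduce_seq`, `build_suffix_array`, `find_seeds` and the `min_len` parameter are all assumptions. The sketch maps both sequences to the reduced alphabet, looks up each query position in a suffix array of the subject, and extends every match as far as the reduced letters agree, so seeds have variable length.

```python
# Hypothetical 10-group reduced alphabet; illustrative only, NOT the
# grouping used by RAPSearch.
GROUPS = ["ILMV", "FWY", "KR", "DE", "NQ", "ST", "AG", "C", "H", "P"]
# Each residue is represented by the first letter of its group.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_seq(seq):
    """Map a protein sequence to the reduced alphabet ('X' for unknowns)."""
    return "".join(REDUCE.get(aa, "X") for aa in seq)


def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (fine for toy input)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_seeds(query, subject, min_len=4):
    """Return (query_pos, subject_pos, length) seeds in reduced-alphabet space."""
    rq, rs = reduce_seq(query), reduce_seq(subject)
    sa = build_suffix_array(rs)
    seeds = []
    for qpos in range(len(rq) - min_len + 1):
        kmer = rq[qpos:qpos + min_len]
        # Binary search for the first suffix whose prefix is >= kmer.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if rs[sa[mid]:sa[mid] + min_len] < kmer:
                lo = mid + 1
            else:
                hi = mid
        # Walk over all suffixes that start with kmer and extend each match.
        while lo < len(sa) and rs[sa[lo]:sa[lo] + min_len] == kmer:
            spos = sa[lo]
            length = min_len
            while (qpos + length < len(rq) and spos + length < len(rs)
                   and rq[qpos + length] == rs[spos + length]):
                length += 1
            seeds.append((qpos, spos, length))
            lo += 1
    return seeds


if __name__ == "__main__":
    # The two toy sequences share no identical 4-mer, but they do share a
    # seed once both are written in the reduced alphabet.
    print(find_seeds("MKVLDE", "GGLRIMEEGG"))
```

In this toy case the query and subject differ at most positions as plain amino acid strings, yet the reduced forms match over the whole query, which is the kind of seed an exact-match k-mer index would not report.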

Conclusions: RAPSearch is implemented as open-source software and is accessible at http://omics.informatics.indiana.edu/mg/RAPSearch. It enables faster protein similarity search. The application of RAPSearch in metagenomics has also been demonstrated.


Figure 3: Comparison of the similarity search sensitivity at different E-value cutoffs. RAPSearch, BLAST (and BLAST+, using default parameters), and (protein) BLAT were compared on the same query dataset (4440037). The total number of queries that have at least one homolog in the IMG protein sequence database (based on the corresponding E-value cutoff) was used in (A), whereas the total number of all significant hits (up to 100 hits per query) was used in (B). Note that BLAST and BLAST+ have almost identical sensitivity (but BLAST+ is twice as slow). For this dataset search, BLAST, RAPSearch and BLAT used 154, 3.5 and 2.7 CPU hours (on Intel Xeon 2.93 GHz), respectively.
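As a rough check, the speedups implied for this one dataset follow directly from the CPU hours quoted in the caption (the symbol T for CPU time is notation introduced here, not the authors'):

```latex
\[
\frac{T_{\mathrm{BLAST}}}{T_{\mathrm{RAPSearch}}} = \frac{154}{3.5} = 44,
\qquad
\frac{T_{\mathrm{BLAST}}}{T_{\mathrm{BLAT}}} = \frac{154}{2.7} \approx 57
\]
```

The 44-fold figure for RAPSearch falls inside the ~20-90 times range reported in the abstract.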

Mentions: The detailed comparison of the performance of BLAST (and BLAST+), BLAT and RAPSearch on one query dataset is shown in Figure 3. (See Supplementary Figures 2 and 3 in Additional File 1 for detailed comparisons for the TS28 and TS50 datasets.) RAPSearch tends to miss some distant similarities, but better captures closely related proteins. Under stringent E-value cutoffs (e.g., 1e-3 or 1e-5, as used in most metagenomic studies [7]), RAPSearch has minimal loss of sensitivity as compared to BLAST. By contrast, BLAT tends to miss more similarity hits (Figure 3). Note that the difference at the query level (e.g., how many queries have significant hits, as seen in Figure 3A) is smaller than the difference at the level of individual hits (Figure 3B). We also tested the performance of RAPSearch as compared to BLAST when searching against different protein databases, and the results showed consistent speedup by RAPSearch (see Table 3). We examined some of the similarities that are missed by RAPSearch; they are usually due to the lack of proper seeds between the query and the subject protein sequences. Interestingly, RAPSearch also detected some homologous proteins that are missed by BLAST search. There is no obvious difference in significance (measured by E-value) between the hits detected only by RAPSearch and those detected only by BLAST (see Supplementary Figure 4 in Additional File 1). An example of similarity detected only by RAPSearch is shown in Supplementary Figure 5 (Additional File 1).
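The two sensitivity measures contrasted above (queries with at least one significant hit, versus all significant hits capped at 100 per query) can be sketched as follows. This is only an illustration under stated assumptions: the input layout (a dict from query ID to a list of `(subject, evalue)` pairs), the function names, and the toy numbers are all made up here and are not taken from the paper or from any tool's output format.

```python
def count_at_cutoff(hits, cutoff, max_hits=100):
    """Return (queries with >=1 significant hit, total significant hits).

    `hits` maps query ID -> list of (subject ID, E-value); at most
    `max_hits` hits are counted per query, mirroring the cap in Figure 3B.
    """
    n_queries = 0
    n_hits = 0
    for subject_hits in hits.values():
        significant = [e for _, e in subject_hits if e <= cutoff][:max_hits]
        if significant:
            n_queries += 1
            n_hits += len(significant)
    return n_queries, n_hits


def relative_sensitivity(tool_hits, reference_hits, cutoff):
    """Fraction of the reference's queries and hits that the tool recovers."""
    tool_q, tool_h = count_at_cutoff(tool_hits, cutoff)
    ref_q, ref_h = count_at_cutoff(reference_hits, cutoff)
    if ref_q == 0 or ref_h == 0:
        raise ValueError("reference has no significant hits at this cutoff")
    return tool_q / ref_q, tool_h / ref_h


if __name__ == "__main__":
    # Made-up toy data: the tool finds every query's best hit but misses
    # one weaker hit, so it loses nothing at the query level.
    reference = {"read1": [("p1", 1e-10), ("p2", 1e-4)], "read2": [("p3", 1e-6)]}
    tool = {"read1": [("p1", 1e-10)], "read2": [("p3", 1e-6)]}
    print(relative_sensitivity(tool, reference, cutoff=1e-3))
```

In the toy data the query-level ratio is 1.0 while the hit-level ratio is 2/3, which mirrors why the gap in Figure 3A can be smaller than the gap in Figure 3B.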

