FastBLAST: homology relationships for millions of proteins.

Price MN, Dehal PS, Arkin AP - PLoS ONE (2008)

Bottom Line: Once the first stage is completed, FastBLAST identifies homologs for the average query in less than 5 seconds (8.6 times faster than BLAST) and gives nearly identical results. For hits above 70 bits, FastBLAST identifies 98% of the top 3,250 hits per query. FastBLAST enables research groups that do not have supercomputers to analyze large protein sequence data sets.

Affiliation: Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA. morgannprice@yahoo.com

ABSTRACT

Background: All-versus-all BLAST, which searches for homologous pairs of sequences in a database of proteins, is used to identify potential orthologs, to find new protein families, and to provide rapid access to these homology relationships. As DNA sequencing accelerates and data sets grow, all-versus-all BLAST has become computationally demanding.

Methodology/principal findings: We present FastBLAST, a heuristic replacement for all-versus-all BLAST that relies on alignments of proteins to known families, obtained from tools such as PSI-BLAST and HMMer. FastBLAST avoids most of the work of all-versus-all BLAST by taking advantage of these alignments and by clustering similar sequences. FastBLAST runs in two stages: the first stage identifies additional families and aligns them, and the second stage quickly identifies the homologs of a query sequence, based on the alignments of the families, before generating pairwise alignments. On 6.53 million proteins from the non-redundant Genbank database ("NR"), FastBLAST identifies new families 25 times faster than all-versus-all BLAST. Once the first stage is completed, FastBLAST identifies homologs for the average query in less than 5 seconds (8.6 times faster than BLAST) and gives nearly identical results. For hits above 70 bits, FastBLAST identifies 98% of the top 3,250 hits per query.
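The idea behind the second stage, finding a query's likely homologs through the families it belongs to rather than by scanning the whole database, can be illustrated with a toy lookup. This is only a sketch under simplifying assumptions and is not FastBLAST's implementation: the function names, the `memberships` dictionary, and the protein and family identifiers are all hypothetical, and the steps that use the family alignments to rank candidates and then generate pairwise alignments are omitted.

```python
from collections import defaultdict


def build_family_index(memberships):
    """Invert a protein -> families mapping into family -> proteins."""
    index = defaultdict(set)
    for protein, families in memberships.items():
        for family in families:
            index[family].add(protein)
    return index


def candidate_homologs(query, memberships, index):
    """Collect every protein that shares at least one family with the query."""
    candidates = set()
    for family in memberships.get(query, ()):
        candidates |= index[family]
    candidates.discard(query)  # a protein is not its own candidate
    return candidates


# Hypothetical toy data: four proteins assigned to three families.
memberships = {
    "protA": {"fam1"},
    "protB": {"fam1", "fam2"},
    "protC": {"fam2"},
    "protD": {"fam3"},
}
index = build_family_index(memberships)
print(sorted(candidate_homologs("protB", memberships, index)))
```

In this toy version only the candidates sharing a family with the query would need a pairwise alignment, which is the source of the savings over comparing the query against every sequence.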

Conclusions/significance: FastBLAST enables research groups that do not have supercomputers to analyze large protein sequence data sets. FastBLAST is open source software and is available at http://microbesonline.org/fastblast.

Figure 2 (pone-0003589-g002): FastBLAST misses mostly low-ranking hits and/or weak hits. We show the cumulative proportion of queries that have a miss within the top n hits. Note the log scale on the x axis. The highest proportion is 10.8% because FastBLAST identified all of the top 3,250 homologs at 70 bits or greater for the other 89.2% of queries. We also show results if only higher-scoring hits are considered.

Mentions: To test the second stage of FastBLAST, we used both FastBLAST and BLAST to identify the top 3,250 hits for 2,000 randomly selected members of NR. (3,250 is 1/2,000 of the genes in NR.) BLAST took 40.8 seconds per query, while FastBLAST took 4.74 seconds per query, making it 8.6 times faster. We believe that this is fast enough for interactive use (instead of pre-computing BLAST hits for every query). Among hits with scores of at least 70 bits, FastBLAST found 97.9% of the hits that BLAST found. As shown in Figure 2, FastBLAST correctly identified the top hit for every query (if the query had any homologs) and identified all 3,250 top homologs for all but 10.8% of the queries. For most of the remaining queries, the missed hits are weak or far down in the list. Thus, we doubt that the missed hits would be orthologs or would be useful for annotating the query's function.
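The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative. Note that dividing the rounded database size of 6.53 million by 2,000 gives about 3,265, so the 3,250 cutoff is an approximate 1/2,000 of NR.

```python
# Values quoted in the text.
blast_seconds = 40.8          # BLAST time per query
fastblast_seconds = 4.74      # FastBLAST time per query
nr_proteins = 6_530_000       # size of NR (rounded, from the abstract)
incomplete_percent = 10.8     # queries with at least one missed hit

speedup = blast_seconds / fastblast_seconds
one_in_2000 = nr_proteins / 2_000
complete_percent = round(100 - incomplete_percent, 1)

print(f"speedup: {speedup:.1f}x")
print(f"1/2,000 of NR: {one_in_2000:,.0f} proteins")
print(f"queries with all top hits found: {complete_percent}%")
```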

