Comparison of computational methods for identifying translation initiation sites in EST data.

Nadershahi A, Fahrenkrug SC, Ellis LB - BMC Bioinformatics (2004)

Bottom Line: Expressed Sequence Tag (EST) sequences are generally single-strand, single-pass sequences only 200-600 nucleotides long; they contain errors that result in frame shifts and represent different parts of their parent cDNA. We have compared five methods to predict translation initiation sites in EST data: first-ATG, ESTScan, Diogenes, NetStart, and ATGpr. These tools and materials may open an avenue for future improvements in start site prediction and EST analysis.


Affiliation: College of Biological Science, University of Minnesota, St. Paul, MN 55108, USA. nade0043@umn.edu

ABSTRACT

Background: Expressed Sequence Tag (EST) sequences are generally single-strand, single-pass sequences only 200-600 nucleotides long; they contain errors that result in frame shifts and represent different parts of their parent cDNA. If the cDNAs contain translation initiation sites, they may be suitable for functional genomics studies. We have compared five methods to predict translation initiation sites in EST data: first-ATG, ESTScan, Diogenes, NetStart, and ATGpr.

Results: A dataset of 100 EST sequences, 50 with and 50 without translation initiation sites, was created. Based on analysis of this dataset, ATGpr is found to be the most accurate for predicting the presence versus absence of translation initiation sites. With a maximum accuracy of 76%, ATGpr more accurately predicts the position or absence of translation initiation sites than NetStart (57%) or Diogenes (50%). ATGpr similarly excels when start sites are known to be present (90%), whereas NetStart achieves only 60% overall accuracy. As a baseline for comparison, choosing the first ATG correctly identifies the translation initiation site in 74% of the sequences. ESTScan and Diogenes, consistent with their intended use, are able to identify open reading frames, but are unable to determine the precise position of translation initiation sites.

Conclusions: ATGpr demonstrates high sensitivity, specificity, and overall accuracy in identifying start sites while also rejecting incomplete sequences. A database of EST sequences suitable for validating programs for translation initiation site prediction is now available. These tools and materials may open an avenue for future improvements in start site prediction and EST analysis.
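
The three measures named in the conclusions can be illustrated with a minimal sketch. It assumes the common definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), accuracy = (TP + TN) / total); the excerpt does not state which definitions the authors used, so these may differ from theirs. The function name `evaluate_presence_calls` and the toy labels are hypothetical and are not taken from the study's dataset.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Metrics:
    sensitivity: float
    specificity: float
    accuracy: float


def evaluate_presence_calls(truth: Sequence[bool], predicted: Sequence[bool]) -> Metrics:
    """Score presence/absence calls (True = sequence contains a start site).

    Assumes the common definitions of sensitivity and specificity; the
    paper's own definitions are not given in this excerpt.
    """
    if len(truth) != len(predicted) or not truth:
        raise ValueError("truth and predicted must be non-empty and equal in length")

    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # start site present, called present
    tn = sum(1 for t, p in pairs if not t and not p)  # absent, called absent
    fp = sum(1 for t, p in pairs if not t and p)      # absent, called present
    fn = sum(1 for t, p in pairs if t and not p)      # present, called absent

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / len(pairs)
    return Metrics(sensitivity, specificity, accuracy)


# Toy labels for illustration only (not the study's data).
toy_truth = [True, True, True, False, False, False]
toy_calls = [True, True, False, False, False, True]
toy_metrics = evaluate_presence_calls(toy_truth, toy_calls)
```

On a balanced set such as the one described here (50 sequences with and 50 without start sites), accuracy under these definitions is the mean of sensitivity and specificity.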



Figure 5: Percentage of true initiation sites correctly identified. The overall percent accuracy of each program over the 50 sequences that contain true start sites is shown. No cutoffs are used; the highest-scoring prediction for each sequence is considered for each method.

Mentions: The overall percent accuracy of each program over the 50 sequences that contain true TISs is shown in Figure 5. No thresholds are used; for each method, only the highest-scoring prediction for each sequence is considered. Simply choosing the first ATG correctly identifies the TIS in 74% of the sequences that contain true TISs. This decrease from the theoretical 90% accuracy of this method (explained above) is most likely due to the frequent incompleteness of EST sequences. ATGpr performs extremely well on this limited dataset, correctly identifying the translation initiation site in 90% of the sequences. Surprisingly, NetStart is less accurate (60% correctly predicted) than the first-ATG method. Again, as expected, the accuracies of Diogenes and ESTScan are extremely low: 0% and 2% correctly predicted, respectively.
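
The first-ATG baseline and the percent-correct measure described above can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the sequence is scanned 5' to 3' on the given strand only, that positions are 0-based, and that a prediction counts as correct only when it matches the annotated start position exactly. The helper names `first_atg` and `percent_correct` and the toy records are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def first_atg(sequence: str) -> Optional[int]:
    """Return the 0-based index of the first ATG on the given strand, or None."""
    index = sequence.upper().find("ATG")
    return index if index != -1 else None


def percent_correct(
    records: List[Tuple[str, int]],
    predictor: Callable[[str], Optional[int]],
) -> float:
    """Percentage of records whose predicted start equals the annotated start.

    Each record is (sequence, annotated 0-based start position). Every record
    is assumed to contain a true start site, as in the 50-sequence subset.
    """
    if not records:
        raise ValueError("records must not be empty")
    hits = sum(1 for seq, true_start in records if predictor(seq) == true_start)
    return 100.0 * hits / len(records)


# Toy sequences for illustration only (not from the study's dataset).
toy_records = [
    ("GGCATGGCC", 3),   # first ATG is the annotated start
    ("ATGAAATGCC", 5),  # an upstream ATG precedes the annotated start
    ("ccatgg", 2),      # lower-case input is handled
    ("TTATGC", 2),
]
toy_percent = percent_correct(toy_records, first_atg)
```

The second toy record shows the failure mode of this baseline: any ATG upstream of the true start, in any frame, is chosen instead. Passing the predictor as a function lets the same scoring loop be reused for any method that returns a single top-ranked position.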

