An optimized procedure greatly improves EST vector contamination removal.

Chen YA, Lin CC, Wang CD, Wu HB, Hwang PI - BMC Genomics (2007)

Bottom Line: The modified trimming procedure for SeqClean was also compared with the trimming efficiency of two other popular programs, LUCY2 and cross_match. Vector contamination in dbEST was also investigated in this study: 2203 out of the 48212 ESTs sampled from dbEST (2007-04-18 freeze) were found to match sequences in UNIVEC. Based on the results presented here, we feel that our modified procedure with SeqClean should be recommended to all researchers for the task of vector removal from EST or genomic sequences.


Affiliation: Bioinformatics Core Laboratory, Agricultural Biotechnology Research Center, Academia Sinica, Taipei, Taiwan. chenyian@gate.sinica.edu.tw

ABSTRACT

Background: The enormous amount of sequence data available in public-domain databases has been a gold mine for researchers exploring various themes in the life sciences, and hence the quality of such data is of serious concern to researchers. Removal of vector contamination is one of the most significant operations in obtaining accurate sequence data, containing only a cDNA insert, from the basecalls output by an automatic DNA sequencer. Popular bioinformatics programs for vector trimming include LUCY, cross_match and SeqClean.

Results: In a recent study, we used the program SeqClean to remove vector contamination from our test set of EST data compiled through various library construction systems; however, a significant number of errors remained after preliminary trimming. These errors were later almost completely corrected by simply using a re-linearized form of the cloning vector to compare against the target ESTs. The modified trimming procedure for SeqClean was also compared with the trimming efficiency of two other popular programs, LUCY2 and cross_match. SeqClean with a re-linearized form of the cloning vector significantly surpassed the other two programs in all tested conditions, while the performance of the other two programs was not influenced by the modified procedure. Vector contamination in dbEST was also investigated in this study: 2203 out of the 48212 ESTs (about 4.6%) sampled from dbEST (2007-04-18 freeze) were found to match sequences in UNIVEC.
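
The re-linearization step mentioned above can be illustrated with a minimal sketch. It assumes the cloning vector is a circular sequence stored in its conventional linear form, and that "re-linearizing" means opening the circle at the cloning site instead of at the conventional base 1. The function name `relinearize`, the parameter `cut_after`, and the toy sequence are hypothetical and are not taken from the paper or from SeqClean.

```python
def relinearize(vector_seq: str, cut_after: int) -> str:
    """Rotate a circular vector so that it opens right after `cut_after`.

    `cut_after` is a 1-based position in the conventional linear form
    (assumed here to be the cloning site). The returned sequence starts
    at the base following the cut and ends at the cut itself, so the old
    junction around base 1 now lies in the interior of the sequence.
    """
    n = len(vector_seq)
    if not 1 <= cut_after <= n:
        raise ValueError("cut_after must lie within the vector sequence")
    return vector_seq[cut_after:] + vector_seq[:cut_after]


# Toy example (made-up sequence): a stretch that spans the conventional
# origin is split in the original form but contiguous after rotation.
toy_vector = "AAACCCGGGTTT"
rotated = relinearize(toy_vector, 6)
print(rotated)
```

With the rotated form, a vector-derived stretch in an EST that crosses the conventional base 1 can align as one continuous match rather than as two separate pieces.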

Conclusion: Vector contamination remains a serious concern to the data quality in the public sequence database nowadays. Based on the results presented here, we feel that our modified procedure with SeqClean should be recommended to all researchers for the task of vector removal from EST or genomic sequences.

Figure 1: Illustration of some trimming details. The shaded area highlights the range covering 30% from either end of the EST. According to the original SeqClean design, the vector contaminant is recognized only if some or all of the similar vector sequence is identified within this range. The boxes in blue indicate the vector-derived sequence. The yellow open boxes represent cDNA inserts and the green bars show the low quality regions. The small stars indicate where the number 1 base is located by CVS coordinates. The boxes in red specify the product of SeqClean trimming. Comments for each of the three listed trimming situations are given to their right. Condition A indicates those ESTs which were mistakenly trashed. Condition B shows incomplete trimming and condition C is an example of correct trimming. Example ESTs corresponding to each of the three conditions are shown in the table below, where the position numbering follows the coordinates of the untrimmed EST sequences.
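
The 30% terminal-range rule described in the caption can be expressed as a small check. This is only a sketch of the rule as worded here, not SeqClean's actual code: the function name `in_terminal_window`, the 1-based inclusive coordinates, and the integer rounding of the window edge are assumptions.

```python
def in_terminal_window(hit_start: int, hit_end: int, est_len: int,
                       window_pct: int = 30) -> bool:
    """Return True if a vector hit overlaps either terminal window.

    Coordinates are 1-based and inclusive on the untrimmed EST. The
    window covers the first and last `window_pct` percent of the EST.
    """
    if not 1 <= hit_start <= hit_end <= est_len:
        raise ValueError("hit coordinates must lie within the EST")
    edge = est_len * window_pct // 100       # window size in bases
    overlaps_5p = hit_start <= edge           # touches the 5' window
    overlaps_3p = hit_end > est_len - edge    # touches the 3' window
    return overlaps_5p or overlaps_3p


# For a 1000-base EST the windows are bases 1-300 and 701-1000.
print(in_terminal_window(280, 420, 1000))   # partly inside the 5' window
print(in_terminal_window(350, 420, 1000))   # entirely in the interior
```

Under this reading, a vector-like match that lies wholly in the interior of the EST is not recognized as a contaminant.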

Mentions: We performed vector trimming (using the SeqClean program downloaded from TIGR) to clean up thousands of EST sequences derived by automatic sequencing of randomly picked clones from tomato cDNA libraries prepared at our facility (manuscript in preparation). After initial SeqClean trimming, some sequences, when examined with BLAST against the nucleotide sequence of their corresponding cloning vectors at scores above 50, were still found to contain a significant amount of unwanted vector sequence within the trimmed segments. The problem was investigated further by closely examining some of the imperfectly trimmed ESTs. Representative examples are shown in Figure 1. In condition A in Figure 1, SeqClean overlooked two consecutive DNA segments in an EST, a vector fragment and a cDNA insert, located within the 30% range of the 5' end. The entire EST sequence was therefore considered to contain no insert and was trashed by the program, even though the cDNA insert was longer than 100 bases. For other EST sequences, such as in condition B, the entire vector fragment, extending from the base numbered 1 in its conventional expression to the ligation site, was mistakenly recognized as part of the cDNA insert by the SeqClean program and remained untrimmed after SeqClean processing.
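
The post-trimming check described above (BLAST of trimmed ESTs against their cloning vector, keeping hits with scores above 50) might be sketched as follows. This assumes BLAST results in the standard 12-column tabular format, with the bit score in the last column, and treats "score" as bit score; the excerpt does not say which score or output format the authors used. The function name `residual_vector_hits` and the sample lines are made up for illustration.

```python
def residual_vector_hits(blast_tab_lines, min_score=50.0):
    """Collect vector hits scoring above `min_score` from tabular BLAST output.

    Each line is expected to hold the 12 standard tab-separated columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore. Returns (query id, query start, query end, score).
    """
    hits = []
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue                      # skip malformed rows
        score = float(fields[11])
        if score > min_score:
            hits.append((fields[0], int(fields[6]), int(fields[7]), score))
    return hits


# Made-up rows: one strong residual vector match and one weak match.
sample = [
    "EST001\tpVEC\t100.00\t60\t0\t0\t1\t60\t2900\t2959\t1e-28\t111",
    "EST002\tpVEC\t95.00\t20\t1\t0\t5\t24\t10\t29\t0.5\t30.1",
]
for hit in residual_vector_hits(sample):
    print(hit)
```

Any EST reported by such a filter would be a candidate for the closer manual inspection the authors describe.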

