RegExpBlasting (REB), a Regular Expression Blasting algorithm based on multiply aligned sequences.

Rubino F, Attimonelli M - BMC Bioinformatics (2009)

Bottom Line: The use of REB in PPNEMA updating and in the PPNEMA "characterize your sequence" option demonstrates the power of the method. REB can also rapidly solve other bioinformatics problems in which a new sequence must be added to a pre-existing cluster. The statistical tests carried out here show the flexibility of the method.


Affiliation: Department of Biochemistry and Molecular Biology E, Quagliariello - Bari, 70126, Italy. rubino.francesco@gmail.com

ABSTRACT

Background: One of the most frequent uses of bioinformatics tools is the functional characterization of a newly produced nucleotide sequence (the query sequence) by running BLAST or FASTA against a set of sequences (the subject sequences). In some contexts, however, it is more useful to compare the query sequence against a cluster such as a MultiAlignment (MA). We present here the RegExpBlasting (REB) algorithm, which compares an unclassified sequence with a dataset of patterns obtained by applying Regular Expression rules to an input dataset of MAs. The REB workflow consists of: i. the definition of a dataset of multialignments; ii. the association of each MA with a pattern, defined by applying regular expression rules; iii. the automatic characterization of a submitted biosequence according to the function of the sequences described by the pattern that best matches the query sequence.
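The three workflow steps can be illustrated with a minimal Python sketch. The column-to-regex rule used here (a character class for variable columns, an optional position where a gap occurs) and the "longest match wins" criterion are assumptions made for illustration; the actual REB rules are given in the paper's Methods and are not reproduced in this summary. The function names are hypothetical.

```python
import re

def column_to_regex(column):
    """Turn one alignment column into a regex fragment (assumed rule)."""
    residues = sorted(set(column) - {"-"})
    if not residues:
        return ""  # a column made only of gaps contributes nothing
    fragment = residues[0] if len(residues) == 1 else "[" + "".join(residues) + "]"
    # A gap in the column means the position may be absent in a query.
    return fragment + "?" if "-" in column else fragment

def alignment_to_pattern(aligned_seqs):
    """Step ii: associate a multialignment with one regular expression."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    columns = zip(*(s.upper() for s in aligned_seqs))
    return "".join(column_to_regex(col) for col in columns)

def best_matching_pattern(query, patterns):
    """Step iii: name of the pattern with the longest match in the query, or None."""
    best_name, best_len = None, 0
    for name, pattern in patterns.items():
        for match in re.finditer(pattern, query.upper()):
            if len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
    return best_name

# Step i: a toy dataset of multialignments, keyed by a hypothetical group name.
dataset = {
    "groupA": ["ACGT-A", "ACCTTA"],
    "groupB": ["TTTT", "TTTA"],
}
patterns = {name: alignment_to_pattern(ma) for name, ma in dataset.items()}
```

With this toy dataset, `patterns["groupA"]` is `AC[CG]TT?A`, and a query such as `GGACGTAGG` is assigned to `groupA` because it contains `ACGTA`.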

Results: An application of this algorithm is used in the "characterize your sequence" tool available in the PPNEMA resource. PPNEMA is a resource of Ribosomal Cistron sequences from various species, grouped according to nematode genera. It allows the retrieval of plant nematode multialigned sequences or the classification of new nematode rDNA sequences by applying REB. The same algorithm also supports automatic updating of the PPNEMA database. The present paper gives examples of the use of REB within PPNEMA.

Conclusion: The use of REB in PPNEMA updating and in the PPNEMA "characterize your sequence" option demonstrates the power of the method. REB can also rapidly solve other bioinformatics problems in which a new sequence must be added to a pre-existing cluster. The statistical tests carried out here show the flexibility of the method.


Figure 4: False positive (FP)/True Positive (TP) versus minimal length in the RegExpBlasting application to PPNEMA updating. As the minimal length increases, the number of FP decreases, although the variation is negligible for minimal lengths greater than 60; the minimal length was therefore fixed at 60.
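The kind of tabulation behind such a figure can be sketched as follows. This is only an illustration: the function name is hypothetical, and the input values are invented toy numbers, not data from the paper. Each match is described by the minimal length of the pattern it hit and by whether the annotator check judged it a true positive.

```python
def fp_tp_by_min_length(matches, min_lengths):
    """Count (FP, TP) among matches whose pattern minimal length is >= each threshold.

    matches: iterable of (pattern_min_length, is_true_positive) pairs.
    """
    counts = {}
    for minl in min_lengths:
        kept = [is_tp for length, is_tp in matches if length >= minl]
        tp = sum(kept)          # True counts as 1
        fp = len(kept) - tp
        counts[minl] = (fp, tp)
    return counts

# Invented toy matches, for illustration only.
toy_matches = [(30, False), (40, False), (45, True), (65, True), (70, True), (90, True)]
toy_counts = fp_tp_by_min_length(toy_matches, [20, 40, 60])
```

In this toy input, raising the threshold from 20 to 60 removes both false positives but also discards the one short true positive, which is the trade-off the caption describes.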

Mentions: As already emphasized in the Background, the main problem in bio-database management is keeping the databases up to date. We present here a procedure, based on REB, which allows the PPNEMA database to be updated almost entirely automatically. Figure 3 reports the workflow of the entire process. The protocol starts by searching, through Entrez at NCBI, for the nematode sequences whose entry date is later than the most recent PPNEMA update. The sequences are extracted and grouped by genus. An example of a query is "Anguina [ORGN] AND (18S OR 28S OR 5.8S OR ITS1 OR ITS2) AND 2007/01/01:2008/01/01 [PUBLICATION DATE]". The resulting entries are the nucleotide sequences from the genus Anguina that code for one or more elements of the RNA cistron and were annotated after January 1, 2007 and before January 1, 2008. They are extracted, and a file containing their nucleotide sequences in FASTA format is submitted to "Normal Search", which runs RegExpBlasting of each selected sequence against the PPNEMA MA dataset. Each new sequence is associated with a pool of matching patterns that we call "positives"; unclassified sequences and negative results are marked either as True Negatives (TN), i.e. new group-defining sequences, or as False Negatives (FN). In this first application of the protocol, we carried out many checks at the various steps in order to verify its efficiency as designed and to optimize it. The positives are submitted to an automatic annotator check, which establishes whether the results are true positives (TP) or false positives (FP). TP entries, i.e. entries matching MAs of the same genus and related to various elements of the same cistron, are annotated in PPNEMA. The FP are submitted to an Operator Check, which distinguishes FP due to annotation errors, e.g. erroneous species attribution, from FP due to short sequences conserved among genera.
In the latter case, the same sequence matches several genera; to solve this problem, the minimal length parameter of RegExpBlasting can be modified (see Methods). Increasing the minimal length lowers the probability of random matches, although some significant but short matches may be lost. Figure 4 shows that a minimal length of 60 is a good compromise, yielding as many TP as possible and minimizing the number of FN. This implies excluding from the analysis the dataset patterns whose minimal length is less than 60. This value applies to the 884 phytoparasite sequences extracted. The algorithm allows the minimal length to be changed dynamically, depending on the dataset in question. Once the minimal length is established, the protocol is tested by submitting the FN to the Scan-RegExpBlasting procedure, whose parameters are the window length w, the scanning step s and the minimal length minl. Once minl is fixed at 60, the results with a cut-off (cf) value higher than 0.70 are selected (see Methods).
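The roles of w, s, minl and cf in a scanning procedure can be sketched as below. This is not the Scan-RegExpBlasting implementation: the score used here (the fraction of windows in which a pattern is found) is a stand-in for the cut-off value defined in Methods, which this summary does not reproduce, and the function names and pattern format are hypothetical.

```python
import re

def windows(seq, w, s):
    """Yield successive windows of length w, advancing s positions each time."""
    for start in range(0, max(len(seq) - w, 0) + 1, s):
        yield seq[start:start + w]

def scan(query, patterns, w, s, minl, cf):
    """Return {name: score} for patterns whose score exceeds cf.

    patterns maps a name to (regex, minimal_length). Patterns whose minimal
    length is below minl are excluded, as in the protocol described above.
    The score is an ASSUMED stand-in: the fraction of windows matched.
    """
    wins = list(windows(query.upper(), w, s))
    selected = {}
    for name, (regex, min_length) in patterns.items():
        if min_length < minl:
            continue
        compiled = re.compile(regex)
        hits = sum(1 for win in wins if compiled.search(win))
        score = hits / len(wins)
        if score > cf:
            selected[name] = score
    return selected

# Toy patterns with short lengths so the example stays readable;
# the paper fixes minl at 60 and selects cf values above 0.70.
toy_patterns = {
    "groupA": ("AC[CG]T", 4),
    "groupB": ("TTTT", 4),
    "groupC": ("AC", 2),
}
result = scan("ACGTACGTACGT", toy_patterns, w=4, s=4, minl=4, cf=0.70)
```

Here `groupC` is skipped because its minimal length is below `minl`, `groupB` matches no window, and only `groupA` passes the cut-off.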

