A method for identifying alternative or cryptic donor splice sites within gene and mRNA sequences. Comparisons among sequences from vertebrates, echinoderms and other groups.

Buckley KM, Florea LD, Smith LC - BMC Genomics (2009)

Bottom Line: The Purpuratus model was shown to predict splice signals better than a similar model created from vertebrate sequences. The data presented herein describe the first published analyses of echinoderm splice sites and suggest that the previous methods of identifying splice signals that are based largely on vertebrate sequences may be insufficient. Furthermore, alternative or trans-gene splicing does not appear to be acting as a diversification mechanism in the 185/333 gene family.


Affiliation: The Department of Biological Sciences, Washington University, Washington, DC 20052, USA. kbuckley@gmail.com

ABSTRACT

Background: As the amount of genome sequencing data grows, so does the problem of computational gene identification, and in particular, the splicing signals that flank exon borders. Traditional methods for identifying splicing signals have been created and optimized using sequences from model organisms, mostly vertebrate and yeast species. However, as genome sequencing extends across the animal kingdom and includes various invertebrate species, the need for mechanisms to recognize splice signals in these organisms increases as well. With that aim in mind, we generated a model for identifying donor and acceptor splice sites that was optimized using sequences from the purple sea urchin, Strongylocentrotus purpuratus. This model was then used to assess the possibility of alternative or cryptic splicing within the highly variable immune response gene family known as 185/333.

Results: A donor splice site model was generated from S. purpuratus sequences that incorporates non-adjacent dependences among positions within the 9 nt splice signal and uses position weight matrices to determine the probability that the site is used for splicing. The Purpuratus model was shown to predict splice signals better than a similar model created from vertebrate sequences. Although the Purpuratus model was able to correctly predict the true splice sites within the 185/333 genes, no evidence for alternative or trans-gene splicing was observed.

Conclusion: The data presented herein describe the first published analyses of echinoderm splice sites and suggest that the previous methods of identifying splice signals that are based largely on vertebrate sequences may be insufficient. Furthermore, alternative or trans-gene splicing does not appear to be acting as a diversification mechanism in the 185/333 gene family.
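As a rough illustration of the position weight matrix idea mentioned in the Results, the sketch below builds a matrix from aligned 9 nt sites and scores a candidate as a sum of log-odds values. It is a simplification under stated assumptions: it treats positions as independent (it does not model the non-adjacent dependences the Purpuratus model incorporates), it assumes a uniform background of 0.25 per base and a pseudocount of 1, and the function names and toy sequences are hypothetical rather than taken from the paper.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Return per-position base probabilities from equal-length aligned sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so no probability is zero.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in BASES})
    return pwm

def log_odds_score(pwm, candidate, background=0.25):
    """Sum log2(p / background) over positions (uniform background assumed)."""
    return sum(
        math.log2(pwm[i][base] / background)
        for i, base in enumerate(candidate)
    )

# Made-up 9 nt donor-like sites for illustration only; not data from the paper.
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]
pwm = build_pwm(toy_sites)

consensus_like = log_odds_score(pwm, "CAGGTAAGT")
unrelated = log_odds_score(pwm, "TTTCCCAAA")
print(consensus_like > unrelated)
```

A candidate that resembles the training sites receives a higher score than an unrelated 9-mer, which is the property that a downstream score threshold relies on.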



Figure 2: Analysis of known positive and negative splice sites using the Purpuratus and Vertebrate splice site models. Histograms of the scores given to known positive (solid lines) and negative (dashed lines) splice sites were generated (bin size = 2) for the Purpuratus (A) and Vertebrate (B) splice site models by analyzing the genes used to generate the models (Additional files 2, 3, and 4; [28]). For example, 22% of the known positive sites received scores between 4 and 6. The average of the means (Table 3) is shown by a vertical dotted line. The gray region is bounded by N0.95 and P0.05 (Table 3), which flank its left and right sides, respectively, and are shown as dashed/dotted lines. The ✳ symbols located on the 0.25% line indicate the means of the positive and negative scores.

Mentions: The performance of each model was assessed using six measures (Equations 2–7; see Methods) that required binary classification of the sites (either positive or negative). Therefore, it was necessary to define a threshold for each model that separated the positive and negative scores. Histograms of scores from known positive and negative sites were created from the sequences that were used to generate the models (Figure 2). In a perfect model, there would be no overlap between the scores assigned to the positive and negative sites, because the positive sites would all receive higher scores than the negative sites. This was not possible, however, due to the degeneracy of the splicing machinery and because certain motifs may act as donor splice sites in some sequences but are not functional in others. As a result, the scores for the positive and negative sites overlap. The ideal threshold would maximize the number of positive sites classified as positive (true positives; TP) and minimize the number of negative sites incorrectly labeled as positive (false positives; FP). A number of thresholds were analyzed, including the average of all the scores, the score at which 95% of the positive scores were considered positive (P0.05), and the score at which 95% of the negative scores were considered negative (N0.95; Table 3). The best threshold was the average of the means of the positive and negative scores (Figure 2) because it provided the most favorable balance between the number of true positives and false positives, and it was therefore used in the subsequent evaluations of the models.
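The chosen threshold, the average of the means of the positive and negative scores, can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names and toy scores are hypothetical, and counting a site as positive when its score is greater than or equal to the threshold is an assumption, since the text does not state how ties are handled.

```python
from statistics import mean

def mean_of_means_threshold(pos_scores, neg_scores):
    """Average of the positive-class mean and the negative-class mean."""
    return (mean(pos_scores) + mean(neg_scores)) / 2

def count_tp_fp(pos_scores, neg_scores, threshold):
    """Count true and false positives; score >= threshold is an assumed rule."""
    tp = sum(1 for score in pos_scores if score >= threshold)
    fp = sum(1 for score in neg_scores if score >= threshold)
    return tp, fp

# Made-up scores for illustration only; not values from Table 3.
toy_pos = [8.0, 6.0, 5.0, 9.0, 2.0]
toy_neg = [-4.0, -6.0, 1.0, -3.0, 4.0, -10.0]

threshold = mean_of_means_threshold(toy_pos, toy_neg)
tp, fp = count_tp_fp(toy_pos, toy_neg, threshold)
print(threshold, tp, fp)
```

Unlike the average of all the scores, this threshold does not shift when one class contains more sites than the other, because each class contributes only its mean.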

