Limits...
Towards realistic benchmarks for multiple alignments of non-coding sequences.

Kim J, Sinha S - BMC Bioinformatics (2010)

Bottom Line: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools.We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values.Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

ABSTRACT

Background: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequences. However, it is not clear if such benchmarks truly represent real sequence data from any given group of species, in terms of the difficulty of alignment tasks.

Results: We find that the conventional simulation approach, which relies on empirically estimated values for various parameters such as substitution rate or insertion/deletion rates, is unable to generate synthetic sequences reflecting the broad genomic variation in conservation levels. We tackle this problem with a new method for simulating non-coding sequence evolution, by relying on genome-wide distributions of evolutionary parameters rather than their averages. We then generate synthetic data sets to mimic orthologous sequences from the Drosophila group of species, and show that these data sets truly represent the variability observed in genomic data in terms of the difficulty of the alignment task. This allows us to make significant progress towards estimating the alignment accuracy of current tools in an absolute sense, going beyond only a relative assessment of different tools. We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values. Interestingly, the performance of most tools degrades more rapidly when there are more insertions than deletions in the data set, suggesting an asymmetric handling of insertions and deletions, even though none of the evaluated tools explicitly distinguishes these two types of events. We also examine the accuracy of two existing tools for annotating insertions versus deletions, and find their performance to be close to optimal in Drosophila non-coding sequences if provided with the true alignments.

Conclusion: We have developed a method to generate benchmarks for multiple alignments of Drosophila non-coding sequences, and shown it to be more realistic than traditional benchmarks. Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.

Show MeSH
Distributions of alignment quality scores of data sets representing D. melanogaster - D. pseudoobscura pair from real genomes, Pollard et al. [21], and our benchmark. The collected data sets from each of the three sources were aligned by Pecan [33] and then their alignment quality scores were calculated by HoT SPS [27] method.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2823711&req=5

Figure 5: Distributions of alignment quality scores of data sets representing D. melanogaster - D. pseudoobscura pair from real genomes, Pollard et al. [21], and our benchmark. The collected data sets from each of the three sources were aligned by Pecan [33] and then their alignment quality scores were calculated by HoT SPS [27] method.

Mentions: We found a substantial disagreement between our performance estimates and those previously reported by Pollard et al. [21] using their own benchmark. For instance, the alignment sensitivity for the D. melanogaster - D. pseudoobscura pair comes out to be ~70% in our assessment and ~40% by their estimates, using the Mlagan alignment tool. We observe such gaps (with higher numbers in our benchmark) also for alignment specificity, and for other species pairs and alignment programs as well (Additional files 7 and 8). (We confirmed this by evaluating the alignment programs ourselves on the Pollard et al. [21] benchmark, see Methods.) While this discordance could be in part due to the fact that our benchmark employs a spectrum of parameter values to achieve greater and more realistic variability, we believe the major difference here is that even the average substitution rate, a key parameter in both simulation programs, is widely different between their study and ours. The estimate used by Pollard et al. [21] (~2.4 substitutions per site) is based on silent positions in codons, while our estimate (~0.38 substitutions per site) reflects the average subsitution frequency (between these species) seen in non-coding sequences. In light of the results of Figure 2D, where we show that our benchmark accurately mirrors the range of alignment difficulty in real data, the use of non-coding sequences in estimating this key parameter seems better justified. We investigated this issue with additional tests. We collected data sets representing the D. melanogaster - D. pseudoobscura pair from Pollard et al. [21], as well as from our benchmark and the real genomes. The alignment quality score (HoT SPS) distributions were computed for each type of benchmark, and are shown in Figure 5. We observed a close agreement between our data sets and the real orthologous sequences, while the Pollard et al. [21] data sets were harder to align on average, consistent with the greater substitution rate used there. As noted in Methods, the overall substitution frequency observed in non-coding sequences may be viewed as an average of the corresponding frequency in conserved blocks and the much higher frequency outside conserved blocks. This average is determined by two key parameters α, the fraction of sequence length that falls into conserved blocks, and β, the ratio of the evolutionary rate of conserved blocks to that outside blocks. Given that the divergence estimate used by Pollard et al. [21] for these two species is ~2.24 (median) substitutions per site, if we are to treat this value as the neutral rate (i.e., rate outside conserved blocks) in non-coding sequences, what values of α and β would lead to the observed overall substitution frequency of 0.38? We determined that if β = 0.1, as was used by Pollard et al. [21] (and also by us), α has to be ~0.92, i.e., about 92% of non-coding sequences have to be conserved blocks, which is far higher than most current estimates of this parameter [37,42]. Similarly, if we are to trust the values of α = 0.2 and β = 0.1, as was used by Pollard et al. [21] (and also by us, based on estimates from real data), then the overall divergence, after averaging between conserved blocks and non-blocks, would be ~1.84 substitutions per site, far greater than what is observed (0.38). We therefore concluded that the use of synonymous substitution rates as the neutral rate for non-coding sequence is likely to lead to benchmarks with overly "diverged" sequences that are more difficult to align than real sequences from those species.


Towards realistic benchmarks for multiple alignments of non-coding sequences.

Kim J, Sinha S - BMC Bioinformatics (2010)

Distributions of alignment quality scores of data sets representing D. melanogaster - D. pseudoobscura pair from real genomes, Pollard et al. [21], and our benchmark. The collected data sets from each of the three sources were aligned by Pecan [33] and then their alignment quality scores were calculated by HoT SPS [27] method.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2823711&req=5

Figure 5: Distributions of alignment quality scores of data sets representing D. melanogaster - D. pseudoobscura pair from real genomes, Pollard et al. [21], and our benchmark. The collected data sets from each of the three sources were aligned by Pecan [33] and then their alignment quality scores were calculated by HoT SPS [27] method.
Mentions: We found a substantial disagreement between our performance estimates and those previously reported by Pollard et al. [21] using their own benchmark. For instance, the alignment sensitivity for the D. melanogaster - D. pseudoobscura pair comes out to be ~70% in our assessment and ~40% by their estimates, using the Mlagan alignment tool. We observe such gaps (with higher numbers in our benchmark) also for alignment specificity, and for other species pairs and alignment programs as well (Additional files 7 and 8). (We confirmed this by evaluating the alignment programs ourselves on the Pollard et al. [21] benchmark, see Methods.) While this discordance could be in part due to the fact that our benchmark employs a spectrum of parameter values to achieve greater and more realistic variability, we believe the major difference here is that even the average substitution rate, a key parameter in both simulation programs, is widely different between their study and ours. The estimate used by Pollard et al. [21] (~2.4 substitutions per site) is based on silent positions in codons, while our estimate (~0.38 substitutions per site) reflects the average subsitution frequency (between these species) seen in non-coding sequences. In light of the results of Figure 2D, where we show that our benchmark accurately mirrors the range of alignment difficulty in real data, the use of non-coding sequences in estimating this key parameter seems better justified. We investigated this issue with additional tests. We collected data sets representing the D. melanogaster - D. pseudoobscura pair from Pollard et al. [21], as well as from our benchmark and the real genomes. The alignment quality score (HoT SPS) distributions were computed for each type of benchmark, and are shown in Figure 5. We observed a close agreement between our data sets and the real orthologous sequences, while the Pollard et al. [21] data sets were harder to align on average, consistent with the greater substitution rate used there. As noted in Methods, the overall substitution frequency observed in non-coding sequences may be viewed as an average of the corresponding frequency in conserved blocks and the much higher frequency outside conserved blocks. This average is determined by two key parameters α, the fraction of sequence length that falls into conserved blocks, and β, the ratio of the evolutionary rate of conserved blocks to that outside blocks. Given that the divergence estimate used by Pollard et al. [21] for these two species is ~2.24 (median) substitutions per site, if we are to treat this value as the neutral rate (i.e., rate outside conserved blocks) in non-coding sequences, what values of α and β would lead to the observed overall substitution frequency of 0.38? We determined that if β = 0.1, as was used by Pollard et al. [21] (and also by us), α has to be ~0.92, i.e., about 92% of non-coding sequences have to be conserved blocks, which is far higher than most current estimates of this parameter [37,42]. Similarly, if we are to trust the values of α = 0.2 and β = 0.1, as was used by Pollard et al. [21] (and also by us, based on estimates from real data), then the overall divergence, after averaging between conserved blocks and non-blocks, would be ~1.84 substitutions per site, far greater than what is observed (0.38). We therefore concluded that the use of synonymous substitution rates as the neutral rate for non-coding sequence is likely to lead to benchmarks with overly "diverged" sequences that are more difficult to align than real sequences from those species.

Bottom Line: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools.We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values.Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

ABSTRACT

Background: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequences. However, it is not clear if such benchmarks truly represent real sequence data from any given group of species, in terms of the difficulty of alignment tasks.

Results: We find that the conventional simulation approach, which relies on empirically estimated values for various parameters such as substitution rate or insertion/deletion rates, is unable to generate synthetic sequences reflecting the broad genomic variation in conservation levels. We tackle this problem with a new method for simulating non-coding sequence evolution, by relying on genome-wide distributions of evolutionary parameters rather than their averages. We then generate synthetic data sets to mimic orthologous sequences from the Drosophila group of species, and show that these data sets truly represent the variability observed in genomic data in terms of the difficulty of the alignment task. This allows us to make significant progress towards estimating the alignment accuracy of current tools in an absolute sense, going beyond only a relative assessment of different tools. We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values. Interestingly, the performance of most tools degrades more rapidly when there are more insertions than deletions in the data set, suggesting an asymmetric handling of insertions and deletions, even though none of the evaluated tools explicitly distinguishes these two types of events. We also examine the accuracy of two existing tools for annotating insertions versus deletions, and find their performance to be close to optimal in Drosophila non-coding sequences if provided with the true alignments.

Conclusion: We have developed a method to generate benchmarks for multiple alignments of Drosophila non-coding sequences, and shown it to be more realistic than traditional benchmarks. Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.

Show MeSH