Free energy estimation of short DNA duplex hybridizations.

Tulpan D, Andronescu M, Leger S - BMC Bioinformatics (2010)

Bottom Line: Sequence lengths range from 4 to 30 nucleotides and span a large GC-content percentage range. For perfect matches, we propose an extension of the Nearest-Neighbour Model that matches or exceeds the performance of the existing ones, both in terms of correlations and root mean squared errors. Based on our preliminary results, we conclude that no statistically significant differences exist among free energy approximations obtained with 4 publicly available and widely used programs, when benchmarked against a collection of 695 pairs of short oligos collected and curated by the authors of this work based on 29 publications.


Affiliation: National Research Council of Canada, Institute of Information Technology, 100 des Aboiteaux Street, Suite 1100, Moncton, NB E1A7R1, Canada. dan.tulpan@nrc-cnrc.gc.ca

ABSTRACT

Background: Estimation of DNA duplex hybridization free energy is widely used for predicting cross-hybridizations in DNA computing and microarray experiments. A number of software programs based on different methods and parametrizations are available for the theoretical estimation of duplex free energies. However, significant differences in free energy values are sometimes observed among estimations obtained with various methods, which makes it difficult to decide which value is the accurate one.

Results: We present in this study a quantitative comparison of the similarities and differences among four published DNA/DNA duplex free energy calculation methods and an extended Nearest-Neighbour Model for perfect matches based on triplet interactions. The comparison was performed on a benchmark data set with 695 pairs of short oligos that we collected and manually curated from 29 publications. Sequence lengths range from 4 to 30 nucleotides and span a large GC-content percentage range. For perfect matches, we propose an extension of the Nearest-Neighbour Model that matches or exceeds the performance of the existing ones, both in terms of correlations and root mean squared errors. The proposed model was trained on experimental data whose temperature, sodium concentration and sequence concentration span a wide range of values, which gives the model a higher power of generalization when used for free energy estimations of DNA duplexes under non-standard experimental conditions.

Conclusions: Based on our preliminary results, we conclude that no statistically significant differences exist among free energy approximations obtained with 4 publicly available and widely used programs, when benchmarked against a collection of 695 pairs of short oligos collected and curated by the authors of this work based on 29 publications. The extended Nearest-Neighbour Model based on triplet interactions presented in this work is capable of performing accurate estimations of free energies for perfect match duplexes under both standard and non-standard experimental conditions and may serve as a baseline for further developments in this area of research.
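The abstract compares methods by correlation and root mean squared error between estimated and experimental free energies. A minimal sketch of those two measures follows, assuming the Pearson correlation coefficient is the one meant. The function names and the four toy values (taken here to be in kcal/mol) are invented for illustration and are not from the benchmark data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rmse(predicted, experimental):
    """Root mean squared error between predicted and experimental values."""
    n = len(predicted)
    squared_errors = ((p - e) ** 2 for p, e in zip(predicted, experimental))
    return math.sqrt(sum(squared_errors) / n)


# Toy free energies, invented for illustration only (assumed kcal/mol).
experimental = [-5.0, -7.5, -10.0, -12.5]
predicted = [-5.5, -7.0, -10.5, -12.0]

print(f"r    = {pearson_r(predicted, experimental):.3f}")
print(f"RMSE = {rmse(predicted, experimental):.3f}")
```

A method can score a high correlation while still carrying a systematic offset, which is why the two measures are reported together.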



Figure 13: The distribution of sequence lengths for the complete data set of 695 DNA duplexes. The 8-mers and 9-mers, followed by the 17-mers, have the highest frequencies, while the 4-mers, 5-mers and 25-mers have the lowest frequencies.

Mentions: The benchmark data set used in this work consists of 695 experimental free energies and secondary structures for DNA duplexes, including 340 perfect matches and 355 imperfect matches. We collected these data from 29 publications and present their characteristics in Table 1. A total of 42 DNA duplexes were removed from the original data set (737 DNA duplexes, see Additional file 1) because the ctEnergy function from UNAFold failed to produce valid free energies for them, owing to the lack of DNA parameters for mismatches. The removed data correspond to 30 duplexes from [31], 4 duplexes from [32], 4 duplexes from [33], 2 duplexes from [34] and 2 duplexes from [35]. The lengths of the DNA sequences in the data set range from 4 nucleotides [29] to 30 nucleotides [36], with some lengths (8 and 9) being overrepresented (see Figure 13).
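The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the variable names are chosen here for illustration.

```python
# Duplexes removed per cited source, as reported in the text.
removed_per_reference = {"[31]": 30, "[32]": 4, "[33]": 4, "[34]": 2, "[35]": 2}

original_total = 737      # duplexes in the original data set
perfect_matches = 340     # perfect-match duplexes in the benchmark
imperfect_matches = 355   # imperfect-match duplexes in the benchmark

removed_total = sum(removed_per_reference.values())
benchmark_total = original_total - removed_total

# The per-source removals add up to the stated 42 duplexes.
assert removed_total == 42
# Removing them from the original set leaves the stated benchmark size.
assert benchmark_total == 695
# The perfect/imperfect split accounts for every benchmark duplex.
assert perfect_matches + imperfect_matches == benchmark_total

print(f"removed: {removed_total}, benchmark: {benchmark_total}")
```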

