Free energy estimation of short DNA duplex hybridizations.

Tulpan D, Andronescu M, Leger S - BMC Bioinformatics (2010)

Bottom Line: Sequence lengths range from 4 to 30 nucleotides and span a large GC-content percentage range. For perfect matches, we propose an extension of the Nearest-Neighbour Model that matches or exceeds the performance of the existing ones, both in terms of correlations and root mean squared errors. Based on our preliminary results, we conclude that no statistically significant differences exist among free energy approximations obtained with 4 publicly available and widely used programs, when benchmarked against a collection of 695 pairs of short oligos collected and curated by the authors of this work based on 29 publications.


Affiliation: National Research Council of Canada, Institute of Information Technology, 100 des Aboiteaux Street, Suite 1100, Moncton, NB E1A7R1, Canada. dan.tulpan@nrc-cnrc.gc.ca

ABSTRACT

Background: Estimation of DNA duplex hybridization free energy is widely used for predicting cross-hybridizations in DNA computing and microarray experiments. A number of software programs based on different methods and parametrizations are available for the theoretical estimation of duplex free energies. However, significant differences in free energy values are sometimes observed among estimations obtained with various methods, which makes it difficult to decide which value is accurate.

Results: We present in this study a quantitative comparison of the similarities and differences among four published DNA/DNA duplex free energy calculation methods and an extended Nearest-Neighbour Model for perfect matches based on triplet interactions. The comparison was performed on a benchmark data set with 695 pairs of short oligos that we collected and manually curated from 29 publications. Sequence lengths range from 4 to 30 nucleotides and span a large GC-content percentage range. For perfect matches, we propose an extension of the Nearest-Neighbour Model that matches or exceeds the performance of the existing ones, both in terms of correlations and root mean squared errors. The proposed model was trained on experimental data whose temperature, sodium concentration and sequence concentration span a wide range of values, which gives the model greater power of generalization when used for free energy estimations of DNA duplexes under non-standard experimental conditions.
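To make the doublet versus triplet distinction concrete, the sketch below counts overlapping doublets (k = 2) or triplets (k = 3) in a strand and sums one parameter per occurrence. It is an illustration only: the function names, the purely additive form and the placeholder parameter value are assumptions, not the authors' implementation or their fitted parameters.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in a strand (k=2 doublets, k=3 triplets)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def additive_energy(seq, params, k):
    """Sum one (hypothetical) parameter per k-mer occurrence.

    `params` maps a k-mer to an energy contribution; k-mers missing from
    the table contribute 0. Real models also include terms that are not
    shown here, so this is only the sequence-dependent core of the idea.
    """
    counts = kmer_counts(seq, k)
    return sum(params.get(kmer, 0.0) * n for kmer, n in counts.items())


# Placeholder parameter, chosen only to show the arithmetic.
example = additive_energy("AAAA", {"AA": -1.0}, k=2)
```

A strand of length n yields n - 1 doublets but only n - 2 triplets, and there are 64 possible triplets against 16 doublets, so a triplet table has more parameters to fit from the same data.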

Conclusions: Based on our preliminary results, we conclude that no statistically significant differences exist among free energy approximations obtained with 4 publicly available and widely used programs, when benchmarked against a collection of 695 pairs of short oligos collected and curated by the authors of this work based on 29 publications. The extended Nearest-Neighbour Model based on triplet interactions presented in this work is capable of performing accurate estimations of free energies for perfect match duplexes under both standard and non-standard experimental conditions and may serve as a baseline for further developments in this area of research.

Figure 4: Box plots for Pearson correlations (r) corresponding to all 340 perfect match duplexes. The figure represents box plots for Pearson correlation coefficients for all 340 perfect match duplex free energies measured at various temperatures, sequence and sodium concentrations. The doublet- and triplet-based models were executed 10 000 times on randomly selected subsets with 67% training data and 33% testing data.

Mentions: If we consider only perfect match data, the TNN-Triplets-PM Model (see Methods) estimates free energies that correlate better (r = 0.92) with experimental values (see Figure 3) than those of all the other methods, which show an average correlation coefficient of r = 0.68. We also notice an improvement in the RMSE for the TNN-Triplets-PM Model compared to the other programs. To ensure that this improvement is due to the triplet aspect of the model rather than to confounding factors, we created a TNN-Doublets-PM Model that was trained and evaluated on the same perfect match data set. A detailed description of the training and evaluation procedure is provided in Tables 5 and 6. For the complete data set with perfect matches measured at various temperatures and buffer concentrations, Figures 4, 5, 6, 7, 8 and 9 show that our TNN-Triplets-PM Model consistently produces better correlations and RMSEs when we run a random design experiment using 10 000 randomly selected subsets, with 67% of the duplexes (228 perfect match duplexes) used for training and 33% (112 perfect match duplexes) used for testing. The same high correlations can be observed when running the TNN-Triplets-PM Model on perfect match duplex free energies measured at a temperature of 25°C and 1 M sodium concentration. For perfect match free energies measured at 37°C and 1 M sodium concentration, the other models produce better correlations (0.9) and RMSEs (0.7), although these remain comparable to those of the TNN-Triplets-PM Model.
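The random design experiment described above (repeated 67%/33% splits scored by Pearson r and RMSE) can be sketched as follows. This is a minimal illustration under stated assumptions: `random_split_eval`, its `fit` callback and the data layout are hypothetical names introduced here, and the model-fitting step itself is left to the caller rather than reproduced from the paper.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(pred, obs):
    """Root mean squared error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def random_split_eval(data, fit, n_runs=10_000, train_frac=0.67, seed=0):
    """Repeat a random train/test split and score each run.

    data : list of (features, observed_free_energy) pairs
    fit  : hypothetical callback; takes the training pairs and returns a
           function mapping features to a predicted free energy
    Returns one (pearson_r, rmse) tuple per run, computed on the test part.
    """
    rng = random.Random(seed)
    n_train = round(train_frac * len(data))  # 340 duplexes -> 228 train, 112 test
    scores = []
    for _ in range(n_runs):
        shuffled = rng.sample(data, len(data))
        train, test = shuffled[:n_train], shuffled[n_train:]
        predict = fit(train)
        obs = [y for _, y in test]
        pred = [predict(x) for x, _ in test]
        scores.append((pearson_r(pred, obs), rmse(pred, obs)))
    return scores
```

The resulting list of per-run scores is what a box plot such as the one in Figure 4 would summarize, one box per model.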
