Free energy estimation of short DNA duplex hybridizations.

Tulpan D, Andronescu M, Leger S - BMC Bioinformatics (2010)

Bottom Line: Sequence lengths range from 4 to 30 nucleotides and span a large GC-content percentage range. For perfect matches, we propose an extension of the Nearest-Neighbour Model that matches or exceeds the performance of the existing ones, both in terms of correlations and root mean squared errors. Based on our preliminary results, we conclude that no statistically significant differences exist among free energy approximations obtained with 4 publicly available and widely used programs, when benchmarked against a collection of 695 pairs of short oligos collected and curated by the authors of this work based on 29 publications.


Affiliation: National Research Council of Canada, Institute of Information Technology, 100 des Aboiteaux Street, Suite 1100, Moncton, NB E1A7R1, Canada. dan.tulpan@nrc-cnrc.gc.ca

ABSTRACT

Background: Estimation of DNA duplex hybridization free energy is widely used for predicting cross-hybridizations in DNA computing and microarray experiments. A number of software programs based on different methods and parametrizations are available for the theoretical estimation of duplex free energies. However, significant differences in free energy values are sometimes observed among estimations obtained with various methods, which makes it difficult to decide which value is the accurate one.

Results: We present in this study a quantitative comparison of the similarities and differences among four published DNA/DNA duplex free energy calculation methods and an extended Nearest-Neighbour Model for perfect matches based on triplet interactions. The comparison was performed on a benchmark data set with 695 pairs of short oligos that we collected and manually curated from 29 publications. Sequence lengths range from 4 to 30 nucleotides and span a large GC-content percentage range. For perfect matches, we propose an extension of the Nearest-Neighbour Model that matches or exceeds the performance of the existing ones, both in terms of correlations and root mean squared errors. The proposed model was trained on experimental data whose temperature, sodium concentration and sequence concentration span a wide range of values, which gives the model a higher power of generalization when used for free energy estimations of DNA duplexes under non-standard experimental conditions.

Conclusions: Based on our preliminary results, we conclude that no statistically significant differences exist among free energy approximations obtained with 4 publicly available and widely used programs, when benchmarked against a collection of 695 pairs of short oligos collected and curated by the authors of this work based on 29 publications. The extended Nearest-Neighbour Model based on triplet interactions presented in this work is capable of performing accurate estimations of free energies for perfect match duplexes under both standard and non-standard experimental conditions and may serve as a baseline for further developments in this area of research.


Figure 16: Distribution of triplet frequencies. The distribution of frequencies for all triplets corresponding to 340 perfect matches is presented. The triplet with the lowest frequency is CCC/GGG and the one with highest is GAC/CTG.
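A triplet frequency distribution of this kind can be tallied with a sliding window over each sequence. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, each triplet is labelled with its position-wise complement (as in CCC/GGG and GAC/CTG), and triplets that are equivalent when read from the opposite strand are not merged, which may differ from the counting used for the figure. The input sequences in the usage note are made-up toy strings, not benchmark data.

```python
from collections import Counter
from itertools import product

# Watson-Crick complement, applied position by position (no reversal),
# so that "GAC" is labelled "GAC/CTG" as in the figure caption.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def triplet_label(triplet):
    """Return the 'top/bottom' label of a perfect match triplet."""
    return f"{triplet}/{triplet.translate(COMPLEMENT)}"


def triplet_counts(sequences):
    """Count every overlapping triplet in a collection of DNA strands."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A strand of length n contains n - 2 overlapping triplets.
        for i in range(len(seq) - 2):
            counts[triplet_label(seq[i:i + 3])] += 1
    return counts


def missing_triplets(counts):
    """List the triplet labels (out of 4**3 = 64) that never occur."""
    all_labels = {triplet_label("".join(p)) for p in product("ACGT", repeat=3)}
    return sorted(all_labels - set(counts))
```

For example, `triplet_counts(["GACCC", "gac"])` counts GAC/CTG twice and ACC/TGG and CCC/GGG once each, and `missing_triplets` on that result lists the remaining 61 labels. An empty list from `missing_triplets` would indicate complete coverage of the triplet space for a given set.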

Mentions: The evaluation process of the TNN-Triplets-PM Model is summarized in Table 6. The evaluation is repeated 10 000 times in this work, and each iteration consists of the following steps. First, the data set is divided uniformly at random into a training set, TrS, consisting of 67% of the data, and a testing set, TeS, that contains the remaining 33%. Next, the training process described in Table 5 is used to derive the first set of perfect match triplet parameters. The derived parameters are then used to compute the Pearson product-moment correlation coefficient and the RMSE for the DNA perfect match duplexes in TeS. Each correlation coefficient and RMSE is recorded in a corresponding vector to be analyzed later. Because the training and testing sets are generated by a randomized mechanism, complete coverage of the triplet space (i.e. all possible triplets) is not ensured for some of the 10 000 sets, mostly due to the presence of a few under-represented (fewer than 20 occurrences of CCC/GGG) or over-represented (more than 180 occurrences of GAC/CTG) triplets in the perfect match data set (see Figures 15 and 16). Nevertheless, we noticed that the training sets that produced the best results cover the triplet space completely. The same coverage was observed for the doublets.
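The repeated random-split evaluation described above can be sketched as follows. This is a minimal illustration, not the procedure of Table 6: the training step of Table 5 is not reproduced here, so `fit` and `predict` are hypothetical callables supplied by the caller, and the per-base model at the end is a deliberately crude placeholder rather than the triplet model. Only the 67%/33% split, the repetition, and the per-iteration Pearson correlation and RMSE follow the text.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation; NaN if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else float("nan")


def rmse(predicted, observed):
    """Root mean squared error between predictions and measurements."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def evaluate(data, fit, predict, iterations=10_000, train_fraction=0.67, seed=0):
    """Repeat a random train/test split and record one r and one RMSE per split.

    data: list of (sequence, measured_free_energy) pairs.
    fit(train) -> params and predict(params, sequence) -> float are
    caller-supplied stand-ins for the training step (assumption).
    """
    rng = random.Random(seed)
    n_train = round(train_fraction * len(data))
    correlations, errors = [], []
    for _ in range(iterations):
        shuffled = data[:]
        rng.shuffle(shuffled)  # uniform random permutation
        train, test = shuffled[:n_train], shuffled[n_train:]
        params = fit(train)
        predicted = [predict(params, seq) for seq, _ in test]
        observed = [dg for _, dg in test]
        correlations.append(pearson(predicted, observed))
        errors.append(rmse(predicted, observed))
    return correlations, errors


# Placeholder model (assumption): one free energy increment per nucleotide.
def fit_per_base(train):
    return sum(dg for _, dg in train) / sum(len(seq) for seq, _ in train)


def predict_per_base(per_base, seq):
    return per_base * len(seq)
```

A call such as `evaluate(data, fit_per_base, predict_per_base, iterations=100)` returns two lists with one entry per split, which can then be summarized (mean, spread, best split). Fixing `seed` keeps the sequence of splits reproducible between runs.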

