Next generation sequencing reads comparison with an alignment-free distance.

Weitschek E, Santoni D, Fiscon G, De Cola MC, Bertolazzi P, Felici G - BMC Res Notes (2014)

Bottom Line: This method does not rely on the alignment of the reads and it is based on the distance between the frequencies of their substrings of fixed length (k-mers). We exhibit experimental evidence that the proposed alignment-free distance is a potentially useful read-to-read distance measure and performs better than the more time-consuming distances based on alignment. Alignment-free distances may be used effectively for read comparison, and may provide a significant speed-up in several processes based on NGS sequencing (e.g., DNA assembly, read classification).


Affiliation: Department of Engineering, Roma Tre University, Via della Vasca Navale 79, 00146 Rome, Italy. emanuel@dia.uniroma3.it.

ABSTRACT

Background: Next Generation Sequencing (NGS) machines extract from a biological sample a large number of short DNA fragments (reads). These reads are then used for several applications, e.g., sequence reconstruction, DNA assembly, gene expression profiling, mutation analysis.

Methods: We propose a method to evaluate the similarity between reads. This method does not rely on the alignment of the reads; it is based on the distance between the frequencies of their substrings of fixed length (k-mers). We compare this alignment-free distance with the similarity measures derived from two alignment methods: Needleman-Wunsch and BLAST. The comparison rests on a simple assumption: the most correct distance is the one obtained by knowing the reference sequence in advance. Therefore, we first align the reads to the original DNA sequence, compute the overlap between the aligned reads, and use this overlap as an ideal distance. We then verify how well the alignment-free and the alignment-based distances reproduce this ideal distance. The ability to correctly reproduce the ideal distance is evaluated over samples of read pairs from Saccharomyces cerevisiae, Escherichia coli, and Homo sapiens. The comparison is based on the correctness of threshold predictors cross-validated over different samples.
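As an illustration of the alignment-free idea described in the Methods, the following minimal Python sketch computes a distance between the k-mer frequency profiles of two reads. The abstract does not state which distance function or which value of k is used, so the Euclidean distance and the default k = 4 are assumptions made here for illustration only, and the function names `kmer_frequencies` and `alignment_free_distance` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt
from typing import Dict


def kmer_frequencies(read: str, k: int = 4) -> Dict[str, float]:
    """Relative frequency of every A/C/G/T k-mer in a read (sliding window, step 1)."""
    read = read.upper()
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    # Enumerate all 4**k possible k-mers so that both reads share the same axes.
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[m] for m in all_kmers)
    if total == 0:
        return {m: 0.0 for m in all_kmers}
    return {m: counts[m] / total for m in all_kmers}


def alignment_free_distance(read_a: str, read_b: str, k: int = 4) -> float:
    """Euclidean distance between two k-mer frequency profiles.

    Assumption: the Euclidean metric is chosen for illustration; the source
    only says the distance is computed between k-mer frequencies.
    """
    fa = kmer_frequencies(read_a, k)
    fb = kmer_frequencies(read_b, k)
    return sqrt(sum((fa[m] - fb[m]) ** 2 for m in fa))


if __name__ == "__main__":
    a = "ACGTACGTTAGCCGATAGCTAGCTAGGATCCA"
    b = "CGTACGTTAGCCGATAGCTAGCTAGGATCCAT"  # same fragment shifted by one base
    c = "TTTTTTTTGGGGGGGGTTTTTTTTGGGGGGGG"
    print(alignment_free_distance(a, b))  # overlapping reads: small distance
    print(alignment_free_distance(a, c))  # unrelated reads: larger distance
```

No alignment is computed at any point: each read is scanned once to build its profile, and the comparison itself only touches the 4^k frequency values, which is where the expected speed-up over alignment-based distances comes from.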

Results: We exhibit experimental evidence that the proposed alignment-free distance is a potentially useful read-to-read distance measure and performs better than the more time-consuming distances based on alignment.

Conclusions: Alignment-free distances may be used effectively for reads comparison, and may provide a significant speed-up in several processes based on NGS sequencing (e.g., DNA assembly, reads classification).


Figure 4: AUC values for target percentile values. AUC values for each percentile value of the target BT distance. The three panels report AUC values for the three predictor distances (NW, BL, AF). Results are provided on samples for yeast (panel A, YA), E. coli (panel B, EA), and human (panel C, HA).

Mentions: We start by presenting the ROC curves for four values (0.10, 0.15, 0.20, and 0.25) of the target distance percentiles, each corresponding to a specific value of BT. Figures 1, 2, and 3 report the ROC curves for the predictors NW, BL, and AF for the four reference values and the three organisms. As one can observe, both the AF (green, solid) and BL (blue, dotted) curves perform much better than NW (red, dashed). Figure 1 depicts the ROC curves for yeast and shows that both AF and BL perform much better than NW, with AUC values higher than 0.9, while the NW curves have much smaller AUC values (close to 0.7). Figure 2 relates to E. coli and shows a very stable scenario: all three measures predict BT precisely for all thresholds, reaching AUC values close to 1. Figure 3 relates to human and shows that AF performs slightly better than NW, which in turn performs slightly better than BL: the AUC values of AF are close to 0.95, those of NW are around 0.91, and those of BL range from 0.88 to 0.90. A more comprehensive view of the performance of the three predictors can be gleaned from the three panels in Figure 4, which report the AUC values for all 100 percentiles of the target distance, for three samples from yeast, E. coli, and human. Similar results are obtained when the other five samples from each organism are used (see Additional file 1). The charts clearly show that, for all three predictors, precision decreases for higher percentiles (i.e., larger values of the target distance). Lower percentiles (i.e., lower BT) correspond to a higher level of overlap, and we conclude that for these percentiles BT is easily predicted by the three measures.
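The evaluation summarized above can be sketched as follows: for a chosen percentile of the target distance, read pairs whose target distance falls at or below that percentile are treated as positives, and the AUC measures how well a predictor distance ranks positives ahead of negatives. The excerpt does not spell out how the ROC curves were built, so the labeling rule, the nearest-rank percentile, and the function names `percentile_value` and `auc_for_percentile` are assumptions made for illustration.

```python
import math
from typing import Sequence


def percentile_value(values: Sequence[float], q: float) -> float:
    """Nearest-rank q-quantile of the values, for 0 < q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def auc_for_percentile(predictor: Sequence[float],
                       target: Sequence[float],
                       q: float) -> float:
    """AUC of a predictor distance for the event 'target <= q-th percentile'.

    Computed as the probability that a randomly chosen positive pair has a
    smaller predictor distance than a randomly chosen negative pair, with
    ties counted as one half (the Mann-Whitney form of the AUC).
    """
    cutoff = percentile_value(target, q)
    pos = [p for p, t in zip(predictor, target) if t <= cutoff]
    neg = [p for p, t in zip(predictor, target) if t > cutoff]
    if not pos or not neg:
        raise ValueError("both classes must be non-empty at this percentile")
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy numbers, not data from the paper.
    target = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    predictor = [0.15, 0.10, 0.35, 0.30, 0.55, 0.70, 0.60, 0.90]
    for q in (0.10, 0.15, 0.20, 0.25):
        print(q, auc_for_percentile(predictor, target, q))
```

The pairwise loop is quadratic in the number of read pairs, which is acceptable for a sketch; a rank-based computation would be preferable for large samples. Repeating the call for q = 0.01, 0.02, ..., up to the largest percentile that leaves both classes non-empty yields one AUC value per percentile, which is the kind of curve Figure 4 reports.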

