Genomic signal processing methods for computation of alignment-free distances from DNA sequences.

Borrayo E, Mendizabal-Ruiz EG, Vélez-Pérez H, Romo-Vázquez R, Mendizabal AP, Morales JA - PLoS ONE (2014)

Bottom Line: We introduce a DNA sequence-to-signal mapping function based on the employment of doublet values, which increases the number of possible amplitude values for the generated signal. Additionally, we explore the use of three DSP distance metrics as descriptors for categorizing DNA signal fragments. Our results indicate the feasibility of employing GAFD for computing sequence distances and the use of descriptors for characterizing DNA fragments.


Affiliation: Computer Sciences Department, CUCEI - Universidad de Guadalajara, Guadalajara, México.

ABSTRACT
Genomic signal processing (GSP) refers to the use of digital signal processing (DSP) tools for analyzing genomic data such as DNA sequences. A possible application of GSP that has not been fully explored is the computation of the distance between a pair of sequences. In this work we present GAFD, a novel GSP alignment-free distance computation method. We introduce a DNA sequence-to-signal mapping function based on the employment of doublet values, which increases the number of possible amplitude values for the generated signal. Additionally, we explore the use of three DSP distance metrics as descriptors for categorizing DNA signal fragments. Our results indicate the feasibility of employing GAFD for computing sequence distances and the use of descriptors for characterizing DNA fragments.
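
To illustrate why a doublet-based mapping increases the number of possible amplitude values, the following is a minimal sketch and not the paper's actual mapping function. It assumes overlapping doublets and a hypothetical amplitude table (`DOUBLET_VALUE`) that assigns the integers 0 to 15 to the 16 dinucleotides in lexicographic order; the real doublet values used by GAFD are not given in this excerpt.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical amplitude table (an assumption, not the paper's values):
# AA=0, AC=1, AG=2, AT=3, CA=4, ..., TT=15.
DOUBLET_VALUE = {a + b: i for i, (a, b) in enumerate(product(BASES, repeat=2))}


def doublet_signal(seq):
    """Map a DNA string to one amplitude per overlapping doublet."""
    seq = seq.upper()
    return [DOUBLET_VALUE[seq[i:i + 2]] for i in range(len(seq) - 1)]


def single_base_signal(seq):
    """Baseline for comparison: one of only four amplitudes per nucleotide."""
    return [BASES.index(b) for b in seq.upper()]


signal = doublet_signal("ACGTTGCA")
print(signal)  # [1, 6, 11, 15, 14, 9, 4]
```

A single-nucleotide mapping can take only four amplitude values, whereas this doublet table allows sixteen, at the cost of a signal that is one sample shorter than the sequence when doublets overlap.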



Figure 6 (pone-0110954-g006): Depiction of the similarity space created with the three GSP distance descriptors. A random 20 nt sequence was created. Using this sequence as a template, all possible combinations of up to three substitutions were created and measured against the template using the three distance descriptors. The dots in A, B, and C correspond to the distances for one (red), two (blue), and three (green) substitutions, respectively. As expected, the more substitutions present, the farther the points scatter along the frequency peak. Subsequently, starting with the same template, all possible combinations of insertions, deletions, and substitutions were created and measured in the same way. The dots in D, E, and F correspond to the distances for insertions (yellow), deletions (brown), and substitutions (green). The distance scatters shift between substitutions and indels, which is especially evident in the Correlation and Derivative descriptors. The blue scatter in A through C is equal to the green scatter in D through F.

Mentions: To explore the three-dimensional space generated by the proposed descriptors, we performed an experiment in which we perturbed a randomly generated DNA sequence (20 nt, per the caption of Figure 6) and the DNA signal it generates. Using this sequence as the “mother sequence”, we generated all the DNA sequences and signals corresponding to all possible combinations of one, two, and three changes, considering all possible types of changes (i.e., substitutions, deletions, and insertions). Every pair of signals generated a point in this space (Figure 6). The points from comparisons involving one change were located near the origin, while those involving two or three changes lay at increasing distance from the origin according to the number of changes. Additionally, the points corresponding to substitutions were well separated from those corresponding to insertions and deletions (Figures 6D and 6E). These results demonstrate that GAFD can characterize the type of change present using a classification technique that combines several descriptors. However, coherence gave the poorest results, showing a lack of specificity in distinguishing insertions from substitutions. This result is supported by Sims et al. [53], who reported that optimal resolutions proved critical for genomic comparisons. Moreover, studies have shown that coherence AR models depend highly on the parameters employed [54].
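
The perturbation step described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the template is an arbitrary 20 nt stand-in rather than the paper's random sequence, the function names are hypothetical, and only single insertions and deletions are enumerated, whereas the experiment covers up to three changes of every type. Each variant would then be converted to a signal and compared against the template with the three descriptors, which are not defined in this excerpt.

```python
from itertools import combinations, product

BASES = "ACGT"


def substitution_variants(template, k):
    """All sequences that differ from template by exactly k substitutions."""
    variants = set()
    for positions in combinations(range(len(template)), k):
        # At each chosen position, only the three other bases are allowed.
        choices = [[b for b in BASES if b != template[p]] for p in positions]
        for replacement in product(*choices):
            seq = list(template)
            for p, b in zip(positions, replacement):
                seq[p] = b
            variants.add("".join(seq))
    return variants


def single_deletions(template):
    """All distinct sequences obtained by deleting one nucleotide."""
    return {template[:i] + template[i + 1:] for i in range(len(template))}


def single_insertions(template):
    """All distinct sequences obtained by inserting one nucleotide."""
    return {template[:i] + b + template[i:]
            for i in range(len(template) + 1) for b in BASES}


template = "ACGTACGTTGCAACGTTGCA"  # arbitrary 20 nt stand-in (assumption)
counts = {k: len(substitution_variants(template, k)) for k in (1, 2, 3)}
print(counts)  # {1: 60, 2: 1710, 3: 30780}
```

For a 20 nt template the substitution counts follow C(20, k) · 3^k, so the three-substitution cloud alone contains 30,780 points. Sets are used for indels because inserting or deleting within a run of identical bases produces the same sequence more than once.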

