Detection theory in identification of RNA-DNA sequence differences using RNA-sequencing.

Toung JM, Lahens N, Hogenesch JB, Grant G - PLoS ONE (2014)

Bottom Line: We found approximately 6,000 RDDs, the majority of which are A-to-G edits and likely to be mediated by ADAR. Moreover, we found the majority of non A-to-G RDDs to be associated with poorer alignments and conclude from these results that the evidence for widespread non-canonical RDDs in humans is weak. Overall, we found RNA-Seq to be a powerful technique for surveying RDDs genome-wide when coupled with the appropriate thresholds and filters.


Affiliation: Genomics and Computational Biology Graduate Program, University of Pennsylvania School of Medicine, Philadelphia, PA, United States of America.

ABSTRACT
Advances in sequencing technology have allowed for detailed analyses of the transcriptome at single-nucleotide resolution, facilitating the study of RNA editing or sequence differences between RNA and DNA genome-wide. In humans, two types of post-transcriptional RNA editing processes are known to occur: A-to-I deamination by ADAR and C-to-U deamination by APOBEC1. In addition to these sequence differences, researchers have reported the existence of all 12 types of RNA-DNA sequence differences (RDDs); however, the validity of these claims is debated, as many studies claim that technical artifacts account for the majority of these non-canonical sequence differences. In this study, we used a detection theory approach to evaluate the performance of RNA-Sequencing (RNA-Seq) and associated aligners in accurately identifying RNA-DNA sequence differences. By generating simulated RNA-Seq datasets containing RDDs, we assessed the effect of alignment artifacts and sequencing error on the sensitivity and false discovery rate of RDD detection. Overall, we found that even in the presence of sequencing errors, false negative and false discovery rates of RDD detection can be contained below 10% with relatively lenient thresholds. We also assessed the ability of various filters to target false positive RDDs and found them to be effective in discriminating between true and false positives. Lastly, we used the optimal thresholds we identified from our simulated analyses to identify RDDs in a human lymphoblastoid cell line. We found approximately 6,000 RDDs, the majority of which are A-to-G edits and likely to be mediated by ADAR. Moreover, we found the majority of non A-to-G RDDs to be associated with poorer alignments and conclude from these results that the evidence for widespread non-canonical RDDs in humans is weak. Overall, we found RNA-Seq to be a powerful technique for surveying RDDs genome-wide when coupled with the appropriate thresholds and filters.
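The detection-theory quantities named in the abstract (sensitivity, false negative rate, false discovery rate) follow from comparing the RDD sites planted in a simulated dataset against the sites an aligner-based pipeline calls. The sketch below uses the standard definitions only; the site representation as `(chromosome, position, DNA base, RNA base)` tuples, the function name `detection_rates`, and the toy data are hypothetical and are not taken from the paper.

```python
def detection_rates(simulated_sites, called_sites):
    """Compare called RDD sites against the simulated (true) ones.

    Both arguments are iterables of hashable site keys. The tuple layout
    (chrom, pos, dna_base, rna_base) used below is an assumption made for
    illustration, not the paper's data format.
    """
    truth = set(simulated_sites)
    called = set(called_sites)

    true_pos = len(truth & called)    # simulated RDDs that were detected
    false_neg = len(truth - called)   # simulated RDDs that were missed
    false_pos = len(called - truth)   # called sites that were never simulated

    n_true = true_pos + false_neg
    n_called = true_pos + false_pos
    return {
        "sensitivity": true_pos / n_true if n_true else 0.0,
        "false_negative_rate": false_neg / n_true if n_true else 0.0,
        "false_discovery_rate": false_pos / n_called if n_called else 0.0,
    }


# Toy data: 10 simulated A-to-G sites; 9 are recovered, plus 1 spurious call.
simulated = [("chr1", 1000 + i, "A", "G") for i in range(10)]
called = simulated[:9] + [("chr2", 500, "T", "C")]
rates = detection_rates(simulated, called)
```

In this toy case both the false negative rate and the false discovery rate come out at 10%, the boundary the abstract refers to.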


Figure 3 (pone-0112040-g003): Simulated versus observed levels of RNA-DNA sequence differences. Here we plot the simulated RDD level versus the observed level as determined by GSNAP, MapSplice, RUM, or Tophat for replicate 1. Sites with coverage less than 10x or an RDD level less than 10% per the simulated dataset are removed from consideration. Overall, we observed the correlation between simulated and observed levels to be approximately 98% in both datasets and across the various aligners and replicates.

Mentions: In many studies, the mere detection of RDDs is not sufficient. For example, in studies on RNA editing or differential allelic expression, information about the degree or level of difference is important. Here we analyzed the correlation between simulated and observed RDD levels. Based on our previous analyses, we restricted our study to sites with a minimum coverage of 10x, a minimum level of 10%, and a minimum of 1 read bearing the sequence difference base. Using these thresholds, we calculated the correlation between observed and simulated RDD levels to be relatively high, at approximately 98% on average across all three replicates for all four aligners and both datasets (Figure 3; Table S8 in File S1). Although the simulated and observed levels correspond well, we found that roughly 20 to 40% of sites in each dataset for any aligner have observed levels that deviate from the simulated values by more than 5% (Figure S5). In particular, we found that in the majority (75 to 90%) of cases in which the observed and simulated levels deviate by at least 10%, the observed level underestimates the simulated level.
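The level comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the thresholds (10x coverage, 10% level, 1 supporting read) come from the text and are applied to the simulated values as in the figure caption, but the use of Pearson correlation, the 5% tolerance being measured on the absolute level difference, the input layout, and all function names and toy numbers are hypothetical.

```python
from math import sqrt

MIN_COVERAGE = 10    # minimum reads covering the site (10x)
MIN_LEVEL = 0.10     # minimum RDD level (10%)
MIN_RDD_READS = 1    # minimum reads bearing the sequence difference base


def level(rdd_reads, coverage):
    """RDD level: fraction of covering reads that carry the difference base."""
    return rdd_reads / coverage if coverage > 0 else 0.0


def keep_site(rdd_reads, coverage):
    """Apply the three thresholds to one site (here: to its simulated counts)."""
    return (coverage >= MIN_COVERAGE
            and rdd_reads >= MIN_RDD_READS
            and level(rdd_reads, coverage) >= MIN_LEVEL)


def pearson(xs, ys):
    """Pearson correlation; assumed here, the text does not name the statistic."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def compare_levels(sites, tolerance=0.05):
    """sites: iterable of (sim_rdd_reads, sim_coverage, obs_rdd_reads, obs_coverage)."""
    sim, obs = [], []
    for sim_rdd, sim_cov, obs_rdd, obs_cov in sites:
        if keep_site(sim_rdd, sim_cov):
            sim.append(level(sim_rdd, sim_cov))
            obs.append(level(obs_rdd, obs_cov))
    if not sim:
        raise ValueError("no sites pass the thresholds")

    # Sites whose observed level differs from the simulated one by > tolerance.
    deviating = [(s, o) for s, o in zip(sim, obs) if abs(o - s) > tolerance]
    underestimated = sum(1 for s, o in deviating if o < s)
    return {
        "n_sites": len(sim),
        "correlation": pearson(sim, obs),
        "fraction_deviating": len(deviating) / len(sim),
        "fraction_underestimated": (underestimated / len(deviating)
                                    if deviating else 0.0),
    }


# Toy counts, invented for illustration only.
toy_sites = [
    (10, 20, 10, 20),   # level 0.50 simulated and observed
    (6, 50, 5, 48),     # 0.12 simulated, about 0.10 observed
    (30, 40, 20, 35),   # 0.75 simulated, about 0.57 observed: underestimated
    (2, 8, 2, 8),       # coverage below 10x: removed
    (1, 100, 1, 100),   # level of 1%: removed
    (12, 30, 12, 30),   # level 0.40 simulated and observed
]
summary = compare_levels(toy_sites)
```

With these toy counts, four sites pass the thresholds and one of them deviates by more than 5%, with the observed level falling below the simulated one, which is the direction of bias the text reports for most deviating sites.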

