Detrimental effects of duplicate reads and low complexity regions on RNA- and ChIP-seq data.

Dozmorov MG, Adrianto I, Giles CB, Glass E, Glenn SB, Montgomery C, Sivils KL, Olson LE, Iwayama T, Freeman WM, Lessard CJ, Wren JD - BMC Bioinformatics (2015)

Bottom Line: We investigate how trimming the adapters, removing duplicates, and filtering out reads overlapping low complexity regions influence the significance of biological signal in RNA- and ChIP-seq experiments. We assessed the effect of data processing steps on the alignment statistics and the functional enrichment analysis results of RNA- and ChIP-seq data. We compared differentially processed RNA-seq data with matching microarray data on the same patient samples to determine whether changes in pre-processing improved correlation between the two.


ABSTRACT

Background: Adapter trimming and removal of duplicate reads are common practices in next-generation sequencing pipelines. Sequencing reads ambiguously mapped to repetitive and low complexity regions can also be problematic for accurate assessment of the biological signal, yet their impact on sequencing data has not received much attention. We investigate how trimming the adapters, removing duplicates, and filtering out reads overlapping low complexity regions influence the significance of biological signal in RNA- and ChIP-seq experiments.

Methods: We assessed the effect of data processing steps on the alignment statistics and the functional enrichment analysis results of RNA- and ChIP-seq data. We compared differentially processed RNA-seq data with matching microarray data on the same patient samples to determine whether changes in pre-processing improved correlation between the two. We have developed a simple tool to remove reads overlapping low complexity regions, RepeatSoaker, available at https://github.com/mdozmorov/RepeatSoaker, and tested its effect on the alignment statistics and the results of the enrichment analyses.

Results: Both adapter trimming and duplicate removal moderately improved the strength of biological signals in RNA-seq and ChIP-seq data. Aggressive filtering of reads overlapping with low complexity regions, as defined by RepeatMasker, further improved the strength of biological signals, and the correlation between RNA-seq and microarray gene expression data.

Conclusions: Adapter trimming and duplicate removal, coupled with filtering out reads overlapping low complexity regions, are shown to increase the quality and reliability of detecting biological signals in RNA-seq and ChIP-seq data.


Figure 2: Differential expression detection. Number of differentially expressed genes detected after removing reads overlapping low complexity (LC) regions. Conditions for differential expression analysis: "all LC overlaps kept/removed" - reads touching/overlapping LC regions are either kept or removed, respectively; "25%/50%/75% LC overlaps removed" - reads overlapping LC regions by at least 25%/50%/75%, respectively, are removed before differential expression analysis.
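The filtering conditions in the caption can be illustrated with a minimal sketch. This is not RepeatSoaker's implementation; it assumes reads and LC regions are half-open (start, end) intervals on one chromosome, and that the percentage threshold refers to the fraction of a read's bases covered by LC regions. The function names and the `min_fraction` parameter are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals so each base is counted once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def lc_overlap_fraction(read: Interval, lc_regions: List[Interval]) -> float:
    """Fraction of the read's bases covered by already-merged LC regions."""
    start, end = read
    length = end - start
    if length <= 0:
        return 0.0
    covered = sum(
        max(0, min(end, lc_end) - max(start, lc_start))
        for lc_start, lc_end in lc_regions
    )
    return covered / length


def filter_reads(
    reads: List[Interval], lc_regions: List[Interval], min_fraction: float
) -> List[Interval]:
    """Drop reads whose LC overlap reaches the threshold (assumed semantics).

    min_fraction == 0.0 is treated as the most stringent setting:
    any read overlapping an LC region by at least one base is removed.
    """
    merged = merge_intervals(lc_regions)
    kept = []
    for read in reads:
        frac = lc_overlap_fraction(read, merged)
        remove = frac > 0.0 if min_fraction == 0.0 else frac >= min_fraction
        if not remove:
            kept.append(read)
    return kept


# Toy coordinates, made up for illustration only.
reads = [(0, 100), (100, 200), (200, 300), (300, 400)]
lc = [(90, 110), (150, 250), (240, 260)]
for threshold in (0.0, 0.25, 0.5, 0.75):
    print(threshold, len(filter_reads(reads, lc, threshold)))
```

In practice the same comparison is usually done on aligned reads in BAM or BED format with an interval tool; the sketch only shows how the threshold changes which reads survive.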

Mentions: We investigated the effect of RepeatSoaker on the number of differentially expressed (DE) genes in RNA-seq data with and without duplicates (Figure 2). The total number of DE genes decreased with more stringent removal of reads with RepeatSoaker. Each condition, from no RepeatSoaker to the most stringent RepeatSoaker threshold of 0%, had a unique set of genes that were not detected as differentially expressed in the other conditions ("leaves" of the Venn diagram, Figure 2). These genes showed fold change distributions comparable to those of the "core" gene set detected as DE in all conditions (Figure 3A). However, the expression level of those genes was lower (Figure 3B), making them susceptible to losing their DE status upon removal of reads with RepeatSoaker. We further investigated the biological significance of those genes. These were predominantly predicted (as opposed to known) genes and metabolism-related genes, as can be seen from their enrichment analysis results (Additional File 7). This observation suggests that RepeatSoaker removes biological "noise" from the data while increasing the signal that reflects the underlying biology of the experiment.
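The "core" and "leaves" of the Venn diagram described above correspond to simple set operations on the per-condition DE gene lists. The sketch below is illustrative only: the condition labels and gene identifiers are placeholders rather than results from the paper, and `core_and_leaves` is a hypothetical helper.

```python
from typing import Dict, Set, Tuple


def core_and_leaves(
    de_sets: Dict[str, Set[str]]
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Split per-condition DE gene sets into a shared core and unique leaves.

    core:   genes detected as DE in every condition.
    leaves: for each condition, genes detected as DE in that condition only.
    """
    if not de_sets:
        return set(), {}
    core = set.intersection(*de_sets.values())
    leaves = {}
    for name, genes in de_sets.items():
        # Union of DE genes from all other conditions.
        others = set().union(*(g for n, g in de_sets.items() if n != name))
        leaves[name] = genes - others
    return core, leaves


# Placeholder gene IDs and condition labels, made up for illustration.
de_sets = {
    "all LC overlaps kept": {"geneA", "geneB", "geneC", "geneX"},
    "75% LC overlaps removed": {"geneA", "geneB", "geneC"},
    "all LC overlaps removed": {"geneA", "geneB", "geneY"},
}
core, leaves = core_and_leaves(de_sets)
print(sorted(core))
print({name: sorted(genes) for name, genes in leaves.items()})
```

Comparing the expression levels and fold changes of each leaf set against the core set is then a matter of subsetting the expression table by these gene identifiers.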

