Detection and classification of peaks in 5' cap RNA sequencing data.

Strbenac D, Armstrong NJ, Yang JY - BMC Genomics (2013)

Bottom Line: We develop a two-stage approach consisting of a fast and simple algorithm based on a sliding window with Poisson distribution for detecting the genomic locations of peaks, followed by a linear support vector machine classifier to distinguish between peaks which represent the initiation of transcription and peaks that do not. Addition of pooled external data or matched RNA sequencing data resulted in gains of recall with equivalent precision. The Poisson sliding window model is orders of magnitude faster than doing whole genome segmentation.


ABSTRACT

Background: The large-scale sequencing of 5' cap enriched cDNA promises to reveal the diversity of transcription initiation across entire genomes. The process of transcription is noisy, and there is often no single, exact start site. This creates the need for a fast and simple method of identifying transcription start peaks based on this type of data. Due to both biological and technical noise, many of the peaks seen are not real transcription initiation events. Classification of the observed peaks is an essential filtering step in the discovery of genuine initiation locations.

Results: We develop a two-stage approach consisting of a fast and simple algorithm based on a sliding window with Poisson distribution for detecting the genomic locations of peaks, followed by a linear support vector machine classifier to distinguish between peaks which represent the initiation of transcription and peaks that do not. Comparison of classification performance to the best existing method based on whole genome segmentation showed comparable precision and improved recall. Internal features, which are intrinsic to the data and require no further experiments, had high precision and recall rates. Addition of pooled external data or matched RNA sequencing data resulted in gains of recall with equivalent precision.
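The first stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the window size (`half_window`), the significance cut-off (`alpha`), the background floor (`min_rate`), and the rule for merging adjacent significant bases are all assumptions chosen for illustration. Each base's read count is compared with a Poisson rate estimated from its local neighbourhood, and runs of significant bases are reported as peaks.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), via the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_peaks(coverage, half_window=50, alpha=1e-3, min_rate=0.1):
    """Hypothetical local Poisson thresholding over per-base read counts.

    Returns a list of (start, end) index pairs, inclusive.
    All parameter defaults are illustrative assumptions.
    """
    n = len(coverage)
    significant = []
    for i, count in enumerate(coverage):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        # Local rate: mean coverage of neighbouring positions,
        # excluding the tested base itself, floored at min_rate.
        neighbours = sum(coverage[lo:hi]) - count
        lam = max(min_rate, neighbours / max(1, hi - lo - 1))
        significant.append(count > 0 and poisson_sf(count, lam) < alpha)

    # Merge runs of adjacent significant bases into intervals.
    peaks, start = [], None
    for i, flag in enumerate(significant):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            peaks.append((start, i - 1))
            start = None
    if start is not None:
        peaks.append((start, n - 1))
    return peaks
```

Because the rate is estimated locally, the same routine can report both a single-base spike on an empty background and a wider region that stands out against low but non-zero coverage.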

Conclusions: The Poisson sliding window model is an effective and fast way of taking the peak neighbourhood into account, and finding statistically significant peaks over a range of transcript expression values. It is orders of magnitude faster than doing whole genome segmentation. The support vector classification scheme has better precision and recall than existing methods. Integrating additional datasets is shown to provide minor gains in recall, in comparison to using only the cap-sequencing data.
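For readers unfamiliar with the two metrics used in these comparisons, the sketch below shows how precision and recall could be computed from classifier output. The function name and the boolean-list input format are assumptions for illustration, not part of the published method.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) from parallel lists of booleans.

    predicted[i] is True if peak i was classified as a real initiation
    event; actual[i] is True if it is one according to the reference.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)      # true positives
    fp = sum(1 for p, a in pairs if p and not a)  # false positives
    fn = sum(1 for p, a in pairs if not p and a)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A gain in recall at equivalent precision, as reported for the added datasets, means more genuine initiation peaks are recovered without a larger share of false calls among the predictions.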



Figure 2: Peaks found by the algorithm in six cell lines. CAGE-seq strand-specific read coverage is shown along with tracks that contain boxes representing the areas that are found to be peaks. A narrow peak of between 1 and 3 bases wide is at the left side of the figure. A wider peak that varies between 76 bases and 207 bases is located at the right side.

Mentions: The local Poisson thresholding algorithm discovered tens of thousands of peaks in each sample (Table 2). About twice as many peaks were found for the H1-hESC cell line as for any other cell line. This is biologically expected, because stem cells have open chromatin and transcribe many tissue-specific transcripts that are otherwise silenced in differentiated cells [28]. Manual exploration of coverage tracks showed that the algorithm finds both broad and narrow peaks (Figure 2). By definition, the algorithm will not find the rarely observed, extremely broad peaks, some of which are thousands of bases wide. However, based on current biological knowledge, these peaks are not likely to be real TSS regions, and they were observed to overlap with known long 3' UTRs.

