Differential and coherent processing patterns from small RNAs.

Pundhir S, Gorodkin J - Sci Rep (2015)

Bottom Line: While the annotated loci predominantly consist of ~24 nt short RNAs, the unannotated loci comparatively consist of ~17 nt short RNAs. Furthermore, these ~17 nt short RNAs are significantly enriched for overlap to transcription start sites and DNase I hypersensitive sites (p-value < 0.01) that are characteristic features of transcription initiation RNAs. We discuss how the computational pipeline developed in this study has the potential to be applied to other forms of RNA-seq data for further transcriptome-wide studies of differential and coherent processing.


Affiliation: Center for non-coding RNA in Technology and Health, IKVH, University of Copenhagen, Grønnegårdsvej 3, 1870, Frederiksberg C, Denmark.

ABSTRACT
Post-transcriptional processing events related to short RNAs are often reflected in the read profile patterns that emerge from high-throughput sequencing data. MicroRNA arm switching across different tissues is a well-known example of what we define as differential processing. Here, short RNAs from the nine cell lines of the ENCODE project, irrespective of their annotation status, were analyzed for genomic loci representing differential or coherent processing. We observed differential processing predominantly in RNAs annotated as miRNA, snoRNA or tRNA. Four out of five known cases of differentially processed miRNAs that were in the input dataset were recovered, and several novel cases were discovered. In contrast to differential processing, coherent processing is observed to be widespread in both annotated and unannotated regions. While the annotated loci predominantly consist of ~24 nt short RNAs, the unannotated loci comparatively consist of ~17 nt short RNAs. Furthermore, these ~17 nt short RNAs are significantly enriched for overlap with transcription start sites and DNase I hypersensitive sites (p-value < 0.01), which are characteristic features of transcription initiation RNAs. We discuss how the computational pipeline developed in this study has the potential to be applied to other forms of RNA-seq data for further transcriptome-wide studies of differential and coherent processing.



Figure 6: Various analysis steps to identify differential processing at a genomic locus. (A) Closely spaced mapped reads are grouped to define read profiles, or block groups, in each sample (see Methods and Supplementary Fig. S1). A block group can consist of ≥1 block of reads. The numbers in brackets represent the total read count within a block group. (B) We analyze all genomic loci where a block group is observed in all samples. (C) All block groups at a locus (Supplementary Fig. S16) are aligned against each other using deepBlockAlign18 to generate a square matrix of alignment scores (S). (D) A hierarchical clustering of block groups based on their alignment scores is generated using pvclust, which also assigns a p-value indicating how strongly each cluster is supported by the data80. We select all clusters with p-value < 0.05. (E) A cluster score (X), which measures how well the read profiles within a cluster are separated from read profiles outside the cluster, is computed (Equation 2). We select all loci with a cluster score of ≥0.15 as candidate differentially processed loci (DPL). (F) For each candidate DPL, we compare the normalized expression of the block group from each sample to that of every other sample using Fisher’s exact test. (F.1) Specifically, we use an m x 2 contingency matrix comprising the normalized expression of each block within the two block groups. The normalized expression of a block is computed by dividing its expression by a size factor (s) computed for each sample (Equation 4). (F.2) A ‘p-value matrix’ is created that contains the p-values from all vs. all comparisons of the normalized expression of block groups at a locus. (F.3) A ‘cluster matrix’ is created for each cluster identified in step D. It consists of the p-values from comparing the normalized expression of block groups inside a cluster to those outside the cluster. (G) A locus is marked as differentially processed if more than half of the samples in at least one ‘cluster matrix’ have a significant p-value (<0.05) against all the samples outside the cluster.
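
The decision rule in panel (G) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the dictionary-based p-value matrix and the sample labels are hypothetical, and it assumes one reading of the caption, namely that "samples in a cluster matrix" means the samples inside the cluster. The p-values are taken as given (for example from the Fisher's exact tests of step F).

```python
from typing import Dict, FrozenSet, Set

ALPHA = 0.05  # significance threshold quoted in panel (G)


def cluster_supports_dpl(
    pvalues: Dict[FrozenSet[str], float],
    cluster: Set[str],
    samples: Set[str],
    alpha: float = ALPHA,
) -> bool:
    """Return True if more than half of the samples inside `cluster` are
    significant (p < alpha) against every sample outside the cluster.

    `pvalues` maps an unordered pair of sample names to a p-value
    (a hypothetical stand-in for the 'cluster matrix').
    """
    outside = samples - cluster
    if not cluster or not outside:
        return False  # nothing to compare against
    n_distinct = sum(
        all(pvalues[frozenset((inside, other))] < alpha for other in outside)
        for inside in cluster
    )
    return n_distinct > len(cluster) / 2


# Hypothetical example: four samples, one candidate cluster {A, B}.
samples = {"A", "B", "C", "D"}
cluster = {"A", "B"}
pvalues = {
    frozenset(("A", "C")): 0.01,
    frozenset(("A", "D")): 0.02,
    frozenset(("B", "C")): 0.01,
    frozenset(("B", "D")): 0.30,  # B is not distinct from D
}
result_weak = cluster_supports_dpl(pvalues, cluster, samples)

pvalues[frozenset(("B", "D"))] = 0.04  # now B is distinct from D as well
result_strong = cluster_supports_dpl(pvalues, cluster, samples)
```

Under this reading, a locus would be marked as differentially processed when the function returns True for at least one of its clusters. In the example, only one of the two clustered samples is distinct in the first case, which is not more than half, so the rule fails until the second sample also becomes significant.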

Mentions: In an earlier study18, we developed a tool, deepBlockAlign, for the alignment of two read profiles. deepBlockAlign normalizes the read counts by the total reads within a read profile and then applies a two-tier strategy to align the read profiles (see Langenberger et al.18 for details). The alignment score (S) from deepBlockAlign ranges between 0 and 1, indicating dissimilar and perfectly similar read profiles, respectively. In this study, we have built a pipeline that integrates deepBlockAlign with appropriate preprocessing and validation steps to compare all pairs of read profiles at a genomic locus derived from RNA-seq experiments performed on multiple cell lines or tissues (N). As input, the pipeline takes each genomic locus, L, where a read profile is observed in all N cell lines (Fig. 6A). Next, a three-step preprocessing procedure is followed for each locus (Supplementary Fig. S16). This procedure ensures that: a) block groups from all N cell lines have the same number of read blocks; and b) each constituent block carries its raw expression value, which may have been altered by the fixed cut-off of 10% (-minBlockHeight) that we applied while defining the block groups using blockbuster. Firstly, all the non-overlapping (<90%) coordinates C_l, l = 1..L, where a read block is observed in at least one cell line are determined. Secondly, for each coordinate C_l, a single dummy read is placed, followed by retrieval of all the reads that mapped at this coordinate from the BAM file of mapped reads. Thirdly, if all the block groups at a genomic locus have one read block, an adjacent dummy block is added with an expression of 10% relative to the expression of the parent block group. This is done because deepBlockAlign requires at least two blocks of reads for a meaningful comparison of read profiles or block groups.
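
Two of the steps described above, the within-profile normalization of read counts and the third preprocessing step (adding a dummy block), can be sketched in Python as follows. This is only an illustration under stated assumptions: a block group is represented as a plain list of per-block read counts, the "adjacent" placement of the dummy block is modelled by appending it to the list, and the function names are hypothetical rather than part of deepBlockAlign or blockbuster.

```python
from typing import List

DUMMY_FRACTION = 0.10  # dummy block expression relative to the parent block group


def normalize_profile(block_counts: List[float]) -> List[float]:
    """Divide each block's read count by the total reads in the read profile."""
    total = sum(block_counts)
    if total <= 0:
        raise ValueError("a read profile must contain at least one read")
    return [count / total for count in block_counts]


def add_dummy_blocks(block_groups: List[List[float]]) -> List[List[float]]:
    """If every block group at a locus has exactly one block, append a dummy
    block worth 10% of the parent block group's expression to each of them.
    Otherwise return unchanged copies."""
    if all(len(group) == 1 for group in block_groups):
        return [group + [DUMMY_FRACTION * sum(group)] for group in block_groups]
    return [list(group) for group in block_groups]


# Hypothetical locus with one single-block group per cell line.
single_block_locus = [[100.0], [40.0]]
padded = add_dummy_blocks(single_block_locus)

# Hypothetical locus where one cell line already has two blocks: left as is.
mixed_locus = [[30.0, 10.0], [50.0]]
untouched = add_dummy_blocks(mixed_locus)

fractions = normalize_profile([30.0, 10.0])
```

After padding, every block group has two blocks, which satisfies the stated requirement of at least two blocks for a meaningful comparison. Because the dummy block scales with its parent, the normalized profiles of the padded single-block groups are identical.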

