Characterization of human pseudogene-derived non-coding RNAs for functional potential.

Guo X, Lin M, Rockowitz S, Lachman HM, Zheng D - PLoS ONE (2014)

Bottom Line: Our analysis of the ENCODE project data also found many transcriptionally active pseudogenes in the GM12878 and K562 cell lines; moreover, it showed that many human pseudogenes produced small RNAs (sRNAs) and some pseudogene-derived sRNAs, especially those from antisense strands, exhibited evidence of interfering with gene expression. Further integrated analysis of transcriptomics and epigenomics data, however, demonstrated that trimethylation of histone 3 at lysine 9 (H3K9me3), a posttranslational modification typically associated with gene repression and heterochromatin, was enriched at many transcribed pseudogenes in a transcription-level-dependent manner in the two cell lines. The H3K9me3 enrichment was more prominent in pseudogenes that produced sRNAs at pseudogene loci and their adjacent regions, an observation further supported by the co-enrichment of SETDB1 (an H3K9 methyltransferase), suggesting that pseudogene sRNAs may have a role in regional chromatin repression.


Affiliation: The Saul R. Korey Department of Neurology, Albert Einstein College of Medicine, New York, New York, United States of America.

ABSTRACT
Thousands of pseudogenes exist in the human genome and many are transcribed, but their functional potential remains elusive and understudied. To explore these issues systematically, we first developed a computational pipeline to identify transcribed pseudogenes from RNA-Seq data. Applying the pipeline to datasets from 16 distinct normal human tissues identified ∼3,000 pseudogenes that could produce non-coding RNAs at low abundance but with high tissue specificity under normal physiological conditions. Cross-tissue comparison revealed that the transcriptional profiles of pseudogenes and their parent genes showed mostly positive correlations, suggesting that pseudogene transcription could have a positive effect on the expression of their parent genes, perhaps by functioning as competing endogenous RNAs (ceRNAs), as previously suggested and demonstrated with the PTEN pseudogene, PTENP1. Our analysis of the ENCODE project data also found many transcriptionally active pseudogenes in the GM12878 and K562 cell lines; moreover, it showed that many human pseudogenes produced small RNAs (sRNAs) and some pseudogene-derived sRNAs, especially those from antisense strands, exhibited evidence of interfering with gene expression. Further integrated analysis of transcriptomics and epigenomics data, however, demonstrated that trimethylation of histone 3 at lysine 9 (H3K9me3), a posttranslational modification typically associated with gene repression and heterochromatin, was enriched at many transcribed pseudogenes in a transcription-level-dependent manner in the two cell lines. The H3K9me3 enrichment was more prominent in pseudogenes that produced sRNAs at pseudogene loci and their adjacent regions, an observation further supported by the co-enrichment of SETDB1 (an H3K9 methyltransferase), suggesting that pseudogene sRNAs may have a role in regional chromatin repression. Taken together, our comprehensive and systematic characterization of pseudogene transcription uncovers a complex picture of how pseudogene ncRNAs could influence gene and pseudogene expression, at both epigenetic and post-transcriptional levels.


Figure 1 (pone-0093972-g001): Identification of transcribed pseudogenes from RNA-Seq data. A) A schematic illustration of the key concept of filtering out reads not uniquely matched to pseudogenes. Black and gray arrows represent perfectly matched and mismatched RNA-Seq reads, respectively, and the matched locations were kept. Yellow arrows represent a read initially put on a processed pseudogene but mapped back to the parent, based on aligning reads to coding sequences, because it is from an exon-exon junction. Green lines denote identical short sequences shared between gene and pseudogene. The left and right cartoons represent processed and duplicated pseudogenes, respectively. The bottom plot shows final read coverage on a pseudogene (red) and its parent (black), indicating that RNA-Seq signals have largely been resolved. B) Filtering effectively reduces the correlation between the number of mapped reads and sequence identity of a pseudogene to its parental gene. The number of mapped reads (y-axis) within every 200-bp region of a pseudogene is plotted against this region's sequence identity (x-axis) to the parental gene. Representative data for two tissues (brain and heart) are shown (top, before filtering; bottom, after filtering). C) Distributions of transcription values (i.e., FPKMs) of pseudogenes in all 16 tissues (the two vertical dashed lines mark 1 and 10 FPKM, respectively). D) Distributions of the maximal FPKMs for lincRNAs, pseudogenes, their parents, and the rest of coding genes.

Mentions: A major challenge in detecting transcribed pseudogenes is how to map RNA-Seq reads back to their genuine origins when both pseudogenes and their parents are candidates because of their high sequence similarity. The lack of introns may even make a processed pseudogene the preferred candidate for reads originating from exon-exon junctions of the parent. To address these issues directly, we have designed a new computational method to filter out RNA-Seq signals that are likely to have originated from coding genes but can be mapped to pseudogenes due to ambiguity (Fig. 1A, see Material and Methods). For example, exon-exon junction reads originating from parental genes were removed from pseudogene loci by our method even though their mapping to a processed pseudogene could have a greater alignment score. Without filtering, reads were overwhelmingly mapped to pseudogene regions with >80% sequence identity to their parents, and a positive correlation existed between the number of reads mapped to a pseudogene region and the parent-pseudogene sequence identity (Fig. 1B, top panels). This pattern disappeared after our filtering (Fig. 1B, bottom), indicating that the resulting RNA-Seq signals used for our subsequent pseudogene analyses, described below, were unlikely to be significantly affected by reads arising from parental genes. It also suggests that careful filtering of RNA-Seq reads by an extra step of read alignment to the human transcriptome (see Methods) is critical. This has not been explicitly considered in previous identification of transcribed pseudogenes, although in those studies investigators performed other downstream analyses to reduce the contribution of parental transcription signals to pseudogenes [11], [12]. In addition, the majority of currently annotated pseudogenes (87.3% out of a total of 11,205) share <90% sequence identity to their parents (Fig. S1), which would provide an average of ≥5 informative mismatching sites for distinguishing a true pseudogene read from a presumably parent-originating read, given that the length of our RNA-Seq reads is 50–75 bases. In summary, these results indicate that the RNA-Seq signals attributed to pseudogenes by our new computational method are reliable.
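The mismatch estimate in the paragraph above can be checked with a short calculation, and the filtering idea can be illustrated with a toy rule. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the names `expected_mismatches`, `ReadHit`, and `keep_for_pseudogene` are hypothetical, mismatches are assumed to be spread uniformly along the sequence, and the keep/discard rule (retain a read at a pseudogene only when it matches the pseudogene strictly better than the parent transcript) is a simplification of the filtering described in the text.

```python
from dataclasses import dataclass


def expected_mismatches(read_length: int, identity_percent: int) -> float:
    """Expected mismatching sites in one read, assuming mismatches between a
    pseudogene and its parent are spread uniformly along the sequence."""
    if not 0 <= identity_percent <= 100:
        raise ValueError("identity_percent must be between 0 and 100")
    return read_length * (100 - identity_percent) / 100


@dataclass
class ReadHit:
    """Hypothetical summary of one read aligned to both candidate origins."""
    read_id: str
    pseudogene_mismatches: int  # mismatches against the pseudogene locus
    parent_mismatches: int      # mismatches against the parent transcript


def keep_for_pseudogene(hit: ReadHit) -> bool:
    """Assumed rule: keep the read only if it fits the pseudogene strictly
    better than the parent; ties are treated as ambiguous and discarded."""
    return hit.pseudogene_mismatches < hit.parent_mismatches


if __name__ == "__main__":
    # Read lengths of 50 and 75 bases at the 90% identity boundary.
    for length in (50, 75):
        print(length, expected_mismatches(length, 90))

    hits = [
        ReadHit("unique_to_pseudogene", 0, 3),
        ReadHit("junction_read_from_parent", 0, 0),
        ReadHit("parent_read", 2, 0),
    ]
    print([h.read_id for h in hits if keep_for_pseudogene(h)])
```

At 90% identity a 50-base read is expected to span 5 mismatching sites and a 75-base read 7.5, so pseudogenes below 90% identity give at least about 5 informative sites per read, which agrees with the figure quoted above. Treating ties as ambiguous reflects the case of exon-exon junction reads, which can match a processed pseudogene as well as they match the spliced parent transcript.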


