Predicting promoter activities of primary human DNA sequences.

Irie T, Park SJ, Yamashita R, Seki M, Yada T, Sugano S, Nakai K, Suzuki Y - Nucleic Acids Res. (2011)

Bottom Line: We found that it is still difficult to predict transcript levels in a strictly quantitative manner in vivo; however, it was possible to select active promoters in a given cell from the other silent promoters. We demonstrate that many human genomic regions have potential promoter activity, and the expression of some previously uncharacterized putatively non-protein-coding transcripts can be explained by our prediction model. Furthermore, we found that nucleosomes occasionally formed open chromatin structures with RNA polymerase II recruitment where the program predicted significant promoter activities, although no transcripts were observed.


Affiliation: Department of Medical Genome Sciences, Graduate School of Frontier Sciences, the University of Tokyo, 5-1-5 Kashiwanoha, Kashiwashi, Chiba 277-8562, Japan.

ABSTRACT
We developed a computer program that can predict the intrinsic promoter activities of primary human DNA sequences. We observed promoter activity using a quantitative luciferase assay and generated a prediction model using multiple linear regression. Our program achieved a correlation coefficient of 0.87 between the predicted and observed promoter activities. We evaluated the prediction accuracy of the program using massive sequencing analysis of transcriptional start sites in vivo. We found that it is still difficult to predict transcript levels in a strictly quantitative manner in vivo; however, it was possible to select active promoters in a given cell from the other silent promoters. Using this program, we analyzed the transcriptional landscape of the entire human genome. We demonstrate that many human genomic regions have potential promoter activity, and the expression of some previously uncharacterized putatively non-protein-coding transcripts can be explained by our prediction model. Furthermore, we found that nucleosomes occasionally formed open chromatin structures with RNA polymerase II recruitment where the program predicted significant promoter activities, although no transcripts were observed.



Figure 4: Validation of the prediction model in vivo using TSS-Seq and pol II ChIP-Seq data. (A) Distribution of the 5′-end regions of RefSeq genes with observed TSS tag counts (indicated by the y-axis; left side) and prediction scores (indicated by the x-axis). Red, blue and green bars represent populations indicated in the inset. Red, blue and green lines represent the frequencies of the promoters with pol II binding, as detected by ChIP-Seq (indicated by the y-axis; right side). (B) and (C) are examples of digital TSS tag counts (red bars), pol II binding (green bars) and predicted promoter activities (blue bars) in the RefSeq regions. (B) Exemplifies a case in which all three types of data concordantly indicate the active transcription of the gene. (C) Exemplifies a case in which our model predicted significant promoter activity, although no TSS tags were identified from the corresponding genomic region. The pale-blue line indicates a prediction score of 1.

Mentions: To evaluate the prediction model in a qualitative manner, we examined whether the RefSeq genes with significant expression in HEK293 cells could be separated from silent RefSeq genes. We used a threshold of 5 parts per million (ppm), which is roughly estimated to be five copies of the transcript per cell, assuming that every cell has 1 million mRNA transcripts (28). We determined that promoters with >5 ppm TSS tags should have clearly detectable transcript levels. (A total of 5622 RefSeq promoters had 0 < TSS < 5 ppm tag counts and were excluded from this analysis; the results of a similar analysis using different TSS-Seq tag levels are shown in Supplementary Figure S6.) We compared the distributions of the predicted promoter activities of the RefSeq 5′-end regions with >5 ppm TSS-Seq tags with those of RefSeq 5′-end regions without TSS tags and of randomly selected genomic regions (Figure 4A). We found clear differences between the distributions (P < 1 × 10^−100; Wilcoxon rank test). Of 18 686 RefSeq genes, 4749 (25%) had >5 ppm TSS tags in HEK293 cells. Of these, 3922 (83%) had prediction scores >1. The precision and recall of the model in predicting TSS tags at this cut-off were (Precision, Recall) = (0.52, 0.83). When promoters with >1 ppm TSS tags were also counted as expressed, (Precision, Recall) became (0.63, 0.83). (Precision and recall at other cut-offs are summarized in Supplementary Figure S6.)
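The arithmetic behind these figures can be checked with a short Python sketch. It uses only the counts quoted in the paragraph above; the function names are hypothetical and chosen for illustration. The number of promoters scoring >1 is not stated in the text, so the value computed at the end is only what the reported precision of 0.52 would imply, not a figure from the paper.

```python
# Counts quoted in the text (HEK293 cells, RefSeq 5'-end regions).
TOTAL_REFSEQ = 18_686          # all RefSeq genes considered
EXPRESSED = 4_749              # promoters with >5 ppm TSS tags
EXPRESSED_AND_PREDICTED = 3_922  # of those, prediction score > 1
REPORTED_PRECISION = 0.52      # precision reported at this cut-off


def copies_per_cell(ppm, transcripts_per_cell=1_000_000):
    """Convert a ppm tag count to approximate transcript copies per cell.

    Assumes, as the text does, about 1 million mRNA transcripts per cell.
    """
    return ppm * transcripts_per_cell / 1_000_000


def recall(true_positives, actual_positives):
    """Fraction of expressed promoters that the model scored above the cut-off."""
    return true_positives / actual_positives


def implied_predicted_positives(true_positives, precision):
    """Predicted-positive count implied by a precision value.

    This is a back-calculation (precision = TP / predicted positives),
    not a number reported in the source.
    """
    return true_positives / precision


expressed_fraction = EXPRESSED / TOTAL_REFSEQ
model_recall = recall(EXPRESSED_AND_PREDICTED, EXPRESSED)
implied_calls = implied_predicted_positives(EXPRESSED_AND_PREDICTED, REPORTED_PRECISION)

print(f"5 ppm corresponds to about {copies_per_cell(5):.0f} copies per cell")
print(f"expressed fraction: {expressed_fraction:.3f}")
print(f"recall: {model_recall:.3f}")
print(f"implied promoters with score > 1: about {implied_calls:.0f}")
```

The recall and the expressed fraction follow directly from the quoted counts, which is a quick consistency check on the percentages given in the text.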

