Predicting promoter activities of primary human DNA sequences.

Irie T, Park SJ, Yamashita R, Seki M, Yada T, Sugano S, Nakai K, Suzuki Y - Nucleic Acids Res. (2011)

Bottom Line: We found that it is still difficult to predict transcript levels in a strictly quantitative manner in vivo; however, it was possible to select active promoters in a given cell from the other silent promoters. We demonstrate that many human genomic regions have potential promoter activity, and the expression of some previously uncharacterized putatively non-protein-coding transcripts can be explained by our prediction model. Furthermore, we found that nucleosomes occasionally formed open chromatin structures with RNA polymerase II recruitment where the program predicted significant promoter activities, although no transcripts were observed.

Affiliation: Department of Medical Genome Sciences, Graduate School of Frontier Sciences, the University of Tokyo, 5-1-5 Kashiwanoha, Kashiwashi, Chiba 277-8562, Japan.

ABSTRACT
We developed a computer program that can predict the intrinsic promoter activities of primary human DNA sequences. We observed promoter activity using a quantitative luciferase assay and generated a prediction model using multiple linear regression. Our program achieved a prediction accuracy correlation coefficient of 0.87 between the predicted and observed promoter activities. We evaluated the prediction accuracy of the program using massive sequencing analysis of transcriptional start sites in vivo. We found that it is still difficult to predict transcript levels in a strictly quantitative manner in vivo; however, it was possible to select active promoters in a given cell from the other silent promoters. Using this program, we analyzed the transcriptional landscape of the entire human genome. We demonstrate that many human genomic regions have potential promoter activity, and the expression of some previously uncharacterized putatively non-protein-coding transcripts can be explained by our prediction model. Furthermore, we found that nucleosomes occasionally formed open chromatin structures with RNA polymerase II recruitment where the program predicted significant promoter activities, although no transcripts were observed.

Figure 3: Validation of the prediction model. (A) Results of the 10-fold cross-validation test of the model. Distribution of Pearson's correlation between predicted promoter activities and observed luciferase activities is shown. (B) Validation of the TFBSs using promoter disruptant mutants. In each panel, luciferase activities of the original promoter clones (white) and the disruptant mutant clones (gray) are shown. The three upper panels represent the cases in which all promoter clones confirmed the contribution of the predicted TFBS. The three lower panels represent the cases in which predicted changes in luciferase activity were not observed for the examined promoter clones (indicated by the asterisks). In the upper margin in each panel, the ID of the TF corresponding to the examined TFBS is shown. Error bars represent standard deviations calculated from triplicate experiments.

Mentions: To evaluate the constructed prediction model and assess the risk of over-fitting the regression model, we used a 10-fold cross-validation method. We randomly selected 90% of the luciferase data for the training data set and used the remainder as the test data set. We repeated this test 1000 times and calculated the correlation coefficient for each repetition (Figure 3A). Even in this open test, the average Pearson's correlation coefficient was 0.83 when the fine-tuned prediction model was used. These results indicate that the constructed prediction model can be used to predict promoter activities of unknown DNA sequences, and that over-fitting effects from an excess number of parameters were relatively small in the fine-tuned model.
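
The validation procedure described above (random 90%/10% splits, repeated many times, with a Pearson's correlation computed on each held-out set) can be sketched roughly as follows. This is a minimal illustration and not the authors' program: the function names, the synthetic data, and the ordinary least-squares fit via normal equations are all assumptions made for the example, and the real model was fitted to luciferase measurements with sequence-derived features that are not reproduced here.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_regression(features, targets):
    """Multiple linear regression (with intercept) via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, features):
    """Apply fitted coefficients; coefs[0] is the intercept."""
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], f)) for f in features]


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def repeated_holdout(features, targets, n_repeats=1000, train_fraction=0.9, seed=0):
    """Repeat a random train/test split and return one correlation per repeat."""
    rng = random.Random(seed)
    indices = list(range(len(targets)))
    n_train = int(round(train_fraction * len(indices)))
    correlations = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        coefs = fit_linear_regression(
            [features[i] for i in train], [targets[i] for i in train]
        )
        predicted = predict(coefs, [features[i] for i in test])
        observed = [targets[i] for i in test]
        correlations.append(pearson(predicted, observed))
    return correlations


def make_synthetic_data(n_samples=200, seed=1):
    """Hypothetical stand-in data: three features, linear signal plus noise."""
    rng = random.Random(seed)
    true_coefs = [0.5, 1.5, -2.0, 0.8]
    features = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n_samples)]
    targets = [
        true_coefs[0]
        + sum(c * v for c, v in zip(true_coefs[1:], f))
        + rng.gauss(0, 1.0)
        for f in features
    ]
    return features, targets


if __name__ == "__main__":
    feats, targs = make_synthetic_data()
    cors = repeated_holdout(feats, targs, n_repeats=1000)
    print("mean held-out Pearson r:", sum(cors) / len(cors))
```

The distribution of the returned correlations is the kind of quantity summarized in Figure 3A. A mean held-out correlation close to the correlation obtained on the full training data would suggest, as the text argues, that over-fitting is limited.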

