Ab initio identification of transcription start sites in the Rhesus macaque genome by histone modification and RNA-Seq.

Liu Y, Han D, Han Y, Yan Z, Xie B, Li J, Qiao N, Hu H, Khaitovich P, Gao Y, Han JD - Nucleic Acids Res. (2010)

Bottom Line: These provide an important rich resource for close examination of the species-specific transcript structures and transcription regulations in the Rhesus macaque genome. Our approach exemplifies a relatively inexpensive way to generate a reasonably reliable TSS map for a large genome. It may serve as a guiding example for similar genome annotation efforts targeted at other model organisms.


Affiliation: Chinese Academy of Sciences Key Laboratory of Molecular Developmental Biology, Center for Molecular Systems Biology, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Lincui East Road, Beijing 100101, China.

ABSTRACT
Rhesus macaque is a widely used primate model organism. Its genome annotations are, however, still largely comparative computational predictions derived mainly from human genes, which precludes studies of macaque-specific genes, gene isoforms or their regulation. Here we took advantage of the ability of histone H3 lysine 4 trimethylation (H3K4me3) to mark transcription start sites (TSSs) and of the recently developed ChIP-Seq and RNA-Seq technologies to survey the transcript structures. We generated 14,013,757 sequence tags by H3K4me3 ChIP-Seq and obtained 17,322,358 paired-end reads for mRNA and 10,698,419 short reads for sRNA from the macaque brain. By integrating these data with genomic sequence features and extending and improving a state-of-the-art TSS prediction algorithm, we ab initio predicted and verified 17,933 previously electronically annotated TSSs at 500-bp resolution. We also predicted approximately 10,000 novel TSSs. These provide an important rich resource for close examination of the species-specific transcript structures and transcription regulation in the Rhesus macaque genome. Our approach exemplifies a relatively inexpensive way to generate a reasonably reliable TSS map for a large genome. It may serve as a guiding example for similar genome annotation efforts targeted at other model organisms.



Figure 2: Evaluating the performance of the TSS classifier. (A) and (B) Correlation of the CpG (A) or non-CpG (B) TSS log-odds scores with the probability or percentage of predicted TSSs that have electronically annotated TSSs, RNA-Seq signals, or either of the two within 500 bp. ‘Low’, ‘medium’ and ‘high’ denote the sets of TSSs whose probability of being a true TSS (implied by the log-odds scores) is >50%, >95% and >99%, respectively, while ‘minus’ denotes predictions with negative log-odds scores. (C) ROCs obtained with different negative training samples for predicting CpG TSSs. ‘Random’ and ‘Flanking’ mean that the negative training samples were randomly selected from the whole genome or from the flanking regions of known TSSs, respectively. ‘Flanking + Random’ means that the two sets were combined as negative training examples. Each point in an ROC shows the percentage of TSS predictions that are supported by electronic TSS annotations within a certain distance. Because the training strategies yield different numbers of predictions, only the top 10 000 predictions with the largest GentleBoost scores in each training scenario are compared. (D) ROCs obtained with different negative training examples for predicting non-CpG TSSs. The format of the graph is the same as in C. (E) and (F) ROCs obtained with cosine similarity compared with PCC for predicting CpG and non-CpG TSSs, respectively. The format of the graphs is the same as in C. All TSS predictions in C–D are based on mRNA-seq validation (see section ‘TSS validation and refinement’); those based on sRNA-seq are shown in Supplementary Figure S4. The TSS predictions in E and F are based on either mRNA-seq or sRNA-seq validation.
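
The relation between a log-odds score and the ‘minus’, ‘low’, ‘medium’ and ‘high’ labels in the caption can be illustrated with a minimal Python sketch. It assumes that the scores are natural-log odds, so that a score of 0 corresponds to a probability of 50%; the source does not state the base of the logarithm. The function names are hypothetical, and the sketch assigns each score to its most stringent label, whereas the caption may intend nested sets.

```python
import math


def log_odds_to_probability(score):
    """Convert a natural-log odds score to a probability (logistic function).

    Assumption: the score is ln(p / (1 - p)); the paper does not state the base.
    The two branches avoid overflow for scores of large magnitude.
    """
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def confidence_tier(score):
    """Return the most stringent caption label that a score satisfies.

    Thresholds follow the caption: >99% 'high', >95% 'medium', >50% 'low'.
    Anything else (including a score of exactly 0) is reported as 'minus'.
    """
    p = log_odds_to_probability(score)
    if p > 0.99:
        return "high"
    if p > 0.95:
        return "medium"
    if p > 0.50:
        return "low"
    return "minus"


if __name__ == "__main__":
    for s in (-1.0, 1.0, 3.0, 5.0):
        print(s, round(log_odds_to_probability(s), 4), confidence_tier(s))
```

Under this assumption, the 95% and 99% cut-offs correspond to log-odds of ln(19) ≈ 2.94 and ln(99) ≈ 4.60, respectively.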

Mentions: We find a strong correlation [Pearson correlation coefficient (PCC) = 0.879 across 100 tiles of prediction scores] between the prediction scores and the percentages of predicted TSSs that have nearby electronically annotated TSSs, suggesting that the higher the prediction score, the more likely the TSS is to be validated by homology-based electronic gene annotation transfer (Figure 2A and B).
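
A tile-based correlation of this kind could be computed roughly as in the Python sketch below. It is an illustration under stated assumptions, not the authors' code: "100 tiles" is interpreted here as 100 equal-count bins of predictions ranked by score, each tile is summarized by its mean score, and the names `pearson` and `tile_correlation` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def tile_correlation(scores, supported, n_tiles=100):
    """Correlate per-tile mean score with per-tile percentage of supported TSSs.

    scores    -- prediction score of each predicted TSS
    supported -- True if an annotated TSS lies nearby (e.g. within 500 bp)
    Assumption: tiles are equal-count bins of the score-ranked predictions.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    tile_scores, tile_support = [], []
    for t in range(n_tiles):
        idx = order[t * n // n_tiles:(t + 1) * n // n_tiles]
        if not idx:  # fewer predictions than tiles: skip empty bins
            continue
        tile_scores.append(sum(scores[i] for i in idx) / len(idx))
        tile_support.append(100.0 * sum(1 for i in idx if supported[i]) / len(idx))
    return pearson(tile_scores, tile_support)


if __name__ == "__main__":
    # Toy data only: support becomes more frequent as the score increases.
    toy_scores = [float(i) for i in range(1000)]
    toy_supported = [(i % 10) < (i // 100) for i in range(1000)]
    print(round(tile_correlation(toy_scores, toy_supported), 3))
```

Binning before correlating smooths the binary supported/unsupported outcome into a percentage per tile, which is why the reported coefficient describes a trend across score ranges rather than agreement for individual predictions.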

