CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts.

Testa AC, Hane JK, Ellwood SR, Oliver RP - BMC Genomics (2015)

Bottom Line: Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested.


Affiliation: Centre for Crop and Disease Management, Department of Environment and Agriculture, School of Science, Curtin University, Bentley, WA, 6102, Australia. 13392554@student.curtin.edu.au.

ABSTRACT

Background: The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study.

Results: CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes.

Conclusions: We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high-quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available (https://sourceforge.net/projects/codingquarry/) and suitable for incorporation into genome annotation pipelines.


Fig2: Changes in CodingQuarry prediction accuracy at various stages of prediction of Sc. pombe genes. The gene-level sensitivity and specificity are shown at various stages (see Figure 1 and Methods) within a CodingQuarry run. Results show comparisons with Sc. pombe where A) (left-hand panel) RNA-seq data strand information was used and B) (right-hand panel) strand information was ignored. "Longest ORF" is the initial training set, found by taking the longest open reading frame in each transcript to be a gene. Stage 1 predictions are made from transcript sequences. Stage 2 adds to, and replaces some of, the stage 1 predictions by predicting from genome sequence. Filtering of likely false-positive genes (see Implementation section) takes place before a set of predicted genes is output as the “final output”. This output is the annotation generated by CodingQuarry.
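
The "longest ORF" training set described in the caption can be illustrated with a minimal Python sketch. This is not CodingQuarry's code: the function name `longest_orf` is hypothetical, and the sketch assumes an ORF runs from an ATG to the next in-frame stop codon of the standard genetic code, searched on the forward strand of the transcript only. The actual tool may define and search ORFs differently.

```python
# Hypothetical illustration only; not taken from the CodingQuarry source.
# Assumption: standard genetic code stop codons, forward strand only.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF, or None if there is none.

    Coordinates are 0-based and end-exclusive, and the stop codon is included.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best
```

Taking one ORF per transcript in this way cannot recover more than one gene from an incorrectly merged transcript, which is consistent with the low starting accuracy reported for this step.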

Mentions: As explained in the methods section, the predictions made by CodingQuarry are a combination of predictions from transcript sequences (stage 1) and predictions made from genome sequence (stage 2). A filtering step then removes genes likely to be false-positive predictions. The gene-level sensitivity and specificity of CodingQuarry after each of these stages, when compared to full Sc. pombe datasets, are displayed in Figure 2A. Figure 2A shows that the initial step of creating a training set using the longest ORF in each transcript has low values of sensitivity and specificity. An improvement of ~8% in gene-level sensitivity and ~6% in specificity is made in stage 1, where these annotations are replaced by GHMM-predicted genes. Part of the reason for this is that during stage 1, multiple gene predictions are allowed within a single transcript, so that a large number of genes residing in incorrectly “merged” transcripts can still be predicted. The second prediction stage again results in a jump in prediction accuracy, this time improving the gene-level sensitivity by ~8% and specificity by ~2%. This is due to the addition of genes predicted ab initio in regions without RNA-seq transcript coverage and the prediction of genes in regions where the transcript assembly is incomplete. Single-exon genes are also re-predicted in stage 2. The final filtering step gives the final output CodingQuarry prediction. This step serves to improve specificity via the removal of false-positive genes, and therefore had little effect on the gene-level sensitivity (Figure 2A).
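
The gene-level sensitivity and specificity discussed above can be sketched in Python as follows. The text here does not define the two measures, so this sketch assumes the usual gene-finding convention: sensitivity is the fraction of reference genes predicted exactly, and specificity is the fraction of predicted genes that exactly match a reference gene. The function name `gene_level_accuracy` and the representation of a gene as a hashable tuple (sequence ID, strand, coding intervals) are hypothetical choices made for illustration.

```python
# Hypothetical illustration only; definitions assumed, not quoted from the paper.
def gene_level_accuracy(reference, predicted):
    """Return (sensitivity, specificity) at the gene level.

    Each gene is a hashable tuple such as
    (seqid, strand, ((cds_start, cds_end), ...)).
    A predicted gene counts as correct only if it is identical to a
    reference gene, i.e. every coding interval matches exactly.
    """
    ref = set(reference)
    pred = set(predicted)
    correct = len(ref & pred)
    sensitivity = correct / len(ref) if ref else 0.0
    specificity = correct / len(pred) if pred else 0.0
    return sensitivity, specificity


# Toy data: four reference genes; five predictions, three of them exact.
reference = [
    ("chr1", "+", ((100, 400),)),
    ("chr1", "+", ((900, 1200), (1300, 1500))),
    ("chr1", "-", ((2000, 2600),)),
    ("chr2", "+", ((50, 350),)),
]
predicted = [
    ("chr1", "+", ((100, 400),)),
    ("chr1", "+", ((900, 1200), (1300, 1500))),
    ("chr1", "-", ((2000, 2600),)),
    ("chr2", "+", ((50, 340),)),    # one boundary off, so not counted
    ("chr2", "-", ((700, 1000),)),  # no matching reference gene
]
sensitivity, specificity = gene_level_accuracy(reference, predicted)
```

Under these assumed definitions, removing false-positive predictions shrinks only the denominator of specificity, which matches the observation that the filtering step changes specificity while leaving sensitivity nearly unchanged.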

