Limits...
CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts.

Testa AC, Hane JK, Ellwood SR, Oliver RP - BMC Genomics (2015)

Bottom Line: In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions.Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation.These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested.

View Article: PubMed Central - PubMed

Affiliation: Centre for Crop and Disease Management, Department of Environment and Agriculture, School of Science, Curtin University, Bentley, WA, 6102, Australia. 13392554@student.curtin.edu.au.

ABSTRACT

Background: The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study.

Results: CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes.

Conclusions: We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available ( https://sourceforge.net/projects/codingquarry/ ), and suitable for incorporation into genome annotation pipelines.

Show MeSH
CodingQuarry flow diagram. Examples are shown of correct annotations of coding sequences, (A) and a typical CodingQuarry input; assembled transcripts aligned to the genome (B). The stages used within CodingQuarry to predict coding sequences are shown (C-G). Firstly, coding sequences are predicted from transcript sequences (introns are removed) using a GHMM (C). Possible prediction errors after this step are coloured red, and notes show how these are identified (D). These error prone predicted genes are discarded (E), and regions are selected for prediction from genome sequence (F). The resulting prediction is output by CodingQuarry (G), which merges the retained predictions from transcript sequences (E) with the predictions from selected areas of the genome sequence (F). Sections of the example genome sequence and annotations have been labelled i-x in each part of the diagram (A-G), and marked with vertical dotted lines. These sections are labelled to facilitate in-text references to the diagram in the Implementation section of this manuscript. Labels i-x correspond to the same genome sections through A-G.
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC4363200&req=5

Fig1: CodingQuarry flow diagram. Examples are shown of correct annotations of coding sequences, (A) and a typical CodingQuarry input; assembled transcripts aligned to the genome (B). The stages used within CodingQuarry to predict coding sequences are shown (C-G). Firstly, coding sequences are predicted from transcript sequences (introns are removed) using a GHMM (C). Possible prediction errors after this step are coloured red, and notes show how these are identified (D). These error prone predicted genes are discarded (E), and regions are selected for prediction from genome sequence (F). The resulting prediction is output by CodingQuarry (G), which merges the retained predictions from transcript sequences (E) with the predictions from selected areas of the genome sequence (F). Sections of the example genome sequence and annotations have been labelled i-x in each part of the diagram (A-G), and marked with vertical dotted lines. These sections are labelled to facilitate in-text references to the diagram in the Implementation section of this manuscript. Labels i-x correspond to the same genome sections through A-G.

Mentions: The GHMM used in stage 1 uses fixed length states to describe the gene start and Kozak sequence [50] and gene stop codon, and variable length states to describe gene coding sequences, UTRs, and non-coding transcripts. To address the issue of merged transcripts, this model allows a single transcript sequence to contain multiple genes, via the creation of a “middle UTR” state. Where UTRs of adjacent transcripts overlap in terms of their relative corresponding positions on the genomic DNA, a single transcript sequence as derived from RNA-seq can contain multiple gene loci. A pictorial example of this is shown in Figure 1, section Bi, in which the middle UTR state is used to allow the correct prediction of two genes on the same strand within a merged transcript sequence. In the case of unstranded RNA-seq, prediction errors arising from transcript sequences merged due to S-AS UTR overlap are corrected in stage 2.Figure 1


CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts.

Testa AC, Hane JK, Ellwood SR, Oliver RP - BMC Genomics (2015)

CodingQuarry flow diagram. Examples are shown of correct annotations of coding sequences, (A) and a typical CodingQuarry input; assembled transcripts aligned to the genome (B). The stages used within CodingQuarry to predict coding sequences are shown (C-G). Firstly, coding sequences are predicted from transcript sequences (introns are removed) using a GHMM (C). Possible prediction errors after this step are coloured red, and notes show how these are identified (D). These error prone predicted genes are discarded (E), and regions are selected for prediction from genome sequence (F). The resulting prediction is output by CodingQuarry (G), which merges the retained predictions from transcript sequences (E) with the predictions from selected areas of the genome sequence (F). Sections of the example genome sequence and annotations have been labelled i-x in each part of the diagram (A-G), and marked with vertical dotted lines. These sections are labelled to facilitate in-text references to the diagram in the Implementation section of this manuscript. Labels i-x correspond to the same genome sections through A-G.
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC4363200&req=5

Fig1: CodingQuarry flow diagram. Examples are shown of correct annotations of coding sequences, (A) and a typical CodingQuarry input; assembled transcripts aligned to the genome (B). The stages used within CodingQuarry to predict coding sequences are shown (C-G). Firstly, coding sequences are predicted from transcript sequences (introns are removed) using a GHMM (C). Possible prediction errors after this step are coloured red, and notes show how these are identified (D). These error prone predicted genes are discarded (E), and regions are selected for prediction from genome sequence (F). The resulting prediction is output by CodingQuarry (G), which merges the retained predictions from transcript sequences (E) with the predictions from selected areas of the genome sequence (F). Sections of the example genome sequence and annotations have been labelled i-x in each part of the diagram (A-G), and marked with vertical dotted lines. These sections are labelled to facilitate in-text references to the diagram in the Implementation section of this manuscript. Labels i-x correspond to the same genome sections through A-G.
Mentions: The GHMM used in stage 1 uses fixed length states to describe the gene start and Kozak sequence [50] and gene stop codon, and variable length states to describe gene coding sequences, UTRs, and non-coding transcripts. To address the issue of merged transcripts, this model allows a single transcript sequence to contain multiple genes, via the creation of a “middle UTR” state. Where UTRs of adjacent transcripts overlap in terms of their relative corresponding positions on the genomic DNA, a single transcript sequence as derived from RNA-seq can contain multiple gene loci. A pictorial example of this is shown in Figure 1, section Bi, in which the middle UTR state is used to allow the correct prediction of two genes on the same strand within a merged transcript sequence. In the case of unstranded RNA-seq, prediction errors arising from transcript sequences merged due to S-AS UTR overlap are corrected in stage 2.Figure 1

Bottom Line: In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions.Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation.These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested.

View Article: PubMed Central - PubMed

Affiliation: Centre for Crop and Disease Management, Department of Environment and Agriculture, School of Science, Curtin University, Bentley, WA, 6102, Australia. 13392554@student.curtin.edu.au.

ABSTRACT

Background: The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study.

Results: CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes.

Conclusions: We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available ( https://sourceforge.net/projects/codingquarry/ ), and suitable for incorporation into genome annotation pipelines.

Show MeSH