Extensive error in the number of genes inferred from draft genome assemblies.

Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, Hahn MW - PLoS Comput. Biol. (2014)

Bottom Line: These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.

Affiliation: School of Informatics and Computing, Indiana University, Bloomington, Indiana.

ABSTRACT
Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.
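The family-level comparison described in the abstract can be sketched as a simple tally. This is only an illustration, not the authors' pipeline: the function name `family_count_errors` and the family names and counts in the usage example are hypothetical, and the sketch assumes each assembly has already been reduced to a mapping from gene family to number of annotated genes.

```python
def family_count_errors(reference, draft):
    """Compare per-family gene counts between a reference and a draft annotation.

    reference, draft: dicts mapping a gene family name to the number of genes
    annotated in that family. A family missing from one dict counts as 0 there.
    Returns the number of families with too many genes in the draft, the number
    with too few, and the fraction of all families with a wrong count.
    """
    families = set(reference) | set(draft)
    over = sum(1 for f in families if draft.get(f, 0) > reference.get(f, 0))
    under = sum(1 for f in families if draft.get(f, 0) < reference.get(f, 0))
    fraction_wrong = (over + under) / len(families) if families else 0.0
    return {"over": over, "under": under, "fraction_wrong": fraction_wrong}


# Hypothetical counts, for illustration only.
reference_counts = {"famA": 3, "famB": 1, "famC": 5, "famD": 2, "famE": 1}
draft_counts = {"famA": 4, "famB": 1, "famC": 3, "famD": 2, "famE": 1}
print(family_count_errors(reference_counts, draft_counts))
```

Counting "over" and "under" separately reflects the observation that draft assemblies both add and subtract genes, so a net gene total can hide errors that cancel out across families.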

Figure 5 (pcbi-1003998-g005): Number of predicted exons per gene decreases with increased genome fragmentation. A comparison of the number of predicted exons per gene in the uncut D. melanogaster reference genome to the fragmented version of this genome that contains 17,941 contigs (the right-most point in Fig. 4). Gene models were predicted using GENSCAN.

Mentions: When we examined specific genes in our prediction sets, we often found them to be cleaved, sometimes into multiple pieces, with single exons or groups of exons isolated on individual contigs. Gene prediction software will often call these exons as genes, and the process of gene prediction in these highly fragmented genomes has essentially become one of exon prediction. Zhang et al. [40] found similar instances of spurious gene calls from cleaved or isolated exons when looking at the draft rhesus macaque assembly and annotation (see [58] for examples from the pig genome). Although these random cleavages of the Drosophila genome may not be a perfect approximation of fragmentation in real assemblies, increasing fragmentation causes the number of exons per gene in the predicted sets to decline. Comparing the number of exons per gene in the simulated dataset to the number in the reference D. melanogaster genome, we see a large enrichment for single-exon genes and a general decline in the average number of exons (Fig. 5). Because this assembly is so highly fragmented, almost none of the genes with over a dozen exons have remained full-length, and the longest genes have often been cleaved into more than two predicted genes.
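The effect described above can be reproduced on a toy genome. The following is a minimal sketch under stated assumptions, not the simulation used in the paper: the helper names (`toy_genes`, `fragment_genes`, `exon_count_histogram`), the gene layout, and all sizes are hypothetical, no gene predictor such as GENSCAN is run, and an exon that spans a cut is simply assigned to the contig holding its start coordinate. Each group of exons from one gene that lands on one contig is treated as one "predicted gene".

```python
import bisect
import random
from collections import Counter


def toy_genes(n_genes=200, exons_per_gene=5, exon_len=200,
              intron_len=1000, spacing=5000):
    """Lay out evenly spaced toy genes; return (genes, genome_length)."""
    genes, pos = [], 0
    for _ in range(n_genes):
        exons = []
        for _ in range(exons_per_gene):
            exons.append((pos, pos + exon_len))
            pos += exon_len + intron_len
        genes.append(exons)
        pos += spacing
    return genes, pos


def fragment_genes(genes, genome_length, n_cuts, seed=0):
    """Cut the genome at n_cuts random positions and split exons by contig.

    Returns a list of fragments; each fragment is the list of exons from a
    single gene that fall on a single contig.
    """
    rng = random.Random(seed)
    cuts = sorted(rng.sample(range(1, genome_length), n_cuts))
    fragments = []
    for exons in genes:
        by_contig = {}
        for start, end in exons:
            # Contig index = number of cut positions at or before the exon start.
            contig = bisect.bisect_right(cuts, start)
            by_contig.setdefault(contig, []).append((start, end))
        fragments.extend(by_contig.values())
    return fragments


def exon_count_histogram(fragments):
    """Map exons-per-fragment to the number of fragments with that count."""
    return Counter(len(f) for f in fragments)


genes, genome_length = toy_genes()
for n_cuts in (0, 500, 2000):
    frags = fragment_genes(genes, genome_length, n_cuts, seed=1)
    print(n_cuts, len(frags), sorted(exon_count_histogram(frags).items()))
```

With no cuts, every gene stays whole. As the number of cuts grows, the fragment count rises above the true gene count and the histogram shifts toward single-exon fragments, while the total number of exons stays the same.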

