Extensive error in the number of genes inferred from draft genome assemblies.

Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, Hahn MW - PLoS Comput. Biol. (2014)

Bottom Line: These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.


Affiliation: School of Informatics and Computing, Indiana University, Bloomington, Indiana.

ABSTRACT
Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.
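
The family-level comparison described in the abstract can be illustrated with a minimal sketch. It assumes that each annotation has already been reduced to a mapping from gene family ID to gene count; the function name `family_count_errors`, the family IDs, and the toy counts are hypothetical and are not taken from the paper.

```python
from collections import Counter


def family_count_errors(reference, draft):
    """Compare per-family gene counts between two annotations.

    Both arguments map a gene family ID to the number of genes annotated
    in that family. The reference is treated as the true state.
    """
    families = set(reference) | set(draft)
    outcome = Counter()
    for fam in families:
        diff = draft.get(fam, 0) - reference.get(fam, 0)
        if diff > 0:
            outcome["overestimated"] += 1   # draft adds genes
        elif diff < 0:
            outcome["underestimated"] += 1  # draft loses genes
        else:
            outcome["correct"] += 1
    wrong = outcome["overestimated"] + outcome["underestimated"]
    fraction_wrong = wrong / len(families) if families else 0.0
    return outcome, fraction_wrong


# Hypothetical toy data, for illustration only.
reference = {"fam1": 3, "fam2": 1, "fam3": 2, "fam4": 1, "fam5": 4}
draft = {"fam1": 3, "fam2": 2, "fam3": 1, "fam4": 1, "fam5": 4}

outcome, fraction_wrong = family_count_errors(reference, draft)
print(dict(outcome), fraction_wrong)
```

Counting over- and under-estimated families separately matters because the two kinds of error can cancel in the total gene count while still leaving many individual families wrong.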


Figure 1. Examples of misassembly leading to misannotation. Each row shows the true state of the genome on the left (“Expected assembly”) and a common misassembly error on the right (“Observed misassembly”). A) A single gene may be assembled as two apparently paralogous loci, increasing the predicted gene count. B) A single gene may be fragmented into multiple pieces, each on different contigs or scaffolds. This cleavage can increase the number of predicted genes. C) Two paralogous genes may be collapsed into a single gene, decreasing the predicted gene count. D) A gene may be partially or entirely missing from the assembly, decreasing the number of predicted genes.

Mentions: Low-quality assemblies result in low-quality annotations [18], [27], and these annotation errors cause both the over- and under-estimation of gene numbers, e.g. [32], [33]. One cause of the over-estimation of gene numbers is the splitting of allelic variation (i.e. haplotypes present in heterozygous individuals) into separate loci (Fig. 1A); we refer to such cases as “split” genes. Split genes appear as highly similar duplicated loci within genome assemblies, and are often placed in tandem to one another or with one copy on a small scaffold by itself, e.g. [34], [35]. A second cause of the over-estimation of gene numbers is the fragmentation of a single gene onto multiple contigs or scaffolds (Fig. 1B); we refer to such cases as “cleaved” genes. Because ab initio gene predictors are less likely to accurately infer gene models across sequence gaps, genes fragmented onto multiple contigs or scaffolds will be predicted as multiple separate genes, e.g. [30]. Note that gene models may also be cleaved simply because ab initio predictors have failed to join distant exons together in a single transcript, e.g. [36], [37], though this type of error may be independent of the underlying assembly quality. A common cause of the under-estimation of gene number is the collapse of truly paralogous gene copies into a single locus (Fig. 1C). This occurs because newly formed duplicates are highly similar in sequence, and therefore hard to assemble as separate loci, e.g. [30], [38]. A second cause of under-estimation is simply that genes may not be represented in low-coverage genomes due to a large number of gaps (Fig. 1D). In such cases both total gene numbers and the size of individual gene families may be severely underestimated, e.g. [39].
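
The two over-estimation errors leave different signatures, which a rough heuristic can make concrete. The sketch below is an illustration under stated assumptions, not the authors' method: it assumes each predicted gene model has already been aligned to its best-matching protein in some reference set, and it flags pairs of models that hit the same reference protein. The `Hit` record, the `flag_pairs` function, and both thresholds are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Hit(NamedTuple):
    gene: str    # predicted gene model in the draft annotation
    contig: str  # contig or scaffold carrying the gene model
    ref: str     # best-matching protein in a reference set
    start: int   # aligned interval on the reference protein,
    end: int     # 1-based and inclusive


def overlap_fraction(a, b):
    """Overlap length divided by the shorter of the two aligned intervals."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    shorter = min(a.end - a.start, b.end - b.start) + 1
    return max(shared, 0) / shorter


def flag_pairs(hits, max_overlap=0.1, min_overlap=0.9):
    """Flag pairs of gene models that match the same reference protein.

    Thresholds are arbitrary illustrative values, not tuned parameters.
    """
    by_ref = defaultdict(list)
    for h in hits:
        by_ref[h.ref].append(h)

    flags = []
    for group in by_ref.values():
        for a, b in combinations(group, 2):
            frac = overlap_fraction(a, b)
            if a.contig != b.contig and frac <= max_overlap:
                # Different pieces of one protein on different contigs:
                # consistent with a cleaved gene (Fig. 1B).
                flags.append((a.gene, b.gene, "cleaved?"))
            elif frac >= min_overlap:
                # Two near-complete copies: a split gene (Fig. 1A) or a
                # real duplicate; this check cannot tell them apart.
                flags.append((a.gene, b.gene, "split or paralog?"))
    return flags


# Hypothetical toy alignments, for illustration only.
hits = [
    Hit("g1", "ctg1", "P1", 1, 150),
    Hit("g2", "ctg2", "P1", 151, 300),
    Hit("g3", "ctg3", "P2", 1, 200),
    Hit("g4", "ctg4", "P2", 5, 200),
]
flags = flag_pairs(hits)
print(flags)
```

The question marks in the labels are deliberate. Complementary fragments on separate contigs suggest cleavage, but two full-length copies look the same whether they come from split haplotypes or from true paralogs, so separating those cases would need further evidence such as sequence divergence or read depth.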

