Extensive error in the number of genes inferred from draft genome assemblies.

Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, Hahn MW - PLoS Comput. Biol. (2014)

Bottom Line: These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.


Affiliation: School of Informatics and Computing, Indiana University, Bloomington, Indiana.

ABSTRACT
Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.
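The per-family comparison described in the abstract can be illustrated with a minimal sketch. It assumes gene family sizes are already available as dictionaries mapping a family identifier to a gene count. The function name, the dictionary layout, and the toy counts are hypothetical and are not taken from the paper.

```python
def family_count_errors(reference_counts, draft_counts):
    """Compare per-family gene counts between a reference and a draft annotation.

    Both arguments map a family identifier to the number of genes annotated
    in that family (hypothetical layout, assumed for illustration).
    A family missing from one annotation is treated as having zero genes there.
    """
    families = set(reference_counts) | set(draft_counts)
    gained = 0  # families with more genes in the draft than in the reference
    lost = 0    # families with fewer genes in the draft than in the reference
    for family in families:
        ref = reference_counts.get(family, 0)
        draft = draft_counts.get(family, 0)
        if draft > ref:
            gained += 1
        elif draft < ref:
            lost += 1
    total = len(families)
    fraction_wrong = (gained + lost) / total if total else 0.0
    return {"gained": gained, "lost": lost, "fraction_wrong": fraction_wrong}


# Toy counts, invented for illustration only.
reference = {"famA": 2, "famB": 1, "famC": 3, "famD": 1}
draft = {"famA": 3, "famB": 1, "famC": 2, "famE": 1}
summary = family_count_errors(reference, draft)
```

Counting gains and losses separately reflects the abstract's point that draft assemblies both add and subtract genes.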


Figure 4 (pcbi-1003998-g004): Number of predicted genes increases with increasing genome fragmentation. Starting with the D. melanogaster reference genome (release 5.41), the sequence was cut into increasing numbers of “contigs.” GENSCAN gene predictions for each assembly are shown.

Mentions: We produced nine simulated D. melanogaster assemblies with between 707 and 17,941 contigs, and compared the number of predicted gene models in each. We again applied the GENSCAN and Fgenesh gene predictors, as well as the AUGUSTUS predictor [56] and the MAKER gene prediction pipeline [57]. As expected if fragmentation is a cause of increased gene number, the number of predicted genes in each simulated D. melanogaster assembly increased as the genomes became more fragmented (Table 2). Strikingly, in the simulated genome with 17,941 contigs (each of which has a length drawn from the distribution of contig lengths in the Daphnia pulex genome; see Methods), we find 32,025 GENSCAN-predicted genes with start and stop codons, a handful more than are present in the published Daphnia pulex genome (Fig. 4). Although the other predictors all give more modest increases in gene number with increasing fragmentation, they all predict 6,000–10,000 additional genes on the most fragmented assemblies (Table 2).
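The fragmentation step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure (their Methods are not reproduced here). It assumes contig lengths are sampled with replacement from an observed pool of lengths and that the reference is cut into consecutive, non-overlapping pieces. The function names and the toy inputs are hypothetical.

```python
import random


def fragment_sequence(sequence, length_pool, seed=0):
    """Cut `sequence` into consecutive contigs.

    Each contig length is sampled with replacement from `length_pool`
    (an assumed stand-in for an empirical contig length distribution).
    The last contig is truncated at the end of the sequence.
    """
    if not length_pool or min(length_pool) <= 0:
        raise ValueError("length_pool must contain only positive lengths")
    rng = random.Random(seed)  # fixed seed so the cut points are reproducible
    contigs = []
    start = 0
    while start < len(sequence):
        length = rng.choice(length_pool)
        contigs.append(sequence[start:start + length])
        start += length
    return contigs


def count_split_genes(gene_intervals, contigs):
    """Count genes that span at least one contig boundary.

    `gene_intervals` holds half-open (start, end) coordinates on the
    unfragmented sequence (hypothetical layout, assumed for illustration).
    """
    boundaries = []
    position = 0
    for contig in contigs[:-1]:  # the end of the last contig is not a cut
        position += len(contig)
        boundaries.append(position)
    return sum(
        any(start < b < end for b in boundaries)
        for start, end in gene_intervals
    )


# Toy input, invented for illustration only.
toy_sequence = "ACGT" * 10
toy_contigs = fragment_sequence(toy_sequence, [10])
split = count_split_genes([(0, 10), (5, 15), (12, 18), (18, 32)], toy_contigs)
```

A gene that spans a cut point ends up on more than one contig, so a predictor run on each contig separately can report its pieces as separate genes. This is the mechanism the paragraph above proposes for inflated gene counts.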

