Extensive error in the number of genes inferred from draft genome assemblies.

Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, Hahn MW - PLoS Comput. Biol. (2014)

Bottom Line: These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.

Affiliation: School of Informatics and Computing, Indiana University, Bloomington, Indiana.

ABSTRACT
Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.
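The fragmentation effect described in the abstract can be illustrated with a toy model. The sketch below is not the authors' simulation pipeline; it assumes genes are half-open intervals on a single sequence, that contig breaks fall at given coordinates, and that every fragment of a split gene is predicted as a separate gene (an upper bound, since a real gene predictor may miss short fragments). The function name and the coordinates are hypothetical.

```python
import bisect


def count_gene_fragments(genes, breakpoints):
    """Count gene pieces after the sequence is cut at contig breakpoints.

    genes       -- list of (start, end) half-open intervals
    breakpoints -- coordinates where the assembly breaks into separate contigs
    A breakpoint b splits a gene only if start < b < end.
    Assumption: each resulting piece is annotated as its own gene.
    """
    bps = sorted(set(breakpoints))
    total = 0
    for start, end in genes:
        lo = bisect.bisect_right(bps, start)  # index of first breakpoint > start
        hi = bisect.bisect_left(bps, end)     # index of first breakpoint >= end
        splits = max(0, hi - lo)              # breakpoints strictly inside the gene
        total += 1 + splits
    return total


# Hypothetical coordinates: three true genes, four contig breaks.
genes = [(100, 500), (800, 900), (1200, 2000)]
breakpoints = [300, 1000, 1500, 1800]

true_count = len(genes)
inferred_count = count_gene_fragments(genes, breakpoints)
print(true_count, inferred_count)
```

Under these assumptions, a gene crossed by k breakpoints contributes k + 1 predicted genes, which is how a more fragmented assembly can inflate the apparent gene number without any real duplication.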

Figure 2 (pcbi-1003998-g002): Differences in gene family size when comparing annotated draft genomes (see Table 1 for individual descriptions) to the chicken reference assembly (v4.0). For each gene family, the size (in total number of genes predicted) was compared to the chicken reference; positive numbers indicate an excess of genes in the draft genome annotations, while negative numbers indicate a deficit of genes. The small number of gene families that differ from the reference by more than ±3 genes are not shown. Gene models were predicted using GENSCAN.

Mentions: After clustering the filtered predictions into groups of homologous genes based on sequence similarity (equivalent to gene families; see Methods), we were able to compare gene family sizes in each assembly relative to the predicted sizes in the current chicken reference assembly (Fig. 2). As expected based on quality and coverage, the fosmid assembly shows the largest deviation in terms of gene family size relative to the reference chicken assembly. For each assembly no more than 60% of all gene families were the same size as in the reference assembly, meaning that the remaining 40% or more of families were inferred to have the wrong size. These gene families were either missing one or more genes relative to the reference or contained one or more additional members relative to the size of gene families inferred from the reference assembly. The fosmid assembly was a clear outlier, with more than half of all gene families missing gene copies relative to the reference.
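The comparison described above reduces to a per-family subtraction followed by a tally. The following is a minimal sketch of that bookkeeping, assuming gene family sizes are already available as dictionaries keyed by a family identifier; the family names and counts are made up for illustration and are not data from the paper, and the clustering step itself is not shown.

```python
from collections import Counter


def family_size_differences(reference, draft):
    """Return {family: draft_size - reference_size} for every family in either set.

    Positive values mean excess genes in the draft annotation,
    negative values mean a deficit relative to the reference.
    """
    families = set(reference) | set(draft)
    return {fam: draft.get(fam, 0) - reference.get(fam, 0) for fam in families}


def summarize(differences):
    """Fraction of families that match, exceed, or fall short of the reference."""
    n = len(differences)
    if n == 0:
        return {"same": 0.0, "excess": 0.0, "deficit": 0.0, "histogram": Counter()}
    values = list(differences.values())
    return {
        "same": sum(1 for d in values if d == 0) / n,
        "excess": sum(1 for d in values if d > 0) / n,
        "deficit": sum(1 for d in values if d < 0) / n,
        "histogram": Counter(values),  # counts per size difference, as plotted in a bar chart
    }


# Hypothetical family sizes (number of predicted genes per family).
reference = {"famA": 3, "famB": 1, "famC": 2, "famD": 4, "famE": 1}
draft = {"famA": 3, "famB": 2, "famC": 1, "famD": 4, "famE": 0}

diffs = family_size_differences(reference, draft)
summary = summarize(diffs)
print(summary)
```

In this made-up example only two of five families match the reference, so the "same" fraction is 0.4; the paper reports that this fraction was no more than 0.6 for each real draft assembly.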

