ConPADE: genome assembly ploidy estimation from next-generation sequencing data.

Margarido GR, Heckerman D - PLoS Comput. Biol. (2015)

Bottom Line: As a result of improvements in genome assembly algorithms and the ever decreasing costs of high-throughput sequencing technologies, new high quality draft genome sequences are published at a striking pace. Given the similarity between multiple copies of a basic genome in polyploid individuals, assembly of such data usually results in collapsed contigs that represent a variable number of homoeologous genomic regions. We show that ConPADE may also be used for related applications, such as the identification of duplicated genes in fragmented assemblies, although refinements are needed.


Affiliation: Microsoft Research, Los Angeles, California, United States of America; Departamento de Genética, Escola Superior de Agricultura "Luiz de Queiroz", Universidade de São Paulo, Piracicaba, Brazil.

ABSTRACT
As a result of improvements in genome assembly algorithms and the ever decreasing costs of high-throughput sequencing technologies, new high quality draft genome sequences are published at a striking pace. With well-established methodologies, larger and more complex genomes are being tackled, including polyploid plant genomes. Given the similarity between multiple copies of a basic genome in polyploid individuals, assembly of such data usually results in collapsed contigs that represent a variable number of homoeologous genomic regions. Unfortunately, such collapse is often not ideal, as keeping contigs separate can lead both to improved assembly and also insights about how haplotypes influence phenotype. Here, we describe a first step in avoiding inappropriate collapse during assembly. In particular, we describe ConPADE (Contig Ploidy and Allele Dosage Estimation), a probabilistic method that estimates the ploidy of any given contig/scaffold based on its allele proportions. In the process, we report findings regarding errors in sequencing. The method can be used for whole genome shotgun (WGS) sequencing data. We also show applicability of the method for variant calling and allele dosage estimation. Results for simulated and real datasets are discussed and provide evidence that ConPADE performs well as long as enough sequencing coverage is available, or the true contig ploidy is low. We show that ConPADE may also be used for related applications, such as the identification of duplicated genes in fragmented assemblies, although refinements are needed.
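The idea of estimating ploidy from allele proportions can be illustrated with a minimal sketch. This is not ConPADE's actual model: it assumes a plain binomial read-count model, a fixed symmetric sequencing error rate (`error`), a uniform prior over heterozygous dosages, and biallelic SNPs given as hypothetical `(alt_reads, depth)` pairs. All function names are illustrative.

```python
import math

def log_binom_pmf(k, n, p):
    """Log binomial probability of k alternate reads out of n, with 0 < p < 1."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1.0 - p)

def alt_read_prob(dosage, ploidy, error=0.01):
    """Chance a read shows the alternate allele, allowing symmetric miscalls (assumed)."""
    frac = dosage / ploidy
    return frac * (1.0 - error) + (1.0 - frac) * error

def snp_log_likelihood(alt, depth, ploidy, error=0.01):
    """Log-likelihood of one SNP, averaged over heterozygous dosages 1..ploidy-1."""
    terms = [log_binom_pmf(alt, depth, alt_read_prob(d, ploidy, error))
             for d in range(1, ploidy)]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms)) - math.log(len(terms))

def estimate_ploidy(snps, candidates=range(2, 9), error=0.01):
    """Return the candidate ploidy with the highest summed log-likelihood."""
    return max(candidates,
               key=lambda m: sum(snp_log_likelihood(a, n, m, error) for a, n in snps))

def estimate_dosage(alt, depth, ploidy, error=0.01):
    """Return the most likely alternate-allele dosage for a given ploidy."""
    return max(range(1, ploidy),
               key=lambda d: log_binom_pmf(alt, depth, alt_read_prob(d, ploidy, error)))

# Hypothetical SNP read counts (alternate reads, total depth) on one contig.
snps = [(20, 60), (40, 60), (21, 60), (39, 60)]
ploidy = estimate_ploidy(snps)
print(ploidy, [estimate_dosage(a, n, ploidy) for a, n in snps])
```

Averaging over dosages penalizes higher candidate ploidies that merely contain the observed proportions as a subset (for example, 2/6 versus 1/3), which is why more SNPs and deeper coverage make the estimate more stable in this toy setting.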


Fig 5 (pcbi.1004229.g005): Length simulation results. Color in each cell indicates the percentage of correct ploidy calls, out of 100 simulations of contigs sequenced at 50X coverage for each ploidy level.

Mentions: Because a coverage level of 50X resulted in correct estimated ploidies and high dosage estimation accuracy in the previous simulation sets, while still being viable in practice for de novo genome assembly efforts, we chose this value for more detailed simulations regarding contig lengths, the results of which are shown in Fig 5 and S5 Table. For contigs of 20,000 nucleotides or longer, which in this case contain 100 informative variants on average, the full model resulted in correct ploidy estimates in every simulated dataset. For very small contigs, containing only a handful of SNPs, ploidy estimation accuracy decreased with increasing ploidy, with 60 to 70% correct estimates for ploidies over 13. Once more, the percentage of correctly called dosages was above or close to 95%, providing evidence that, given the correct ploidy, dosage estimation at this level of coverage is accurate. False negative rates were higher for shorter contigs because read coverage was lower or absent at the contig edges (S5 Table).
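To make the link between contig length and the number of usable SNPs concrete, the following sketch scales the figure quoted above (about 100 informative variants per 20,000 nucleotides) to shorter contigs. The Poisson model for variant counts and the example contig lengths are assumptions made here for illustration; they are not taken from the paper's simulation design.

```python
import math

# From the text: roughly 100 informative variants per 20,000 nt contig.
REFERENCE_VARIANTS = 100
REFERENCE_LENGTH = 20_000

def expected_variants(contig_length):
    """Expected informative variants, assuming a constant density along the contig."""
    return contig_length * REFERENCE_VARIANTS / REFERENCE_LENGTH

def prob_at_most(k, contig_length):
    """P(variant count <= k), assuming counts are Poisson distributed (an assumption)."""
    lam = expected_variants(contig_length)
    return sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k + 1))

# Hypothetical contig lengths, chosen only to show the trend.
for length in (500, 1_000, 5_000, 20_000):
    print(f"{length:>6} nt: expected variants = {expected_variants(length):6.1f}, "
          f"P(at most 3 variants) = {prob_at_most(3, length):.4f}")
```

Under these assumptions, contigs of a few hundred nucleotides often carry only a handful of SNPs, which is consistent with the lower ploidy accuracy reported for very small contigs.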

