A gene-by-gene population genomics platform: de novo assembly, annotation and genealogical analysis of 108 representative Neisseria meningitidis genomes.

Bratcher HB, Corton C, Jolley KA, Parkhill J, Maiden MC - BMC Genomics (2014)

Bottom Line: The de novo assembly, combined with automated gene-by-gene annotation, generates high-quality draft genomes in which the majority of protein-encoding genes are present with high accuracy. The approach catalogues diversity efficiently, permits analyses of a single genome or multiple genome comparisons, and is a practical approach to interpreting WGS data for large bacterial population samples. The method generates novel insights into the biology of the meningococcus and improves our understanding of the whole population structure, not just disease-causing lineages.


Affiliation: Department of Zoology, University of Oxford, Oxford, UK. martin.maiden@zoo.ox.ac.uk.

ABSTRACT

Background: Highly parallel, 'second generation' sequencing technologies have rapidly expanded the number of bacterial whole genome sequences available for study, permitting the emergence of the discipline of population genomics. Most of these data are publicly available as unassembled short-read sequence files that require extensive processing before they can be used for analysis. The provision of data in a uniform format, which can be easily assessed for quality, linked to provenance and phenotype, and used for analysis, is therefore necessary.

Results: The performance of de novo short-read assembly followed by automatic annotation using the pubMLST.org Neisseria database was assessed and evaluated for 108 diverse, representative, and well-characterised Neisseria meningitidis isolates. High-quality sequences were obtained for >99% of known meningococcal genes among the de novo assembled genomes and four resequenced genomes, and fewer than 1% of reassembled genes had sequence discrepancies or misassembled sequences. A core genome of 1600 loci, present in at least 95% of the population, was determined using the Genome Comparator tool. Genealogical relationships compatible with, but at a higher resolution than, those identified by multilocus sequence typing were obtained with core genome comparisons and ribosomal protein gene analysis, which revealed a genomic structure for a number of previously described phenotypes. This unified system for cataloguing Neisseria genetic variation in the genome was implemented and used for multiple analyses, and the data are publicly available in the PubMLST Neisseria database.
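The core genome criterion above (a locus counts as core when it is present in at least 95% of isolates) can be illustrated with a minimal Python sketch. This is not the Genome Comparator implementation; the function name `core_loci`, the input layout (a dict mapping each isolate to the set of loci found in its assembly) and the toy isolate names are assumptions made for illustration only.

```python
from collections import Counter


def core_loci(presence, min_percent=95):
    """Return the loci present in at least `min_percent` % of isolates.

    `presence` maps an isolate identifier to the set of loci detected
    in that isolate (a hypothetical layout, not the PubMLST schema).
    """
    n_isolates = len(presence)
    if n_isolates == 0:
        return set()
    # Count, for each locus, how many isolates carry it.
    counts = Counter(locus for loci in presence.values() for locus in set(loci))
    # Integer comparison avoids floating-point rounding at the threshold.
    return {
        locus
        for locus, count in counts.items()
        if count * 100 >= min_percent * n_isolates
    }


# Toy data: 20 hypothetical isolates.
toy = {f"isolate_{i}": {"abcZ"} for i in range(20)}
for i in range(19):
    toy[f"isolate_{i}"].add("porA")   # present in 19 of 20 (95%)
for i in range(18):
    toy[f"isolate_{i}"].add("fHbp")   # present in 18 of 20 (90%)

print(sorted(core_loci(toy)))
```

With 108 isolates, as in this study, the same rule requires a locus to be present in at least 103 genomes, since 95% of 108 is 102.6.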

Conclusions: The de novo assembly, combined with automated gene-by-gene annotation, generates high-quality draft genomes in which the majority of protein-encoding genes are present with high accuracy. The approach catalogues diversity efficiently, permits analyses of a single genome or multiple genome comparisons, and is a practical approach to interpreting WGS data for large bacterial population samples. The method generates novel insights into the biology of the meningococcus and improves our understanding of the whole population structure, not just disease-causing lineages.


Figure 1: Location of eMLST and antigen genes within the meningococcal genome. CGView map of the Neisseria meningitidis reference genome, FAM18, showing the placement of the conventional and extended MLST loci and the 3 antigen genes (4 typing fragments) used to assess sequence accuracy of the de novo high-throughput assembly method across the genome.

The WGS assemblies were compared to previously determined dideoxy (Sanger) sequence results at the MLST [16], eMLST [27, 28], porA VR1 and VR2 [29], fetA [30] and fHbp [31] loci found throughout the genome (Figure 1). There were 34 sequence discrepancies (1.1% of CDS) between the Illumina de novo assembled and Sanger-sequenced alleles, from 20 of the 108 genomes (Table 2). The number and distribution of sequence changes found in the resequencing experiments enabled the likely reasons for the discrepancies to be identified. In the majority of cases these could be attributed to either editing or labelling problems in the original Sanger sequencing experiments; only four were a direct consequence of the assembly of the short-read sequences by the Velvet algorithm. The four MLST profiles affected by trace file editing errors maintained their original clonal complex assignment; however, their sequence type (ST) was amended to a new designation as a consequence of this work. In summary, the 34 discrepancies were due to: 11 trace file editing errors; 19 samples mislabelled during Sanger sequencing; and 4 occurrences of Velvet mis-assembly caused by short tandem repeat (STR) regions.
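The comparison described above can be sketched in Python as a per-locus check of assembled alleles against the Sanger reference alleles, followed by a tally of the reported causes. The function name `find_discrepancies`, the `(isolate, locus)` key layout and the toy sequences are hypothetical; only the cause counts (11, 19 and 4) come from the text, and the paper's actual comparison was carried out within the PubMLST platform.

```python
def find_discrepancies(sanger, assembled):
    """Return sorted (isolate, locus) keys whose sequences differ.

    Both arguments map (isolate, locus) to a nucleotide string
    (a hypothetical layout). Only keys present in both are compared,
    and the comparison ignores letter case.
    """
    shared = sanger.keys() & assembled.keys()
    return sorted(
        key for key in shared
        if sanger[key].upper() != assembled[key].upper()
    )


# Toy sequences, invented for illustration only.
sanger = {
    ("iso1", "abcZ"): "ATGGCA",
    ("iso1", "fetA"): "ATGTTT",
    ("iso2", "abcZ"): "ATGGCA",
}
assembled = {
    ("iso1", "abcZ"): "atggca",   # same allele, different case
    ("iso1", "fetA"): "ATGTTC",   # one base differs
    ("iso2", "abcZ"): "ATGGCA",
}
print(find_discrepancies(sanger, assembled))

# Cause breakdown as reported in the text.
causes = {
    "trace file editing error": 11,
    "sample mislabelled during Sanger sequencing": 19,
    "Velvet mis-assembly at STR region": 4,
}
total = sum(causes.values())
print(total)
```

The tally confirms that the three categories account for all 34 reported discrepancies, of which 30 originate from the Sanger data rather than from the short-read assembly.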