Meta-IDBA: a de Novo assembler for metagenomic data.

Peng Y, Leung HC, Yiu SM, Chin FY - Bioinformatics (2011)

Bottom Line: Meta-IDBA first tries to partition the de Bruijn graph into isolated components of different species based on an important observation. Then, for each component, it captures the slight variants of the genomes of subspecies from the same species by multiple alignments and represents the genome of one species using a consensus sequence. Comparison of the performance of Meta-IDBA and existing assemblers, such as Velvet and ABySS, on different metagenomic datasets shows that Meta-IDBA can reconstruct longer contigs with similar accuracy.


Affiliation: Department of Computer Science, The University of Hong Kong, Hong Kong.

ABSTRACT

Motivation: Next-generation sequencing techniques allow us to generate reads from a microbial environment in order to analyze the microbial community. However, assembling a set of mixed reads from different species into contigs is a bottleneck of metagenomic research. Although there are many assemblers for assembling reads from a single genome, there are no assemblers for assembling reads in metagenomic data without reference genome sequences. Moreover, the performance of these single-genome assemblers on metagenomic data is far from satisfactory, because common regions in the genomes of subspecies and species make the assembly problem much more complicated.

Results: We introduce the Meta-IDBA algorithm for assembling reads in metagenomic data, which contain multiple genomes from different species. There are two core steps in Meta-IDBA. It first tries to partition the de Bruijn graph into isolated components of different species based on an important observation. Then, for each component, it captures the slight variants of the genomes of subspecies from the same species by multiple alignments and represents the genome of one species using a consensus sequence. Comparison of the performance of Meta-IDBA and existing assemblers, such as Velvet and ABySS, on different metagenomic datasets shows that Meta-IDBA can reconstruct longer contigs with similar accuracy.

Availability: Meta-IDBA toolkit is available at our website http://www.cs.hku.hk/~alse/metaidba.

Contact: chin@cs.hku.hk.


Figure 1: A component in de Bruijn graph of five E.coli subspecies.

Mentions: Some assemblers resolve branches by merging similar sequences, called bubbles, into one sequence. A bubble is defined as several similar paths in the de Bruijn graph that share the same start vertex and the same end vertex (Zerbino and Birney, 2008). Bubble merging collapses similar regions and reduces the complexity of the de Bruijn graph. Assemblers that remove bubbles in single-genome assembly rely on an important assumption: a bubble is caused by a few single nucleotide polymorphisms (SNPs) or read errors, so the simple paths inside it are nearly identical, differing in only a few nucleotides. However, the ‘bubbles’ found in the graph of a metagenomic dataset do not follow this assumption. Different bubbles intermingle, which makes their start and end vertices very difficult to identify. Some of these bubbles are formed by a mixture of sp- and cr-branches. Figure 1 shows an example of this phenomenon, in which every simple path is contracted into a vertex for visualization. All branches at a vertex in this graph normally lead to some other vertex in the same component, but it is uncertain whether they form bubbles that should be merged. On closer inspection, even when a bubble is formed only by sp-branches (because of variations among subspecies), the multiple paths inside it may differ substantially, possibly with large insertions or deletions. Existing bubble-merging approaches for single-genome assembly do not work in this case; they fail to resolve these ‘bubbles’ and therefore cannot construct long contigs. Even if all bubbles could be identified, merging them into a consensus would not be easy.
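To make the single-genome bubble assumption concrete, the following is a minimal Python sketch, not the Meta-IDBA implementation or the code of any existing assembler. It builds a toy de Bruijn graph, finds "simple" bubbles (all branches leaving a start vertex run through unbranched vertices and meet at one end vertex), and checks whether the paths differ by only a few substitutions. All function names, the toy sequences, the choice k = 4 and the `max_mismatches` threshold are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_de_bruijn(sequences, k):
    """Vertices are (k-1)-mers; each k-mer contributes one edge."""
    succ, pred = defaultdict(set), defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            u, v = seq[i:i + k - 1], seq[i + 1:i + k]
            succ[u].add(v)
            pred[v].add(u)
    return dict(succ), dict(pred)


def walk_branch(first, succ, pred):
    """Follow one branch through vertices with exactly one predecessor
    and one successor. Returns (interior_vertices, stop_vertex)."""
    interior, node = [], first
    while len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1:
        interior.append(node)
        node = next(iter(succ[node]))
    return interior, node


def spell(vertices):
    """Spell the sequence of a path of overlapping (k-1)-mers."""
    return vertices[0] + "".join(v[-1] for v in vertices[1:])


def find_simple_bubbles(succ, pred):
    """Return (start, end, path_sequences) for every branching vertex
    whose branches all converge on the same end vertex."""
    bubbles = []
    for start in sorted(succ):
        if len(succ[start]) < 2:
            continue
        walks = [walk_branch(v, succ, pred) for v in sorted(succ[start])]
        ends = {end for _, end in walks}
        if len(ends) == 1:
            end = ends.pop()
            paths = [spell([start] + inner + [end]) for inner, _ in walks]
            bubbles.append((start, end, paths))
    return bubbles


def looks_like_snp_bubble(paths, max_mismatches=1):
    """Single-genome assumption: equal-length paths that differ in at
    most `max_mismatches` positions (hypothetical threshold)."""
    if len({len(p) for p in paths}) != 1:
        return False
    return all(
        sum(a != b for a, b in zip(p, q)) <= max_mismatches
        for p, q in combinations(paths, 2)
    )


# Toy sequences invented for this example.
ref = "ATGGCGTACCTGA"
snp = "ATGGCGAACCTGA"   # one substitution relative to ref
indel = "ATGGCGCTGA"    # three bases deleted relative to ref

for variant in (snp, indel):
    graph = build_de_bruijn([ref, variant], k=4)
    for start, end, paths in find_simple_bubbles(*graph):
        print(start, end, paths, looks_like_snp_bubble(paths))
```

In this toy case the substitution produces two equal-length paths that pass the similarity check, while the deletion produces paths of different lengths that fail it. The sketch only handles isolated bubbles with a clear start and end vertex; it does not attempt the intermingled bubbles described above, where those vertices are hard to identify.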

