Metassembler: merging and optimizing de novo genome assemblies.

Wences AH, Schatz MC - Genome Biol. (2015)

Bottom Line: Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We apply our metassembler algorithm to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for meta-assembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition.


Affiliation: Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA. alhernan@cshl.edu.

ABSTRACT
Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for meta-assembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition. The software is open-source at http://metassembler.sourceforge.net.


Fig. 1: Assemblathon 1 metassembly accuracy. Assembly contiguity and accuracy metrics are plotted at each merging step for all possible permutations of the five input assemblies: scaffold N50 (a), corrected contig N50 (b), duplicated reference bases (c), deleted reference bases (d), translocations (e), and relocations (f). For all plots, the x-axis represents the number of input assemblies being metassembled, with 1 being the starting assembly. The two horizontal red lines mark the final maximum and minimum value of the metric across all permutations. Most of the permutations are plotted in gray, while permutations of particular note are plotted with different colors: the pink line represents the permutation that has the maximum value in the final metassembly, while the dark blue line represents the permutation with the minimum value. The green line represents the permutation resulting from ordering the input assemblies by the overall rank reported in the Assemblathon 1 paper (Broad-BGI-WTSI-DOEJGI-CSHL); the light blue line represents the permutation obtained by ordering the input assemblies by scaffold N50 size (DOEJGI-Broad-WTSI-CSHL-BGI); and the brown line represents the order by contig N50 size (BGI-Broad-CSHL-WTSI-DOEJGI). Abbreviations: Comp Ref Bases, compressed reference bases; Dup Ref Bases, duplicated reference bases.
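The merge orders plotted in the figure can be enumerated with a few lines of Python. This is only an illustrative sketch: the assembly names come from the caption, but `merge_orders` and `merge_steps` are hypothetical helper names, and no actual merging is performed here. Each permutation is one merge order, and each prefix of that order corresponds to one point on the x-axis (1 input, 2 inputs, and so on).

```python
from itertools import permutations

# Assembly labels as they appear in the figure caption.
ASSEMBLIES = ["BGI", "Broad", "CSHL", "DOEJGI", "WTSI"]


def merge_orders(names):
    """Return every possible order in which the inputs could be merged."""
    return list(permutations(names))


def merge_steps(order):
    """Return the inputs included after each merging step (prefixes of the order)."""
    return [order[:k] for k in range(1, len(order) + 1)]


orders = merge_orders(ASSEMBLIES)

# The order by overall Assemblathon 1 rank, as listed in the caption.
rank_order = ("Broad", "BGI", "WTSI", "DOEJGI", "CSHL")
steps = merge_steps(rank_order)
```

Five inputs give 5! = 120 orders, which matches the 120 permutations evaluated in the paper.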

Mentions: We systematically metassembled all 120 (5!) possible permutations of the five input assemblies, using the 2.5-kbp mate-pair library to evaluate the compression-expansion (CE) status of each assembly (Fig. 1; Table S1, Figure S1, and Note S1 in Additional file 1). Because the genome has an exact reference available, we were able to compute ten different quality metrics, including the number of major structural errors and the corrected scaffold and contig N50 sizes, using the GAGE assembly evaluation tool. The corrected scaffold N50 size and corrected contig N50 size are computed by splitting the input sequences at places where significant errors are found relative to the reference assembly, and then computing the N50 sizes of the remaining sequences.
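The corrected N50 computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the GAGE implementation: the function names are hypothetical, the error positions are taken as given (how GAGE decides that an error is "significant" is not shown), and N50 is computed against the total length of the pieces rather than a reference genome size.

```python
from typing import Iterable, List, Sequence, Tuple


def n50(lengths: Iterable[int]) -> int:
    """Largest length L such that pieces of length >= L cover at least half the total."""
    ordered = sorted((n for n in lengths if n > 0), reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0  # no sequences at all


def split_at_breakpoints(length: int, breakpoints: Sequence[int]) -> List[int]:
    """Split one sequence at the given offsets and return the piece lengths."""
    # Ignore offsets at or beyond the ends; they would not create a new piece.
    cuts = sorted({b for b in breakpoints if 0 < b < length})
    edges = [0, *cuts, length]
    return [right - left for left, right in zip(edges, edges[1:])]


def corrected_n50(sequences: Sequence[Tuple[int, Sequence[int]]]) -> int:
    """N50 after breaking every sequence at its (assumed known) error positions.

    Each item is (sequence_length, error_offsets).
    """
    pieces: List[int] = []
    for length, breakpoints in sequences:
        pieces.extend(split_at_breakpoints(length, breakpoints))
    return n50(pieces)


# Toy example: three sequences, with one error in the middle of the longest.
raw = n50([100, 60, 40])
corrected = corrected_n50([(100, [50]), (60, []), (40, [])])
```

In this toy example the uncorrected N50 is 100, but breaking the longest sequence at its error yields pieces of 60, 50, 50, and 40 bases, so the corrected N50 drops to 50. An assembly that gains contiguity by joining sequences incorrectly is penalized in the same way.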

