Fine de novo sequencing of a fungal genome using only SOLiD short read data: verification on Aspergillus oryzae RIB40.

Umemura M, Koyama Y, Takeda I, Hagiwara H, Ikegami T, Koike H, Machida M - PLoS ONE (2013)

Bottom Line: The sequences of secondary metabolite biosynthetic genes and clusters, whose products are of considerable interest in fungal studies due to their potential medicinal, agricultural, and cosmetic properties, were also well reconstructed in the assembled scaffolds. Based on these findings, we concluded that de novo genome sequencing using only SOLiD short reads is feasible and practical for molecular biological study of fungi. We also investigated the effects of filtering low-quality data, library insert size, and k-mer size on assembly performance. For assembly, we recommend mildly filtered read data, for which the N50 is not greatly degraded, a library insert size of ∼2.0 kb, and a k-mer size of 33.


Affiliation: National Institute of Advanced Industrial Science and Technology (AIST), Sapporo, Hokkaido, Japan.

ABSTRACT
The development of next-generation sequencing (NGS) technologies has dramatically increased the throughput, speed, and efficiency of genome sequencing. The short read data generated by NGS platforms such as SOLiD and Illumina are well suited to mapping analysis. However, SOLiD reads, with lengths of <60 bp, have been considered too short for de novo genome sequencing. Here, to investigate whether de novo sequencing of fungal genomes is possible using only SOLiD short read data, we performed de novo assembly of the Aspergillus oryzae RIB40 genome using only 50-bp SOLiD reads generated from mate-paired libraries with 2.8- or 1.9-kb insert sizes. The assembled scaffolds showed an N50 value of 1.6 Mb, 22-fold larger than the values obtained using only SOLiD short reads in other published reports. In addition, almost 99% of the reference genome was accurately covered by long aligned fragments of the assembled scaffolds. The sequences of secondary metabolite biosynthetic genes and clusters, whose products are of considerable interest in fungal studies due to their potential medicinal, agricultural, and cosmetic properties, were also well reconstructed in the assembled scaffolds. Based on these findings, we concluded that de novo genome sequencing using only SOLiD short reads is feasible and practical for molecular biological study of fungi. We also investigated the effects of filtering low-quality data, library insert size, and k-mer size on assembly performance. For assembly, we recommend mildly filtered read data, for which the N50 is not greatly degraded, a library insert size of ∼2.0 kb, and a k-mer size of 33.
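
To make the relationship between the 50-bp read length and the recommended k-mer size of 33 concrete, the following is a minimal sketch of k-mer decomposition. It assumes the standard sliding-window definition of a k-mer; the function name `kmers` is hypothetical, the assembler used in the study is not named in this excerpt, and the example read is written in base-space letters purely for illustration (SOLiD data are natively encoded differently).

```python
def kmers(read, k):
    """Return every overlapping substring of length k in `read`.

    A read of length L yields L - k + 1 k-mers when 0 < k <= L,
    and none otherwise.
    """
    if k <= 0 or k > len(read):
        return []
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# Illustrative only: a made-up 50-character read, not real data.
example_read = "ACGTTGCAAC" * 5

# With 50-bp reads and k = 33, each read contributes 50 - 33 + 1 = 18 k-mers.
print(len(kmers(example_read, 33)))
```

Under this definition, a larger k leaves fewer k-mers per read but makes each k-mer more specific, which is the trade-off behind the k-mer size comparison in the abstract.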

Figure 6: R50 and N50 values with different k-mer sizes. (a) The R50 and (b) the N50 values for lib2.8.qv10 (closed squares) and lib1.9.qv10 (open squares). The R50 value corresponds to the N50 computed from the fragments of the reference genome that are covered by highly accurate sequences of the assembled scaffolds.

Mentions: The cumulative length profiles show that less redundant scaffolds were generated with larger k-mer sizes in both lib2.8.qv10 and lib1.9.qv10 (Fig. 5b, c). The scaffolding ratio, defined as the ratio of contig number to scaffold number, also increased with larger k-mer sizes (Table 6). As a result, the reference gene and genome sequences were reconstructed well, with longer scaffold fragments, as the k-mer size increased up to 33 (Fig. 3b, 4). The R50 value increased with k-mer size and reached a plateau at a k-mer size of 33 in lib2.8 (Fig. 6a). Increasing the k-mer size improved the assembly more in lib1.9 than in lib2.8: the R50 values were slightly smaller in lib1.9 than in lib2.8 when the k-mer size was under 29, but became larger in lib1.9 when the k-mer size was 29 or larger. Note that the N50 profile differs from the R50 profile because N50 does not account for misarrangements. Judged by N50 alone, the most suitable k-mer size would be 29 in both lib2.8.qv10 and lib1.9.qv10 (Fig. 6b), but N50 by itself is not sufficient for correctly estimating assembly performance.
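
The two metrics discussed here can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: `n50` uses the common definition (the length at which the cumulative sum of lengths, sorted longest first, reaches half the total), and `scaffolding_ratio` follows the definition given in the text (contig number divided by scaffold number). Both function names and all input values are hypothetical. Going by the figure caption, R50 would be the same N50 calculation applied to the lengths of the reference genome fragments covered by accurately aligned scaffold sequence; producing those fragments requires an alignment step that is not shown.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Lengths are sorted from longest to shortest and accumulated; the N50
    is the length at which the running total first reaches half of the
    overall total.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


def scaffolding_ratio(contig_count, scaffold_count):
    """Ratio of contig number to scaffold number, as defined in the text."""
    if scaffold_count <= 0:
        raise ValueError("scaffold_count must be positive")
    return contig_count / scaffold_count


# Made-up lengths for illustration only (not data from the study).
scaffold_lengths = [80, 70, 50, 40, 30, 20]
print(n50(scaffold_lengths))        # N50 of the scaffolds themselves
print(scaffolding_ratio(120, 40))   # hypothetical contig and scaffold counts
```

Because `n50` only sees lengths, a scaffold that joins unrelated regions still raises the value; computing the same statistic on accurately aligned reference fragments (the R50 idea) is what lets misarrangements lower the score.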

