Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data.

Daly GM, Leggett RM, Rowe W, Stubbs S, Wilkinson M, Ramirez-Gonzalez RH, Caccamo M, Bernal W, Heeney JL - PLoS ONE (2015)

Bottom Line: Comparison of assembly algorithms with pre-assembly host-mapping subtraction using a short-read mapping tool, a k-mer frequency based filter and a low complexity filter, has been validated for viral discovery with Illumina data derived from naturally infected liver tissue and simulated data. Assembled contig numbers were significantly reduced (up to 99.97%) by the application of these pre-assembly filtering methods. This approach provides a validated method for maximizing viral contig size as well as reducing the total number of assembled contigs that require down-stream analysis as putative viral nucleic acids.


Affiliation: Lab of Viral Zoonotics, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB3 0ES, United Kingdom.

ABSTRACT
The use of next generation sequencing (NGS) to identify novel viral sequences from eukaryotic tissue samples is challenging. Issues can include the low proportion and copy number of viral reads and the high number of contigs (post-assembly), making subsequent viral analysis difficult. Comparison of assembly algorithms with pre-assembly host-mapping subtraction using a short-read mapping tool, a k-mer frequency based filter and a low complexity filter, has been validated for viral discovery with Illumina data derived from naturally infected liver tissue and simulated data. Assembled contig numbers were significantly reduced (up to 99.97%) by the application of these pre-assembly filtering methods. This approach provides a validated method for maximizing viral contig size as well as reducing the total number of assembled contigs that require down-stream analysis as putative viral nucleic acids.
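
To make the idea of a pre-assembly low complexity filter concrete, here is a minimal sketch. It is not the filter used in the paper, which is not described in this excerpt. It assumes reads are plain strings and scores each read by the Shannon entropy of its overlapping dinucleotides. The function names, the dinucleotide choice and the `min_entropy` threshold of 2.5 bits are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(seq: str, k: int = 2) -> float:
    """Shannon entropy (in bits) of the overlapping k-mers in seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_low_complexity(reads, min_entropy: float = 2.5, k: int = 2):
    """Keep only reads whose k-mer entropy reaches min_entropy.

    The threshold is an assumed, illustrative value; it would need
    tuning against real data before use.
    """
    return [r for r in reads if shannon_entropy(r.upper(), k) >= min_entropy]


if __name__ == "__main__":
    reads = [
        "A" * 50,                                             # homopolymer
        "AT" * 25,                                            # simple repeat
        "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCGATCGATTACGGCAT",  # mixed sequence
    ]
    for read in filter_low_complexity(reads):
        print(read)
```

Homopolymers and short tandem repeats score near 0 to 1 bit under this measure, while mixed sequence approaches the 4-bit maximum for dinucleotides, so a mid-range threshold separates them in this toy case.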




pone.0129059.g005: Optimal word size for viral assembly with multiple assemblers.

Mentions: We compared a range of assembly algorithms and word size settings using four real Illumina datasets: two from HCV-infected liver (containing viral reads with mean reference coverage of 0.7x and 9x) and two from HBV-infected liver (containing viral reads with mean reference coverage of 20x and 200x). The reads were assembled using four algorithms, Velvet [43], MetaCortex [44–45], ABySS [46] and CLC, over a range of word sizes. ABySS, Velvet and CLC are genomic assemblers, while MetaCortex is a development of the Cortex assembler aimed at highly diverse metagenomic datasets. The longest viral contig assembled by each algorithm (as a % of the reference genome length) over a range of word sizes is shown in Fig 5. The CLC algorithm consistently produced the greatest length of reference genome coverage for every sample across the range of depths of viral sequence coverage. Interestingly, Velvet, MetaCortex and ABySS were generally less effective at 200x viral coverage than at 20x coverage. The artificial viral sequence dataset was also assembled de novo using CLC v6.5; extensive testing of word sizes revealed an optimal word size of 21 (data not shown).
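
The metric described above, longest viral contig as a percentage of reference genome length, can be tabulated per assembler and word size with a few lines of code. The sketch below is an illustration and not the authors' pipeline. It assumes the lengths of contigs already identified as viral are available in a nested dictionary. The function names, the assembler labels and all numbers are hypothetical toy values, not results from the paper.

```python
def longest_contig_pct(contig_lengths, reference_length: int) -> float:
    """Longest contig as a percentage of the reference genome length."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    if not contig_lengths:
        return 0.0
    return 100.0 * max(contig_lengths) / reference_length


def best_word_size(results, reference_length: int):
    """Pick, for each assembler, the word size giving the longest contig.

    results maps assembler name -> {word size -> list of viral contig
    lengths}. Ties are broken in favour of the smaller word size.
    Returns {assembler: (word_size, percent_of_reference)}.
    """
    best = {}
    for assembler, by_word_size in results.items():
        if not by_word_size:
            continue  # nothing was assembled for this assembler
        scored = {
            k: longest_contig_pct(lengths, reference_length)
            for k, lengths in by_word_size.items()
        }
        k_best = max(scored, key=lambda k: (scored[k], -k))
        best[assembler] = (k_best, scored[k_best])
    return best


if __name__ == "__main__":
    # Toy values for illustration only (hypothetical 1000 bp reference).
    toy_results = {
        "assembler_A": {21: [300, 500], 31: [900], 41: [200]},
        "assembler_B": {21: [], 31: [400, 100]},
    }
    for name, (k, pct) in best_word_size(toy_results, 1000).items():
        print(f"{name}: word size {k}, longest contig {pct:.1f}% of reference")
```

Reporting only the longest contig matches the figure's metric, but it hides fragmentation: two assemblies with the same longest contig can differ greatly in total contig count, so the contig count is worth recording alongside it.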

