Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data.

Daly GM, Leggett RM, Rowe W, Stubbs S, Wilkinson M, Ramirez-Gonzalez RH, Caccamo M, Bernal W, Heeney JL - PLoS ONE (2015)

Bottom Line: Comparison of assembly algorithms with pre-assembly host-mapping subtraction using a short-read mapping tool, a k-mer frequency based filter and a low complexity filter, has been validated for viral discovery with Illumina data derived from naturally infected liver tissue and simulated data. Assembled contig numbers were significantly reduced (up to 99.97%) by the application of these pre-assembly filtering methods. This approach provides a validated method for maximizing viral contig size as well as reducing the total number of assembled contigs that require down-stream analysis as putative viral nucleic acids.


Affiliation: Lab of Viral Zoonotics, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB3 0ES, United Kingdom.

ABSTRACT
The use of next generation sequencing (NGS) to identify novel viral sequences from eukaryotic tissue samples is challenging. Issues can include the low proportion and copy number of viral reads and the high number of contigs (post-assembly), making subsequent viral analysis difficult. Comparison of assembly algorithms with pre-assembly host-mapping subtraction using a short-read mapping tool, a k-mer frequency based filter and a low complexity filter, has been validated for viral discovery with Illumina data derived from naturally infected liver tissue and simulated data. Assembled contig numbers were significantly reduced (up to 99.97%) by the application of these pre-assembly filtering methods. This approach provides a validated method for maximizing viral contig size as well as reducing the total number of assembled contigs that require down-stream analysis as putative viral nucleic acids.




pone.0129059.g003: Host mapping subtraction by reference set.

Mentions: A database of 62 human virus chromatids from 35 distinct viruses was used to generate simulated Illumina read pairs (10x coverage depth) with pIRS (methods). These paired reads (167,004) were combined with 16.7 million Illumina paired sequence reads from a ‘clean’ healthy liver sample (methods). Using the mapping algorithms CLC, BWA and Bowtie (methods), we investigated the percentage of host and viral sequence reads subtracted by mapping to human reference sequence data sets, including the human genome, the human mitochondrial genome and a human rRNA sequence set (see methods) [Fig 2]. Mapper settings equivalent to 90% homology over 80% of the sequence read length (0.8/0.9) subtracted 95% ± 0.2% of the host-derived Illumina reads without a major loss of viral sequence reads (0.02% ± 0.005) for the algorithms tested [Fig 2] (Bowtie data not included for clarity). Analysis of the viral sequence reads subtracted at this mapping stringency revealed that they were exclusively homopolymeric tracts and/or repeats with a low GC content. Consequently, this mapping stringency was used for subsequent validation studies. Mapping of the remaining (un-subtracted) sequences to the viral references revealed no gaps in coverage relative to the non-subtracted original read set. We further characterized the effect of subtracting host sequences using the human genome build/mitochondria and the rRNA reference sets independently, with Illumina (100nt, paired-end) data sets [Fig 3] derived from total RNA, randomly primed and amplified, from idiopathic liver samples (n = 10). The greatest number of host sequence reads was subtracted by mapping to the human genome including the mitochondria (87.4 ± 4.3%, arithmetic mean ± standard deviation), with an average of 2.1 ± 1.3% of the host sequence reads subtracted by mapping to the human rRNA data set alone. The use of both reference data sets was additive, with 89.2% of Illumina reads removed.
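The 0.8/0.9 mapping stringency described above can be illustrated with a minimal sketch. It assumes that an aligner has already reported, for each read, the aligned length and the number of matching bases; the `Alignment` record and the function names `is_host` and `subtract_host` are hypothetical and are not part of CLC, BWA or Bowtie.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Alignment:
    """Hypothetical summary of one read's best hit against a host reference."""
    read_length: int     # total read length in nt
    aligned_length: int  # number of read bases covered by the alignment
    matches: int         # identical bases within the aligned region


def is_host(aln: Alignment,
            min_length_fraction: float = 0.8,
            min_similarity: float = 0.9) -> bool:
    """True if the hit meets the 0.8/0.9 stringency (assumed interpretation)."""
    if aln.read_length <= 0 or aln.aligned_length <= 0:
        return False
    length_fraction = aln.aligned_length / aln.read_length
    similarity = aln.matches / aln.aligned_length
    return length_fraction >= min_length_fraction and similarity >= min_similarity


def subtract_host(hits: Dict[str, Optional[Alignment]]) -> List[str]:
    """Return the names of reads kept after host subtraction.

    A read with no host hit (None) is always kept.
    """
    return [name for name, aln in hits.items()
            if aln is None or not is_host(aln)]


hits = {
    "read_a": Alignment(read_length=100, aligned_length=85, matches=80),  # host
    "read_b": Alignment(read_length=100, aligned_length=70, matches=70),  # too short
    "read_c": Alignment(read_length=100, aligned_length=90, matches=75),  # too divergent
    "read_d": None,                                                       # unmapped
}
kept = subtract_host(hits)
```

In this sketch a read is subtracted only when both thresholds are met, so short or divergent hits (as might be expected for viral reads with partial similarity to the host) are retained for assembly.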
Additionally, we applied host-mapping subtraction against our rRNA references to Illumina paired read sets (n = 10) derived from nuclease-treated liver cytosolic samples (an encapsidated virus enrichment protocol) [8] (see enriched, Fig 3). Mapping to the rRNA set subtracted 17.9% ± 13.6% of the reads, showing the potential benefit of rRNA mapping subtraction prior to assembly in this context. We next tested the effect of sequence read subtraction using Kontaminant (methods), a read-filtering tool newly developed at The Genome Analysis Centre (TGAC), Norwich, UK [42]. We again used our artificial Illumina paired-end viral read set embedded in healthy liver derived Illumina paired-end reads (as described above). Following k-mer filtering, the data set was reduced to 67,432 reads: 99.997% of the human liver reads and 4.7% of the viral reads were removed (not including duplicates) [Fig 4]. The GC content (%) profile of the viral reads that were filtered mirrored that of the unfiltered viral reads. Additionally, mapping of the remaining (unfiltered) reads to the reference viral genomes revealed that minimal gaps in coverage were present (0.6 ± 0.6%) at the reference terminal ends only, relative to the unfiltered read set.
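The general idea behind k-mer based read filtering can be sketched as follows. This is an illustration only and is not Kontaminant's implementation: the function names, the value of k and the `min_shared` threshold are assumptions, and reverse complements are ignored for brevity.

```python
from typing import Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_reference_kmers(references: Iterable[str], k: int) -> Set[str]:
    """Collect the k-mers of every host reference sequence into one set."""
    ref: Set[str] = set()
    for sequence in references:
        ref |= kmers(sequence, k)
    return ref


def filter_reads(reads: Iterable[str], ref_kmers: Set[str],
                 k: int, min_shared: int = 1) -> List[str]:
    """Keep reads that share fewer than min_shared k-mers with the host set."""
    kept = []
    for read in reads:
        shared = len(kmers(read, k) & ref_kmers)
        if shared < min_shared:
            kept.append(read)
    return kept


# Toy data: one short "host" sequence and two reads.
host = ["ACGTACGTTAGCCGATAGGCTA"]
reads = ["CGTTAGCCGA",        # substring of the host sequence
         "TTTTTGGGGGCCCCC"]   # shares no 5-mer with the host
ref = build_reference_kmers(host, k=5)
remaining = filter_reads(reads, ref, k=5)
```

Raising `min_shared` makes the filter more conservative, which in a sketch like this would trade host removal for retention of reads that share only a few k-mers with the host by chance.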

