Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data.

Daly GM, Leggett RM, Rowe W, Stubbs S, Wilkinson M, Ramirez-Gonzalez RH, Caccamo M, Bernal W, Heeney JL - PLoS ONE (2015)

Bottom Line: Comparison of assembly algorithms with pre-assembly host-mapping subtraction using a short-read mapping tool, a k-mer frequency based filter and a low complexity filter, has been validated for viral discovery with Illumina data derived from naturally infected liver tissue and simulated data. Assembled contig numbers were significantly reduced (up to 99.97%) by the application of these pre-assembly filtering methods. This approach provides a validated method for maximizing viral contig size as well as reducing the total number of assembled contigs that require down-stream analysis as putative viral nucleic acids.


Affiliation: Lab of Viral Zoonotics, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB3 0ES, United Kingdom.

ABSTRACT
The use of next generation sequencing (NGS) to identify novel viral sequences from eukaryotic tissue samples is challenging. Issues can include the low proportion and copy number of viral reads and the high number of contigs (post-assembly), making subsequent viral analysis difficult. Comparison of assembly algorithms with pre-assembly host-mapping subtraction using a short-read mapping tool, a k-mer frequency based filter and a low complexity filter, has been validated for viral discovery with Illumina data derived from naturally infected liver tissue and simulated data. Assembled contig numbers were significantly reduced (up to 99.97%) by the application of these pre-assembly filtering methods. This approach provides a validated method for maximizing viral contig size as well as reducing the total number of assembled contigs that require down-stream analysis as putative viral nucleic acids.
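
As an illustration of what a low complexity filter of the kind named in the abstract might look like, the following is a minimal sketch and not the filter used in the study. It scores each read by the Shannon entropy of its dinucleotide composition and drops reads below a threshold. The function names, the choice of dinucleotides (k = 2) and the 2.5-bit cut-off are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import Iterable, List


def shannon_entropy(seq: str, k: int = 2) -> float:
    """Shannon entropy (in bits) of the k-mer composition of a sequence."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0  # sequence shorter than k: treat as zero complexity
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_low_complexity(seq: str, min_entropy: float = 2.5, k: int = 2) -> bool:
    """True if the read falls below the (assumed) entropy threshold."""
    return shannon_entropy(seq, k) < min_entropy


def filter_low_complexity(reads: Iterable[str],
                          min_entropy: float = 2.5,
                          k: int = 2) -> List[str]:
    """Keep only reads whose k-mer entropy reaches the threshold."""
    return [r for r in reads if not is_low_complexity(r, min_entropy, k)]


if __name__ == "__main__":
    reads = ["A" * 100, "AACAGATCCGCTGGTTA"]
    # The homopolymer is dropped; the mixed-composition read is kept.
    print(filter_low_complexity(reads))
```

With dinucleotides the maximum possible score is 4 bits (all 16 dinucleotides equally frequent), so the threshold should be tuned against real reads rather than taken from this sketch.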



pone.0129059.g012: Idiopathic hepatitis liver datasets: a) Read reduction following mapping subtraction and k-mer similarity filtering. b) Effect of k-mer filtering (K-mer) & host mapping subtraction (Map) on post-assembly contig number. c) Effect of k-mer filtering (K-mer) & host mapping subtraction (Map) on viral contig size and reference coverage.

Mentions: Two human hepatitis liver samples, clinically defined as idiopathic (explanted livers from the transplant setting) and confirmed negative for hepatitis-associated viruses prior to transplant, were processed for total RNA and prepared by SISPA for Illumina NGS as described in Materials and Methods. Sample 1 (121 million 100 nt reads) and sample 2 (105 million 100 nt reads) were filtered with a short-read mapper, with and without subsequent application of the k-mer filter (Kontaminant). The resulting read subtraction is shown in Fig 12A; subtraction levels were comparable to those for the artificial metagenomics dataset and the viral control sets used previously. Optimal de novo assembly again yielded contig numbers commensurate with the previously tested control samples, with short-read mapper subtraction reducing contig numbers approximately 100-fold for both idiopathic samples. Secondary application of the k-mer filter reduced contig numbers further, by 90% for sample 1 (620 contigs assembled) and by 83% for sample 2 (8027 contigs assembled), as shown in Fig 12B. Contig sets of this size can be searched against the NCBI nt database with BLASTn in a matter of hours, and with tBLASTx in a few days, on a standard 8-core desktop computer. To determine whether the idiopathic samples included viral contigs, we first searched the contigs against the NCBI nt database with both BLASTn and tBLASTx (Materials and Methods). For the purposes of comparison between pre-assembly filtering methods, we considered BLAST hits to be viral only after applying high-stringency cut-offs (as detailed in Materials and Methods). The best viral hit references were then used to align the putative viral contigs from the different samples, to ascertain changes in the largest viral contig present and in the total reference coverage of the viral contigs (Fig 12C).
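
The idea behind a k-mer based host filter can be sketched as follows. This is an illustrative toy, not Kontaminant's algorithm or interface: it indexes every k-mer of the host sequences (both strands) in a Python set and discards reads in which more than a chosen fraction of k-mers occur in that index. The function names, the value of k and the 0.5 threshold are assumptions for the example.

```python
from typing import Iterable, Iterator, List, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer of a sequence."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_host_index(host_seqs: Iterable[str], k: int) -> Set[str]:
    """Collect all host k-mers from both strands into a set."""
    index: Set[str] = set()
    for seq in host_seqs:
        index.update(kmers(seq, k))
        index.update(kmers(reverse_complement(seq), k))
    return index


def host_kmer_fraction(read: str, index: Set[str], k: int) -> float:
    """Fraction of a read's k-mers that are present in the host index."""
    total = 0
    shared = 0
    for km in kmers(read, k):
        total += 1
        if km in index:
            shared += 1
    return shared / total if total else 0.0


def filter_reads(reads: Iterable[str], index: Set[str], k: int,
                 max_host_fraction: float = 0.5) -> List[str]:
    """Keep reads whose host k-mer fraction does not exceed the threshold."""
    return [r for r in reads
            if host_kmer_fraction(r, index, k) <= max_host_fraction]


if __name__ == "__main__":
    host = "ACGTACGTTAGCTAGCTAGGATCCGATCGA"   # made-up stand-in for a host genome
    index = build_host_index([host], k=8)
    reads = [host[5:25], "TTTTTGGGGGCCCCCAAAAA"]  # host-derived read, unrelated read
    print(filter_reads(reads, index, k=8))
```

A set of strings is only practical for small references; a real host genome would need a compact k-mer structure, which is outside the scope of this sketch.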
Between the two idiopathic samples, four distinct viral hits were detected: a) a retrovirus-like sequence with greatest homology to the SIV sequence U42720 (complete cds, 9068 nt); b) HCV hits with greatest homology to the HCV2b sequence AY232737 (complete cds, 9711 nt); c) TTV hits with greatest homology to the TTV sequence AF345527 (complete cds, 3208 nt); and d) a herpes-like hit with greatest homology to the HHV4 sequence AJ507799 (complete genome, 171823 nt). Adding the k-mer filter pre-assembly had the following effects. For U42720, the largest contig and total coverage were both reduced slightly (by 12% and 7% respectively). For AY232737, the largest contig was reduced (1.3%) while total coverage increased (82%). For AF345527, the largest contig was reduced (19%) while total coverage increased (20.1%). For AJ507799, the largest contig and total coverage were unchanged. Overall, the post-assembly data are consistent with our viral control and artificial datasets.
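
The two metrics compared above, largest viral contig and total reference coverage, can be computed from contig-to-reference alignments roughly as in the sketch below. It assumes each aligned contig is reduced to a half-open (start, end) interval on the reference. The function names and all numbers in the usage example are invented for illustration and are not data from the study.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the reference


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def reference_coverage_percent(intervals: Iterable[Interval],
                               ref_len: int) -> float:
    """Percentage of the reference covered by at least one contig."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return 100.0 * covered / ref_len


def largest_contig(contig_lengths: Iterable[int]) -> int:
    """Length of the longest contig, or 0 if there are none."""
    return max(contig_lengths, default=0)


def percent_change(before: float, after: float) -> float:
    """Signed percentage change from 'before' to 'after'."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    # Hypothetical alignments against a 1000 nt reference.
    map_only = [(0, 100), (50, 200), (300, 400)]
    map_plus_kmer = [(0, 180), (300, 400)]
    cov_before = reference_coverage_percent(map_only, 1000)
    cov_after = reference_coverage_percent(map_plus_kmer, 1000)
    print(cov_before, cov_after, percent_change(cov_before, cov_after))
```

Merging intervals before summing matters because overlapping contigs would otherwise inflate the coverage figure.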