A robust and cost-effective approach to sequence and analyze complete genomes of small RNA viruses

ABSTRACT

Background: Next-generation sequencing (NGS) allows ultra-deep sequencing of nucleic acids. The use of sequence-independent amplification of viral nucleic acids without utilization of target-specific primers provides advantages over traditional sequencing methods and allows detection of unsuspected variants and co-infecting agents. However, NGS is not widely used for small RNA viruses because of incorrectly perceived cost estimates and inefficient utilization of freely available bioinformatics tools.

Methods: In this study, we have utilized NGS-based random sequencing of total RNA combined with barcode multiplexing of libraries to quickly, effectively and simultaneously characterize the genomic sequences of multiple avian paramyxoviruses. Thirty libraries were prepared from diagnostic samples amplified in allantoic fluids and their total RNAs were sequenced in a single flow cell on an Illumina MiSeq instrument. After digital normalization, data were assembled using the MIRA assembler within a customized workflow on the Galaxy platform.

Results: Twenty-eight avian paramyxovirus 1 (APMV-1), one APMV-13, four avian influenza and two infectious bronchitis virus complete or nearly complete genome sequences were obtained from the single run. The 29 avian paramyxovirus genomes displayed 99.6% mean coverage based on bases with Phred quality scores of 30 or more. The lower and upper quartiles of sample median depth per position for those 29 samples were 2984 and 6894, respectively, indicating coverage across samples sufficient for deep variant analysis. Sample processing and library preparation took approximately 25–30 h, the sequencing run took 39 h, and processing through the Galaxy workflow took approximately 2–3 h. The cost of all steps, excluding labor, was estimated to be 106 USD per sample.

Conclusions: This work describes an efficient multiplexing NGS approach, a detailed analysis workflow, and customized tools for the characterization of the genomes of RNA viruses. The combination of multiplexing NGS technology with the Galaxy workflow platform resulted in a fast, user-friendly, and cost-efficient protocol for the simultaneous characterization of multiple full-length viral genomes. Twenty-nine full-length or near-full-length APMV genomes with a high median depth were successfully sequenced out of 30 samples. The applied de novo assembly approach also allowed identification of mixed viral populations in some of the samples.

Electronic supplementary material: The online version of this article (doi:10.1186/s12985-017-0741-5) contains supplementary material, which is available to authorized users.


Fig. 2: Analysis of Newcastle disease virus genome assembly at various read depths. Shown are the longest contig produced at each read depth as a fraction of the full genome length. Subsamples up to 200x were generated using digital normalization. Above 200x, additional reads were added using random subsampling (due to issues with high median cutoffs in the kh-mer package). At each subsampling depth, the final velvetg assembly was optimized for maximum contig length based on the “cov_cutoff” parameter.

In order to take advantage of the overlapping reads, a merging step was introduced to produce longer pseudo-reads and to reduce the complexity of the assembly task. An essential optimization was made by reducing the estimated coverage depth to a level that would still produce optimal assemblies. Two techniques for data reduction were investigated. Random sub-sampling resulted in loss of specific regions in the genome with reproducibly low coverage (data not shown). Digital normalization, which aims to down-sample high-coverage regions while preserving reads from low-coverage areas, provided a means of decreasing the number of reads used to an optimal level without loss of data, and thus was incorporated into the customized Galaxy workflow prior to assembly. In order to determine an optimal target depth for assembly, preliminary test assemblies using the Velvet assembler v1.2.10 [44] were performed on a geometric progression of sampling depths from 10x to 10000x (the approximate depth of the raw data), with an additional optimization of the velvetg “cov_cutoff” parameter (the parameter used to remove low-coverage nodes) for each depth. The results indicated that optimal (in this case, full-length) assembly occurred over a range of approximately one order of magnitude (100x to 1000x). Below and above this range, fragmentation began to occur (Fig. 2).
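The idea behind digital normalization can be illustrated with a minimal Python sketch. It is not the implementation used in the Galaxy workflow: the function names, the k-mer size `k`, and the `cutoff` value are assumptions chosen for illustration, and an exact in-memory counter stands in for the memory-efficient counting structure a production tool would use. A read is kept only while the median count of its k-mers, among the reads already kept, is still below the target depth, so reads from heavily covered regions are discarded once that depth is reached, whereas reads from sparsely covered regions are retained.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer of seq (nothing if the read is shorter than k)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digital_normalize(reads, k=20, cutoff=20):
    """Illustrative digital normalization (k and cutoff are assumed values).

    A read is kept when the median count of its k-mers, counted over the
    reads kept so far, is below `cutoff`. Only kept reads update the counts.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: coverage cannot be estimated
        estimated_depth = median(counts[km] for km in read_kmers)
        if estimated_depth < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept
```

For example, calling `digital_normalize(reads, k=4, cutoff=5)` on fifty copies of one read plus a single read from elsewhere in the genome keeps only the first few copies of the abundant read but still keeps the rare one. Random sub-sampling offers no such guarantee, which is consistent with the loss of low-coverage regions reported above.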


