Fragmentation and Coverage Variation in Viral Metagenome Assemblies, and Their Effect in Diversity Calculations.

García-López R, Vázquez-Castellanos JF, Moya A - Front Bioeng Biotechnol (2015)

Bottom Line: Alpha diversity calculated from contigs as OTUs resulted in significantly higher values for all assemblies when compared with actual species distribution, showing an overestimation due to the increased predicted abundance. Conversely, using PHACCS resulted in lower values for all assemblers. Using contigs for calculating alpha diversity results in overestimation, but it is usually the only approach available.


Affiliation: Área de Genómica y Salud, Fundación para el Fomento de la Investigación Sanitaria y Biomédica de la Comunidad Valenciana (FISABIO)-Salud Pública, Valencia, Spain; Institut Cavanilles de Biodiversitat i Biologia Evolutiva, Universitat de València, Paterna, Spain; Consorcio de Investigación Biomédica en Red especializado en Epidemiología y Salud Pública (CIBERESP), Madrid, Spain.

ABSTRACT
Metagenomic libraries consist of DNA fragments from diverse species, with varying genome size and abundance. High-throughput sequencing platforms produce large volumes of reads from these libraries, which may be assembled into contigs, ideally resembling the original larger genomic sequences. The uneven species distribution, along with the stochasticity in sample processing and sequencing bias, impacts the success of accurate sequence assembly. Several assemblers enable the processing of viral metagenomic data de novo, generally using overlap layout consensus or de Bruijn graph approaches for contig assembly. The success of viral genomic reconstruction in these datasets is limited by the degree of fragmentation of each genome in the sample, which is dependent on the sequencing effort and the genome length. Depending on ecological, biological, or procedural biases, some fragments have a higher prevalence, or coverage, in the assembly. However, assemblers must face challenges, such as the formation of chimeric structures and intra-species variability. Diversity calculation relies on the classification of the sequences that comprise a metagenomic dataset. Whenever the corresponding genomic and taxonomic information is available, contigs matching the same species can be classified accordingly and the coverage of the genome can be calculated for that species. This may be used to compare populations by estimating abundance and assessing species distribution from these data. Nevertheless, the coverage does not take into account the degree of fragmentation, or genome completeness, and is not necessarily representative of actual species distribution in the samples. Furthermore, undetermined sequences are abundant in viral metagenomic datasets, resulting in several independent contigs that cannot be assigned by homology or genomic information. These may only be classified as different operational taxonomic units (OTUs), sometimes remaining inadvisably unrelated.
Thus, calculations using contigs as different OTUs ultimately overestimate diversity when compared to diversity calculated from species coverage. In order to compare the effect of coverage and fragmentation, we generated three sets of simulated Illumina paired-end reads with different sequencing depths. We compared different assemblies performed with RayMeta, CLC Assembly Cell, MEGAHIT, SPAdes, Meta-IDBA, SOAPdenovo, Velvet, MetaVelvet, and MIRA with the best attainable assemblies for each dataset (formed by arranging data using known genome coordinates) by calculating different assembly statistics. A new fragmentation score was included to estimate the degree of genome fragmentation of each taxon and adjust the coverage accordingly. The abundance in the metagenome was compared by bootstrapping the assembly data and hierarchically clustering them with the best possible assembly. Additionally, richness and diversity indexes were calculated for all the resulting assemblies and were assessed under two distributions: contigs as independent OTUs and sequences classified by species. Finally, we searched for the strongest correlations between the diversity indexes and the different assembly statistics. Although fragmentation was dependent on genome coverage, it was not as heavily influenced by the assembler. The sequencing depth was the predominant factor that influenced the success of the assemblies. The coverage increased markedly in larger datasets, whereas fragmentation values remained lower and unsaturated. While still far from obtaining the ideal assemblies, the RayMeta, SPAdes, and CLC assemblers managed to build the most accurate contigs with larger datasets, while Meta-IDBA showed a good performance with the medium-sized dataset, even after the adjusted coverage was calculated. Their resulting assemblies showed the highest coverage scores and the lowest fragmentation values.
Alpha diversity calculated from contigs as OTUs resulted in significantly higher values for all assemblies when compared with actual species distribution, showing an overestimation due to the increased predicted abundance. Conversely, using PHACCS resulted in lower values for all assemblers. Different association methods (random forest, generalized linear models, and the Spearman correlation index) support the number of contigs, the coverage, and fragmentation as the assembly parameters that most affect the estimation of the alpha diversity. Coverage calculations may provide an insight into the relative completeness of a genome, but they overlook missing fragments or overly separated sequences in a genome. The assembly of a highly fragmented genome with high coverage may still lead to the clustering of different OTUs that are actually different fragments of a genome. Thus, it proves useful to penalize coverage with a fragmentation score. Using contigs for calculating alpha diversity results in overestimation, but it is usually the only approach available. Still, it is enough for sample comparison. The best approach may be determined by choosing the assembler that better fits the sequencing depth and adjusting the parameters for longer accurate contigs whenever possible, whereas diversity may be calculated considering taxonomical and genomic information if available.
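
The overestimation described above can be illustrated with a minimal Python sketch. It computes the Shannon index under the two distributions named in the abstract: contigs as independent OTUs, and contigs pooled by species. The contig table, the species labels, and the choice of the Shannon index are assumptions made for illustration only; they are not the authors' data or pipeline.

```python
import math
from collections import defaultdict


def shannon(abundances):
    """Shannon index H' = -sum(p_i * ln p_i) over non-zero abundances."""
    total = sum(abundances)
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)


# Hypothetical contigs: (species label, number of reads mapped to the contig).
contigs = [
    ("phage_A", 40), ("phage_A", 35), ("phage_A", 25),  # one genome, three fragments
    ("phage_B", 50), ("phage_B", 50),                   # one genome, two fragments
    ("phage_C", 100),                                   # one fully assembled genome
]

# Distribution 1: every contig is treated as its own OTU.
h_contigs = shannon([reads for _, reads in contigs])

# Distribution 2: contigs are pooled by species before computing diversity.
by_species = defaultdict(int)
for species, reads in contigs:
    by_species[species] += reads
h_species = shannon(list(by_species.values()))

print(f"H' with contigs as OTUs: {h_contigs:.3f}")
print(f"H' with species pooling: {h_species:.3f}")
```

In this toy case the three species are equally abundant, so the pooled index equals ln 3, while splitting two of the genomes into fragments can only raise the value. This is the direction of the bias the abstract reports for contig-based alpha diversity.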

Figure 7: Associations of the assembly statistics and the alpha diversity metrics. Correlation plot representing all possible associations between the assembly estimators and the alpha diversity estimators. Black dots mark statistically significant Spearman's rank correlations (q-value < 0.05). Green boxes represent all the assembly statistics that were significant for the generalized linear model that predicted the alpha diversity index. Red stars represent the variables that most increase the mean square error (MSE) for the model determined by a random forest algorithm. Blue ellipses tilted right represent positive correlations, whereas negative correlations are represented by red ellipses tilted left.

Mentions: Figure 7 summarizes the associations predicted by the three association methods. The GLM method predicted more associations between the diversity estimators than the other two, suggesting most parameters may have an influence on the diversity. The other two, the percentage of increment of the mean square error (%IncMSE) produced by random forest and the significance of the Spearman correlation p-values (SCp), were more conservative in establishing associations between the alpha diversity estimators and assembly parameters, avoiding most of the chimeric contigs.
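
As a rough illustration of the Spearman association shown in Figure 7, the following Python sketch ranks two variables (averaging tied ranks) and takes the Pearson correlation of the ranks. The per-assembly values are hypothetical, and the significance step (p-values and q-value correction) is deliberately left out, so this is a sketch of the correlation coefficient only, not a reproduction of the authors' analysis.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for five assemblies (not taken from the paper).
n_contigs = [120, 340, 95, 410, 220]      # number of contigs per assembly
longest = [900, 400, 1100, 300, 600]      # longest contig per assembly
shannon_h = [3.1, 4.0, 2.9, 4.4, 3.6]     # contig-based Shannon index

rho_contigs = spearman(n_contigs, shannon_h)
rho_longest = spearman(longest, shannon_h)
print(rho_contigs, rho_longest)
```

With these made-up numbers the contig count rises monotonically with the diversity index and the longest contig falls monotonically, which corresponds to a right-tilted and a left-tilted ellipse in a plot such as Figure 7.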

