Multivariate Cutoff Level Analysis (MultiCoLA) of large community data sets.

Gobet A, Quince C, Ramette A - Nucleic Acids Res. (2010)

Bottom Line: Yet this question has not been satisfactorily addressed, nor has the impact of such a definition on data set structure and interpretation been fully evaluated. Here we propose a strategy, MultiCoLA (Multivariate Cutoff Level Analysis), to systematically assess the impact of various abundance or rarity cutoff levels on the resulting data set structure and on the consistency of the further ecological interpretation. Future applications can be foreseen for data sets from different types of habitats, e.g. other marine environments, soil and human microbiota.

Affiliation: Microbial Habitat Group, Max Planck Institute for Marine Microbiology, Bremen, Germany.

ABSTRACT
High-throughput sequencing techniques are becoming attractive to molecular biologists and ecologists as they provide a time- and cost-effective way to explore diversity patterns in environmental samples at an unprecedented resolution. An issue common to many studies is the definition of what fractions of a data set should be considered as rare or dominant. Yet this question has not been satisfactorily addressed, nor has the impact of such a definition on data set structure and interpretation been fully evaluated. Here we propose a strategy, MultiCoLA (Multivariate Cutoff Level Analysis), to systematically assess the impact of various abundance or rarity cutoff levels on the resulting data set structure and on the consistency of the further ecological interpretation. We applied MultiCoLA to a 454 massively parallel tag sequencing data set of V6 ribosomal sequences from marine microbes in temperate coastal sands. Consistent ecological patterns were maintained after removing up to 35-40% rare sequences, and similar patterns of beta diversity were observed after denoising the data set by using a preclustering algorithm of 454 flowgrams. This example validates the importance of exploring the impact of the definition of rarity in large community data sets. Future applications can be foreseen for data sets from different types of habitats, e.g. other marine environments, soil and human microbiota.

Figure 2: MultiCoLA steps. After truncating the original table according to various abundance cutoff levels, the effects of specific rarity definitions are tested by applying three types of analyses: (1) Variations in data set structure are established based on non-parametric correlations of pairwise distance matrices (e.g. calculated with the Bray–Curtis coefficient). (2) The amounts of extracted community variation (using NMDS) from the original data and the truncated data sets are compared by Procrustes correlations. (3) When additional parameters are available, the biological variation that can be explained by environmental parameters in the original and in the truncated data sets is then systematically compared. D, dominant OTUs; R, rare OTUs.

Mentions: Two approaches may be applied to truncate the original data set when removing an increasing proportion of rare types: either the whole data set is considered or each sample is considered individually (Figure 1). Because there is no reason to choose a given threshold value a priori, various cutoffs need to be systematically applied to explore their effects. The resulting truncated data sets are then evaluated at three levels. First, the data sets are converted to sample-by-sample dissimilarity matrices (here we used the Bray–Curtis coefficient to calculate the dissimilarity between samples, but other dissimilarity coefficients may be used), and those matrices are compared with the matrix produced by the whole data set using non-parametric Spearman correlations (Figure 2), so as to assess changes in data structure. Second, the amounts of extracted ecological variation, obtained by the application of the NMDS ordination, in the truncated and original data sets are compared by Procrustes rotation (i.e. a measure of the correlation between two ordination solutions). Third, when contextual parameters (e.g. space, time or environment) are available, it is possible to systematically compare the ecological interpretation of each truncated data set with that of the original data set. This is achieved by partitioning the biological variation from the different truncated data sets as a function of explanatory variables (Materials and Methods section).
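
The truncation step and the first evaluation level described above can be illustrated with a minimal Python sketch. This is not the authors' MultiCoLA code. The function names, the toy count table and the exact truncation rule (zero out the rarest OTUs until removing one more would exceed the chosen fraction of sequences) are assumptions made for illustration only. The NMDS/Procrustes and variation-partitioning levels are not covered.

```python
from itertools import combinations


def truncate_whole(table, rare_fraction):
    """Zero out the rarest OTUs (columns), ranked on pooled counts, until
    removing one more would exceed `rare_fraction` of all sequences.
    Assumed rule for illustration; rows are samples, columns are OTUs."""
    n_otus = len(table[0])
    totals = [sum(row[j] for row in table) for j in range(n_otus)]
    budget = rare_fraction * sum(totals)
    removed, cumulative = set(), 0
    for j in sorted(range(n_otus), key=lambda k: totals[k]):  # rarest first
        if cumulative + totals[j] > budget:
            break
        cumulative += totals[j]
        removed.add(j)
    return [[0 if j in removed else c for j, c in enumerate(row)] for row in table]


def truncate_per_sample(table, rare_fraction):
    """Apply the same rule to each sample (row) individually."""
    return [truncate_whole([row], rare_fraction)[0] for row in table]


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors."""
    total = sum(a) + sum(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / total if total else 0.0


def pairwise_bray_curtis(table):
    """Dissimilarities for all sample pairs, in a fixed pair order."""
    pairs = combinations(range(len(table)), 2)
    return [bray_curtis(table[i], table[j]) for i, j in pairs]


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else float("nan")


def structure_correlation(table, rare_fraction, truncate=truncate_whole):
    """Correlate the original and truncated dissimilarity matrices."""
    truncated = truncate(table, rare_fraction)
    return spearman(pairwise_bray_curtis(table), pairwise_bray_curtis(truncated))


# Hypothetical toy table: 4 samples (rows) x 6 OTUs (columns).
toy = [
    [50, 30, 5, 1, 0, 2],
    [45, 35, 3, 0, 1, 0],
    [10, 60, 8, 2, 0, 1],
    [5, 70, 0, 0, 3, 1],
]

if __name__ == "__main__":
    for cutoff in (0.0, 0.05, 0.10, 0.20, 0.40):
        print(cutoff, structure_correlation(toy, cutoff))
```

Passing `truncate=truncate_per_sample` to `structure_correlation` switches from the whole-data-set approach to the sample-by-sample approach. Scanning a range of cutoffs, as in the final loop, mirrors the idea that no threshold should be fixed a priori.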

