Assessment and Selection of Competing Models for Zero-Inflated Microbiome Data.

Xu L, Paterson AD, Turpin W, Xu W - PLoS ONE (2015)

Bottom Line: We examine varying degrees of zero inflation, with or without dispersion in the count component, as well as different magnitude and direction of the covariate effect on structural zeros and the count components. We focus on the assessment of type I error, power to detect the overall covariate effect, measures of model fit, and bias and effectiveness of parameter estimations. We then discuss the model selection strategy for zero inflated data and implement it in a gut microbiome study of > 400 independent subjects.


Affiliation: Dalla Lana School of Public Health, University of Toronto, ON, M5T 3M7, Canada.

ABSTRACT
Typical data in a microbiome study consist of operational taxonomic unit (OTU) counts that are characterized by excess zeros, which investigators often ignore. In this paper, we compare the performance of competing methods for modeling zero-inflated data through extensive simulations and an application to a microbiome study. These methods include standard parametric and non-parametric models, hurdle models, and zero-inflated models. We examine varying degrees of zero inflation, with or without dispersion in the count component, as well as different magnitudes and directions of the covariate effect on the structural zeros and the count component. We focus on the assessment of type I error, power to detect the overall covariate effect, measures of model fit, and bias and effectiveness of parameter estimation. We also evaluate the ability of model selection strategies based on the Akaike information criterion (AIC) or the Vuong test to identify the correct model. The simulation studies show that hurdle and zero-inflated models have well-controlled type I error, higher power, and better goodness-of-fit measures, and that they estimate parameters more accurately and efficiently. In addition, hurdle models have goodness of fit and count-component parameter estimates similar to those of their corresponding zero-inflated models. However, the estimation and interpretation of the zero-component parameters differ, and hurdle models are more stable when structural zeros are absent. We then discuss the model selection strategy for zero-inflated data and implement it in a gut microbiome study of > 400 independent subjects.



Fig 7 (pone.0129606.g007): Estimates of β1 (or its hurdle-model counterpart) and their standard errors for data simulated under ZINB when ϕc = 20% and γ1 = 0.4. The figure displays box-plots of the estimates and their standard errors (SEs) for the covariate effect on the log-odds of structural zeros for the ZIP and ZINB methods, and on the log-odds of zeros for the hurdle models, from 1000 replications when γ1 = 0.4. In each box, the center line is the median, the bottom edge is the 25th percentile and the top edge is the 75th percentile. The whiskers extend 1.5 interquartile ranges (IQR) below the 25th percentile and 1.5 IQR above the 75th percentile, and outliers are shown as small circles. Panels (A1), (C1) and (E1) show the estimates of β1 for the consonant, neutral and dissonant effect cases, respectively. The horizontal line in these panels marks the true value of β1, which is −0.349 in (A1), 0 in (C1) and 0.287 in (E1). Panels (A2), (C2) and (E2) show the estimates of the hurdle-model counterpart of β1 for the consonant, neutral and dissonant effect cases, respectively. The horizontal line in these panels marks its true value, which is −0.420 in (A2), −0.240 in (C2) and −0.070 in (E2). The bias, standard deviation (SD) and root mean square error (RMSE) of the estimates are shown above the box-plot for each method. Panels (B1), (D1) and (F1) show the SEs of the estimates of β1, and panels (B2), (D2) and (F2) show the SEs of the estimates of its hurdle-model counterpart. The mean and SD of the estimated SEs are shown above the box-plot for each method.
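The summary statistics printed above each box-plot (bias, SD and RMSE across replications) can be computed as in the following minimal sketch. The function name `summarize_estimates` and the toy values are hypothetical and are not taken from the paper; the sketch assumes the sample (n − 1) form of the SD, which the caption does not specify.

```python
import math
import statistics


def summarize_estimates(estimates, true_value):
    """Return (bias, sd, rmse) of replicate estimates against a known true value."""
    n = len(estimates)
    mean_est = statistics.fmean(estimates)
    bias = mean_est - true_value              # average deviation from the truth
    sd = statistics.stdev(estimates)          # sample SD (n - 1 denominator), an assumption
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    return bias, sd, rmse


# Hypothetical replicate estimates scattered around a true value of -0.349
toy_estimates = [-0.30, -0.40, -0.35, -0.33, -0.37]
bias, sd, rmse = summarize_estimates(toy_estimates, true_value=-0.349)
print(f"bias={bias:.4f}  SD={sd:.4f}  RMSE={rmse:.4f}")
```

RMSE combines both sources of error: an unbiased but unstable estimator can have a larger RMSE than a slightly biased but stable one.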


Mentions: Fig 7 shows boxplots of the estimates and their SEs for β1 from the ZI models, and for the corresponding zero-component coefficient from the hurdle models, when the ZINB-simulated data have 20% zero inflation in the unexposed group and γ1 = 0.4. The true values of the hurdle coefficient are derived from the parameters of the ZI model using Equations 1 and 2. Results for other simulation settings are shown in S4 Fig, S5 Fig, S6 Fig and S7 Fig. Because the logistic regression part is the same, the estimates of this coefficient are identical across the different hurdle models. As in the case of γ1, ZINB gives unbiased estimates of β1 for both ZIP- and ZINB-distributed data, whereas ZIP is unbiased only for ZIP-distributed data. Note that when the proportion of zero inflation is low (e.g., ϕc = 20%), ZINB may give unstable results, with some large SEs. The estimates become more stable when the zero-inflation proportion increases to 50% or when the sample size is increased (results not shown). In contrast, hurdle models give unbiased and stable estimates of their zero-component coefficient in all simulation scenarios.
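To illustrate why the hurdle coefficient has a different true value from β1, the sketch below computes the log-odds ratio of observing any zero implied by a set of ZINB parameters. It is one plausible reading, not a reproduction of the paper's Equations 1 and 2, which are not included in this excerpt. It assumes a binary exposure, a logit link for the structural-zero probability, a log link for the negative binomial mean, and a size parameter `theta`; all numeric values are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def zinb_prob_zero(pi, mu, theta):
    """P(Y = 0) under ZINB: structural zeros plus negative binomial sampling zeros."""
    nb_zero = (theta / (theta + mu)) ** theta
    return pi + (1.0 - pi) * nb_zero


def implied_hurdle_slope(beta0, beta1, gamma0, gamma1, theta):
    """Log-odds ratio of observing a zero, exposed vs. unexposed (binary covariate)."""
    p0_unexposed = zinb_prob_zero(expit(beta0), math.exp(gamma0), theta)
    p0_exposed = zinb_prob_zero(expit(beta0 + beta1), math.exp(gamma0 + gamma1), theta)
    return logit(p0_exposed) - logit(p0_unexposed)


# Hypothetical setting: 20% structural zeros in the unexposed group, no covariate
# effect on structural zeros (beta1 = 0), and a positive effect on the count mean.
slope = implied_hurdle_slope(
    beta0=logit(0.20), beta1=0.0, gamma0=math.log(5.0), gamma1=0.4, theta=2.0
)
print(f"implied hurdle log-odds ratio of a zero: {slope:.3f}")
```

Under these assumptions, a covariate that raises the count mean lowers the chance of a sampling zero, so the implied hurdle coefficient is negative even when β1 = 0. This matches the direction of the neutral case in Fig 7, although the magnitude depends on parameter values that this excerpt does not give.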

