Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples.
Bottom Line:
Conditional on the true propensity score, treated and untreated subjects have similar distributions of observed baseline covariates. Using this approach, matched sets of treated and untreated subjects with similar values of the propensity score are formed. Inferences about treatment effect made using propensity-score matching are valid only if, in the matched sample, treated and untreated subjects have similar distributions of measured baseline covariates.
Affiliation: Institute for Clinical Evaluative Sciences, G1 06, 2075 Bayview Avenue, Toronto, Ontario, Canada M4N 3M5. peter.austin@ices.on.ca
ABSTRACT
The propensity score is a subject's probability of treatment, conditional on observed baseline covariates. Conditional on the true propensity score, treated and untreated subjects have similar distributions of observed baseline covariates. Propensity-score matching is a popular method of using the propensity score in the medical literature. Using this approach, matched sets of treated and untreated subjects with similar values of the propensity score are formed. Inferences about treatment effect made using propensity-score matching are valid only if, in the matched sample, treated and untreated subjects have similar distributions of measured baseline covariates. In this paper we discuss the following methods for assessing whether the propensity score model has been correctly specified: comparing means and prevalences of baseline characteristics using standardized differences; ratios comparing the variance of continuous covariates between treated and untreated subjects; comparison of higher-order moments and interactions; five-number summaries; and graphical methods such as quantile-quantile plots, side-by-side boxplots, and non-parametric density plots for comparing the distribution of baseline covariates between treatment groups. We describe methods to determine the sampling distribution of the standardized difference when the true standardized difference is equal to zero, thereby allowing one to determine the range of standardized differences that are plausible with the propensity score model having been correctly specified. We highlight the limitations of some previously used methods for assessing the adequacy of the specification of the propensity-score model. In particular, methods based on comparing the distribution of the estimated propensity score between treated and untreated subjects are uninformative.
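The first two diagnostics named in the abstract, standardized differences and variance ratios, can be sketched in a few lines of Python. This is a minimal illustration that assumes the commonly used definition of the standardized difference (the difference in means or prevalences divided by the square root of the average of the two group variances). The function names and the toy inputs are hypothetical and are not taken from the article.

```python
import numpy as np


def std_diff_continuous(treated, control):
    """Standardized difference of the mean for a continuous covariate."""
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    # Assumed definition: average the two sample variances, then take the root.
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2.0)
    return (treated.mean() - control.mean()) / pooled_sd


def std_diff_binary(treated, control):
    """Standardized difference of the prevalence for a 0/1 covariate."""
    p_t = np.mean(treated)
    p_c = np.mean(control)
    # Binomial variance p(1 - p) in each group, averaged as above.
    pooled_sd = np.sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2.0)
    return (p_t - p_c) / pooled_sd


def variance_ratio(treated, control):
    """Ratio of sample variances (treated over control) for a continuous covariate."""
    return np.var(treated, ddof=1) / np.var(control, ddof=1)


# Toy data (hypothetical): both groups have unit sample variance and means one apart.
print(std_diff_continuous([2, 3, 4], [1, 2, 3]))
print(variance_ratio([2, 4, 6], [1, 2, 3]))
```

In a well-balanced matched sample one would expect standardized differences near zero and variance ratios near one. The sketch does not guard against a covariate that is constant in both groups, in which case the pooled standard deviation is zero.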
Mentions: We conducted an additional set of analyses to illustrate the influence of sample size on the variability of the estimated standardized difference. We selected four continuous covariates: age, heart rate, respiratory rate, and hemoglobin. We then selected a random sample of matched pairs. We allowed the number of matched pairs to increase from 100 to 2400 in increments of 100. Within each random sample of matched pairs, we used the method described above to estimate the standard deviation of the empirical sampling distribution of the standardized difference of the mean. The relationship between the standard deviation of the empirical sampling distribution and the number of matched pairs is shown in Figure 2. This figure clearly illustrates the inverse relationship between sample size and the standard deviation of the empirical sampling distribution of the standardized difference of the mean. When the number of matched pairs is small, there is a wide range of standardized differences that are consistent with the propensity-score model having been correctly specified. For instance, if there were only 100 matched pairs, then the 2.5th and 97.5th percentiles of the sampling distribution of the standardized difference of the mean for hemoglobin would be approximately −0.234 and 0.234, respectively. If the number of matched pairs increased to 200, then these percentiles would change to −0.166 and 0.166, respectively. Thus, in small matched samples, moderate standardized differences could still be consistent with the propensity-score model having been correctly specified. This figure illustrates the concept that in observational studies, as in RCTs, balance is a large-sample property; moderate imbalance can be expected in small samples, even if the propensity score model has been correctly specified.
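The dependence of the null sampling distribution on the number of matched pairs can be explored with a small simulation. The sketch below is not the procedure used in the article, which is not reproduced in this excerpt. It assumes one simple way to approximate the distribution when the true standardized difference is zero: randomly swapping the treated and untreated labels within each matched pair. The function names, the synthetic standard-normal covariate, and the default of 1000 permutations are all hypothetical choices.

```python
import numpy as np


def null_std_diff_interval(x_treated, x_control, n_perm=1000, seed=0):
    """2.5th and 97.5th percentiles of the standardized difference under
    random within-pair label swaps (an assumed null-generating scheme)."""
    rng = np.random.default_rng(seed)
    x_t = np.asarray(x_treated, dtype=float)
    x_c = np.asarray(x_control, dtype=float)
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        # Swap the two members of each pair with probability 0.5.
        swap = rng.random(x_t.size) < 0.5
        a = np.where(swap, x_c, x_t)
        b = np.where(swap, x_t, x_c)
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
        diffs[i] = (a.mean() - b.mean()) / pooled_sd
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return lo, hi


def simulate_interval(n_pairs, seed=0):
    """Draw a synthetic balanced covariate for n_pairs pairs and return its null interval."""
    rng = np.random.default_rng(seed)
    x_t = rng.normal(size=n_pairs)
    x_c = rng.normal(size=n_pairs)
    return null_std_diff_interval(x_t, x_c, seed=seed + 1)


for n in (100, 200, 400):
    print(n, simulate_interval(n))
```

The intervals narrow as the number of pairs grows. The percentiles quoted above are consistent with roughly 1/sqrt(N) scaling: 0.234 divided by sqrt(2) is about 0.165, close to the reported 0.166 for 200 pairs. The synthetic data here are independent draws, so the simulated widths should not be expected to match the article's values.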