Data-adaptive Approaches to Modeling Propensity Scores in Causal Inference Problems

Open Access
- Author:
- Zhu, Yeying
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- May 17, 2013
- Committee Members:
- Debashis Ghosh, Dissertation Advisor/Co-Advisor
Debashis Ghosh, Committee Chair/Co-Chair
Aleksandra B Slavkovic, Committee Member
Runze Li, Committee Member
Donna Coffman, Committee Member
- Keywords:
- causal inference
data-adaptive
data mining
observational studies
propensity scores
- Abstract:
- In most nonrandomized observational studies, differences between treatment groups may arise not only from the treatment but also from the effects of confounders. Causal inference about the treatment effect is therefore less straightforward than in a randomized trial. To adjust for confounding due to measured covariates, the average treatment effect is often estimated using propensity scores. The objective of this thesis is to develop innovative methods for estimating propensity scores in three different causal inference problems. In the first part, we focus on inverse probability weighted estimation of the causal effect of a binary treatment. We propose a data-adaptive approach that combines parametric and nonparametric machine learning algorithms for estimating propensity scores. Some theoretical results regarding the consistency of the procedure are given. Simulation studies are used to assess the performance of the proposed methods relative to existing methods, and a data analysis example from the Surveillance, Epidemiology and End Results (SEER) database is presented. In the second part, we extend the proposed methodology to causal mediation analysis using inverse probability weighting. We show that combining machine learning algorithms (e.g., a generalized boosted model) with logistic regression to estimate propensity scores can estimate the controlled direct effects more accurately and efficiently than logistic regression alone. The proposed methods are general in the sense that we can combine multiple candidate models and use a cross-validation criterion to select the optimal subset of candidate models to combine. The criterion balances the number of models combined against the variability of the resulting estimator. Simulation studies are conducted, and a data application to the Early Dieting in Girls study is presented. In the last part, we study the causal inference problem with continuous treatments.
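As a rough illustration of the inverse probability weighting idea in the binary-treatment setting of the first part, the following is a minimal sketch and not the implementation used in the thesis. It assumes a treatment coded 0/1 and uses the normalized (Hajek) form of the estimator. The function names `combine_scores` and `ipw_ate`, and the fixed convex weight used to blend two candidate propensity-score vectors, are hypothetical; the abstract does not state how the candidate models are combined.

```python
from typing import Sequence, List


def combine_scores(scores_a: Sequence[float],
                   scores_b: Sequence[float],
                   weight_a: float) -> List[float]:
    """Blend two candidate propensity-score vectors with a convex weight.

    Assumption: a simple convex combination stands in for the
    data-adaptive combining rule, which the abstract does not specify.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if len(scores_a) != len(scores_b):
        raise ValueError("score vectors must have the same length")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(scores_a, scores_b)]


def ipw_ate(treatment: Sequence[int],
            outcome: Sequence[float],
            propensity: Sequence[float]) -> float:
    """Normalized (Hajek) IPW estimate of the average treatment effect.

    Treated units get weight 1/e(x); control units get 1/(1 - e(x)),
    where e(x) is the estimated probability of receiving treatment.
    """
    num1 = den1 = num0 = den0 = 0.0
    for t, y, e in zip(treatment, outcome, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly in (0, 1)")
        if t == 1:
            w = 1.0 / e
            num1 += w * y
            den1 += w
        else:
            w = 1.0 / (1.0 - e)
            num0 += w * y
            den0 += w
    if den1 == 0.0 or den0 == 0.0:
        raise ValueError("both treatment groups must be non-empty")
    # Difference of the two weighted outcome means.
    return num1 / den1 - num0 / den0
```

In this sketch the two score vectors would come from, say, a logistic regression and a boosted model fitted separately; how the blending weight is chosen is left open here.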
In the continuous case, the generalized propensity score is defined as the conditional density of the treatment given the covariates. When the dimension of the covariates is large, estimation of the conditional density suffers from the curse of dimensionality. We propose an alternative approach, L2 boosting, to estimate propensity scores. In L2 boosting, an important tuning parameter is the number of trees to be generated. We propose a criterion, the average absolute correlation coefficient (AACC), to determine the optimal number of trees. A weighted AIC or BIC is then used to determine the parametric form of the dose-response function (DRF). The proposed methodology is demonstrated using the Early Dieting in Girls study, and the R code is provided in the Appendix.
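The abstract names the AACC criterion but does not define it, so the sketch below is only one plausible reading based on the name: the average, over covariates, of the absolute weighted Pearson correlation between the treatment and each covariate. It further assumes that each candidate number of trees yields a set of unit weights derived from the estimated generalized propensity score, and that a smaller AACC indicates better covariate balance. The names `weighted_corr`, `aacc`, and `select_num_trees` are hypothetical and are not taken from the thesis or its R code.

```python
import math
from typing import Dict, Sequence


def weighted_corr(x: Sequence[float], y: Sequence[float],
                  w: Sequence[float]) -> float:
    """Weighted Pearson correlation between x and y."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sw
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / sw
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(vx * vy)


def aacc(treatment: Sequence[float],
         covariates: Sequence[Sequence[float]],
         weights: Sequence[float]) -> float:
    """Average absolute weighted correlation of treatment with each covariate.

    `covariates` is a list of columns, one sequence per covariate.
    """
    corrs = [abs(weighted_corr(treatment, col, weights)) for col in covariates]
    return sum(corrs) / len(corrs)


def select_num_trees(treatment: Sequence[float],
                     covariates: Sequence[Sequence[float]],
                     weights_by_num_trees: Dict[int, Sequence[float]]) -> int:
    """Return the candidate number of trees whose weights minimize the AACC.

    Assumption: the weights for each candidate were computed beforehand
    from a boosted model stopped at that number of trees.
    """
    return min(weights_by_num_trees,
               key=lambda m: aacc(treatment, covariates, weights_by_num_trees[m]))
```

Under this reading, the criterion rewards weights that remove the association between treatment and covariates in the weighted sample, which is the balancing property a propensity score is meant to deliver.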