Novel methods and application of Bayesian hierarchical regression models

Open Access
- Author:
- Zhang, Amy
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- July 29, 2021
- Committee Members:
- David Hunter, Major Field Member
Xiaoyue Niu, Major Field Member
C Lee Giles, Outside Unit & Field Member
Le Bao, Chair & Dissertation Advisor
Ephraim Mont Hanks, Program Head/Chair
Michael J. Daniels, Special Member
- Keywords:
- Bayesian models
Bayesian hierarchical regression
Approximate cross-validation
Information pooling
Statistics
- Abstract:
- We present two novel methods and one application study for Bayesian hierarchical regression models (BHRMs). BHRMs are a large, flexible class of probabilistic models and range in complexity from one-way linear regressions to Gaussian Markov random fields. We demonstrate their efficacy on an application study of HIV prevalence in certain key sub-populations that are particularly vulnerable to HIV. The data are heavily imbalanced and often have small sample sizes. We show that a BHRM which pools information across key populations to model the HIV prevalence trend over time reduces predictive error in comparison to modeling the HIV prevalence trend separately within key populations. We compare different levels of pooling, with partial pooling providing the greatest flexibility while also maintaining a low predictive error under leave-a-cluster-out cross-validation. Although information pooling is often credited for the good performance of Bayesian hierarchical models, it has not been explicitly quantified except in the simplest of one-way settings. We propose a novel method which explicitly quantifies information pooling and shrinkage for all regression models, via what we call the borrowing factors. The borrowing factors decompose regression model estimates into weights placed on clusters of data. The weights are informed only by the model definition and data availability and thus can be used to explicitly link the effects of data imbalance and model assumptions to actual model estimates. We also provide a metric to identify point estimates which rely heavily on specific clusters of data, called SSBF. We present theoretical properties of SSBF and the borrowing factors and demonstrate their usage on two examples. We next present an extension of the borrowing factors to cross-validation. BHRMs can be computationally expensive to fit. Cross-validation (CV) is therefore not a common practice to evaluate the predictive performance of BHRMs. 
We present a novel method which circumvents the need to re-run computationally costly estimation methods for each cross-validation fold and makes CV more feasible for large BHRMs. By conditioning on the variance-covariance parameters estimated from all the data, we shift the CV problem from probability-based sampling to a simple and familiar optimization problem. In many cases, this produces estimates which are equivalent to full CV. We provide theoretical results and a diagnostic for finite sample sizes, and demonstrate the method's efficacy on publicly available data.
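The two ideas in the abstract (estimates expressed as weights on clusters of data, and cross-validation conditional on fixed variance parameters) can be illustrated in the simplest one-way setting. The sketch below is not the dissertation's definition of the borrowing factors or of its CV estimator. It assumes a random-intercept model y_i = mu + u_c(i) + e_i with a flat prior on mu, and it treats the variance ratio `lam = sigma^2 / tau^2` as already estimated from all the data. The function names are hypothetical. Under these assumptions the point estimates solve a penalized least-squares problem, so they are linear in the data, and each leave-a-cluster-out refit is a single linear solve.

```python
import numpy as np

def conditional_operator(cluster, n_clusters, lam):
    """Linear map from y to (mu, u_1..u_J) with lam = sigma^2 / tau^2 held fixed."""
    n = len(cluster)
    design = np.zeros((n, 1 + n_clusters))
    design[:, 0] = 1.0                          # intercept column for mu
    design[np.arange(n), 1 + cluster] = 1.0     # cluster indicator columns
    # Ridge-type penalty on the random intercepts only; mu is unpenalized.
    penalty = np.diag(np.r_[0.0, np.full(n_clusters, lam)])
    return np.linalg.solve(design.T @ design + penalty, design.T)

def cluster_weights(cluster, n_clusters, lam, target):
    """Total weight that the estimate of mu + u_target places on each cluster."""
    op = conditional_operator(cluster, n_clusters, lam)
    obs_weights = op[0] + op[1 + target]        # weight on every observation
    return np.bincount(cluster, weights=obs_weights, minlength=n_clusters)

def leave_cluster_out_predictions(y, cluster, n_clusters, lam):
    """Leave-a-cluster-out predictions, reusing lam instead of re-estimating it."""
    preds = np.empty_like(y, dtype=float)
    for held_out in range(n_clusters):
        test = cluster == held_out
        op = conditional_operator(cluster[~test], n_clusters, lam)
        coef = op @ y[~test]
        # An unseen cluster has u = 0, so its prediction is the pooled mean.
        preds[test] = coef[0]
    return preds

# Toy, made-up data: one large cluster and two singletons.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])
cluster = np.array([0, 0, 0, 0, 1, 2])
lam = 1.0   # assumed variance ratio, fixed across folds

for j in range(3):
    print("weights for cluster", j, cluster_weights(cluster, 3, lam, j))
print("CV predictions", leave_cluster_out_predictions(y, cluster, 3, lam))
```

In this toy setting the weights for each cluster's estimate sum to one, and a small cluster places more of its weight on the other clusters than a large cluster does, which is the usual picture of partial pooling under data imbalance.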