Investigation of topics in U-statistics and their applications in risk estimation and cross-validation

Open Access
Wang, Qing
Graduate Program:
Degree:
Doctor of Philosophy
Document Type:
Date of Defense:
May 03, 2012
Committee Members:
  • Bruce G. Lindsay, Dissertation Advisor
  • Bruce G. Lindsay, Committee Chair
  • Naomi S Altman, Committee Member
  • David Russell Hunter, Committee Member
  • Rongling Wu, Committee Member
Keywords:
  • cross-validation
  • kernel estimator
  • likelihood risk
  • model selection
  • partition resampling
  • variance estimation
  • U-statistics
The primary goal of my dissertation is to develop new methods, covering both theory and practical implementation, in the area of U-statistics. The area is old, with many important results first appearing in Hoeffding (1948), and U-statistics have found many applications in nonparametric statistics. Cross-validation and risk estimation form a modern and active area that has not traditionally been regarded as part of U-statistic theory, and the applications of my research focus on it.

The first objective of my research is to devise the best unbiased variance estimator for a general U-statistic. The estimator can be written as a quadratic form in the kernel function and is applicable whenever the kernel size k and the sample size n satisfy k<=n/2. It can also be expressed in a familiar ANOVA form, as a contrast of between-class and within-class variation. To make the proposed variance estimator more practical, we develop a partition resampling scheme that realizes the U-statistic and its variance estimator simultaneously with high computational efficiency.

We then turn to the use of U-statistics in risk estimation for the nonparametric kernel density estimator. We propose U-statistic-form estimates of the risks arising from the L2 and Kullback-Leibler distances, respectively. In addition, we consider a two-stage "subsampling+extrapolation" bandwidth selection procedure that dramatically reduces the variability of the conventional cross-validation bandwidth selector. It is equivalent to Hall and Robinson's (2009) rescaled "bagging cross-validation" bandwidth selector if one sets the fictional sample size equal to the bootstrap size. However, the simple form of our U-statistic risk estimator allows the aggregated risk to be calculated much more efficiently than by bootstrapping. Moreover, a real data example in the context of model selection is considered.
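To make the bandwidth selection problem concrete, the following is a minimal sketch of conventional leave-one-out likelihood cross-validation for a Gaussian kernel density estimator, which targets the Kullback-Leibler risk. It is the textbook baseline only: it is not the U-statistic risk estimator or the "subsampling+extrapolation" procedure of the dissertation, and the function names `loo_log_likelihood` and `select_bandwidth` are hypothetical.

```python
import math


def loo_log_likelihood(data, h):
    """Mean leave-one-out log density of a Gaussian KDE with bandwidth h."""
    n = len(data)
    norm = (n - 1) * h * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i, xi in enumerate(data):
        # Density at xi estimated from every observation except xi itself.
        dens = sum(
            math.exp(-0.5 * ((xi - xj) / h) ** 2)
            for j, xj in enumerate(data)
            if j != i
        ) / norm
        if dens == 0.0:
            # The kernel underflowed: this bandwidth is far too small.
            return float("-inf")
        total += math.log(dens)
    return total / n


def select_bandwidth(data, grid):
    """Pick the grid bandwidth with the largest leave-one-out log-likelihood."""
    return max(grid, key=lambda h: loo_log_likelihood(data, h))


if __name__ == "__main__":
    sample = [0.1, 0.4, 0.5, 1.3, 1.9, 2.2, 2.4, 3.8]
    grid = [0.1 * j for j in range(1, 31)]
    print(select_bandwidth(sample, grid))
```

The selector's criterion is an average over the sample, so its value, and hence the chosen bandwidth, can vary considerably from sample to sample. That variability is what the two-stage procedure described above is meant to reduce.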
We construct a U-statistic cross-validation tool, akin to BIC, for model selection. The U-estimator of the likelihood risk is more generally applicable than the AIC and BIC methods. In addition, with our proposed variance estimator for a general U-statistic, we can test which model has the smallest risk. Finally, we explore extrapolation and interpolation techniques with applications in bandwidth selection, variance estimation, and quantile estimation. Some preliminary results are discussed at the end of the dissertation.
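The condition k<=n/2 on the variance estimator can be illustrated with a small brute-force sketch. It assumes one standard construction: since Var(U) = E[U^2] - theta^2, an unbiased estimate of theta^2 can be built by averaging products of the kernel over pairs of disjoint size-k subsets, which requires 2k<=n. The code enumerates all subsets, so it suits only small samples. It does not implement the quadratic-form, ANOVA, or partition resampling representations of the dissertation, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, variance


def u_statistic(data, kernel, k):
    """Average of a symmetric kernel over all size-k subsets of the data."""
    return mean(kernel(*subset) for subset in combinations(data, k))


def u_variance_estimate(data, kernel, k):
    """Unbiased estimate of Var(U): U^2 minus an unbiased estimate of theta^2."""
    n = len(data)
    if 2 * k > n:
        raise ValueError("disjoint subset pairs require k <= n/2")
    # Kernel value for every size-k index subset.
    values = {
        idx: kernel(*(data[i] for i in idx))
        for idx in combinations(range(n), k)
    }
    u = mean(values.values())
    # Average the kernel products over ordered pairs of disjoint subsets.
    total, count = 0.0, 0
    for idx1, v1 in values.items():
        members = set(idx1)
        for idx2, v2 in values.items():
            if members.isdisjoint(idx2):
                total += v1 * v2
                count += 1
    return u * u - total / count


if __name__ == "__main__":
    sample = [1.0, 2.0, 4.0, 7.0, 11.0]
    half_sq_diff = lambda a, b: (a - b) ** 2 / 2.0  # kernel of the sample variance
    print(u_statistic(sample, half_sq_diff, 2), variance(sample))
    print(u_variance_estimate(sample, half_sq_diff, 2))
```

As a sanity check on the construction, the identity kernel with k=1 gives the sample mean as the U-statistic, and the estimate reduces to the usual s^2/n.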