Two Topics: A Jackknife Maximum Likelihood Approach to Statistical Model Selection and a Convex Hull Peeling Depth Approach to Nonparametric Massive Multivariate Data Analysis with Applications

Open Access
Author:
Lee, Hyunsook
Graduate Program:
Statistics
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
June 16, 2006
Committee Members:
  • G Jogesh Babu, Committee Chair
  • Thomas P Hettmansperger, Committee Member
  • Bing Li, Committee Member
  • William Kenneth Jenkins, Committee Member
  • James Landis Rosenberger, Committee Member
Keywords:
  • Kullback-Leibler distance
  • information criterion
  • statistical model selection
  • jackknife
  • jackknife information criterion
  • bias reduction
  • maximum likelihood
  • unbiased estimation
  • non-nested model
  • computational geometry
  • convex hull
  • convex hull peeling
  • statistical data depth
  • nonparametric statistics
  • generalized quantile process
  • descriptive statistics
  • skewness
  • kurtosis
  • volume functional
  • convex hull level set
  • massive data
  • outlier detection
  • balloon plot
  • multivariate analysis
Abstract:
This dissertation presents two topics from opposite disciplines: one belongs to the parametric realm and the other is based on nonparametric methods. The first topic is a jackknife maximum likelihood approach to statistical model selection; the second is a convex hull peeling depth approach to nonparametric massive multivariate data analysis. The second topic includes simulations and applications to massive astronomical data. First, we present a model selection criterion that minimizes the Kullback-Leibler distance via the jackknife method. Various model selection methods have been developed to choose a model of minimum Kullback-Leibler distance to the true model, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), minimum description length (MDL), and the bootstrap information criterion. Likewise, the jackknife method chooses a model of minimum Kullback-Leibler distance through bias reduction. This bias, which is inevitable in model selection problems, arises from estimating the distance between an unknown true model and an estimated model. We show that (a) the jackknife maximum likelihood estimator is consistent for the parameter of interest, (b) the jackknife estimate of the log likelihood is asymptotically unbiased, and (c) the stochastic order of the jackknife log likelihood estimate is $O(\log \log n)$. Because of these properties, the jackknife information criterion is applicable to problems of choosing a model from non-nested candidates, especially when the true model is unknown. Compared to popular information criteria, which are only applicable to nested models, the jackknife information criterion is more robust in filtering various types of candidate models to choose the best approximating model. This robustness, however, comes at the cost that the jackknife criterion cannot discriminate among nested models.
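The jackknife bias reduction underlying the criterion can be illustrated with a minimal sketch. This is not the dissertation's exact criterion: it applies the generic jackknife combination $n\,T - (n-1)\,\overline{T_{-i}}$ to the plug-in mean log-likelihood of a simple normal model, and the function names are hypothetical.

```python
import numpy as np

def loglik_normal(x, mu, sigma):
    # Mean log-likelihood of x under a Normal(mu, sigma^2) model.
    return np.mean(-0.5 * np.log(2.0 * np.pi * sigma**2)
                   - (x - mu)**2 / (2.0 * sigma**2))

def jackknife_loglik(x):
    # Generic jackknife bias reduction of a statistic T:
    #   T_jack = n * T - (n - 1) * mean_i T_{-i},
    # where T is computed on the full sample and T_{-i} on the
    # sample with observation i deleted.
    n = len(x)
    full = loglik_normal(x, x.mean(), x.std())  # plug-in at the full-sample MLE
    loo = np.empty(n)
    for i in range(n):
        xi = np.delete(x, i)                             # leave-one-out sample
        loo[i] = loglik_normal(xi, xi.mean(), xi.std())  # plug-in at the LOO MLE
    return n * full - (n - 1) * loo.mean()
```

The plug-in log-likelihood is optimistically biased because the same data are used both to fit and to evaluate the model; the jackknife combination reduces that bias, which is the role it plays in the model selection criterion.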
Next, we explore the convex hull peeling process to develop empirical tools for statistical inference on massive multivariate data. The convex hull and its peeling process have intuitive appeal for robust location estimation. We define the convex hull peeling depth, which enables the ordering of multivariate data. This ordering process, based on data depth, provides ways to obtain multivariate quantiles, including the median. Based on the generalized quantile process, we define a convex hull peeling central region, a convex hull level set, and a volume functional, which lead to one-dimensional mappings that describe the shapes of multivariate distributions along data depth. We define empirical skewness and kurtosis measures based on the convex hull peeling process. In addition to these empirical descriptive statistics, we develop several methodologies for detecting multivariate outliers in massive data sets. These outlier detection algorithms are (1) estimating multivariate quantiles up to the level $\alpha$, (2) detecting changes in a measure sequence of convex hull level sets, and (3) constructing a balloon to exclude outliers. The convex hull peeling depth is a robust estimator, so the existence of outliers does not affect the properties of inner convex hull level sets. We illustrate all of these characteristics of the convex hull peeling process on bivariate synthetic data sets, and we demonstrate that the empirical procedures are applicable to real massive data sets by employing quasars and galaxies from the Sloan Digital Sky Survey. Interesting scientific results from the convex hull peeling multivariate data analysis are also provided.
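The basic peeling idea can be sketched in the bivariate case: repeatedly compute the convex hull of the remaining points, assign the hull vertices the current layer number, remove them, and repeat; small depth then means outlying, which is why outermost layers serve as outlier candidates. The sketch below is illustrative only, assuming two dimensions and using Andrew's monotone chain for the hull; function names are hypothetical and the dissertation's algorithms are more general.

```python
import numpy as np

def hull_indices(pts):
    # Vertices of the 2-D convex hull of pts (Andrew's monotone chain),
    # returned as a set of row indices into pts.
    order = np.lexsort((pts[:, 1], pts[:, 0]))  # sort by x, then y

    def half_hull(indices):
        chain = []
        for i in indices:
            # Pop while the last two chain points and pts[i] fail to
            # make a counter-clockwise turn.
            while len(chain) >= 2:
                o, a = pts[chain[-2]], pts[chain[-1]]
                cross = (a[0] - o[0]) * (pts[i][1] - o[1]) \
                      - (a[1] - o[1]) * (pts[i][0] - o[0])
                if cross <= 0:
                    chain.pop()
                else:
                    break
            chain.append(i)
        return chain[:-1]  # last point starts the other half

    return set(half_hull(order) + half_hull(order[::-1]))

def peeling_depth(points):
    # Convex hull peeling depth: 1 = outermost hull layer, 2 = next
    # layer inward, and so on; larger depth means more central.
    pts = np.asarray(points, dtype=float)
    remaining = list(range(len(pts)))
    depth = np.zeros(len(pts), dtype=int)
    layer = 0
    while len(remaining) >= 3:
        layer += 1
        on_hull = hull_indices(pts[remaining])
        for j in on_hull:
            depth[remaining[j]] = layer
        remaining = [r for j, r in enumerate(remaining) if j not in on_hull]
    for r in remaining:  # fewer than 3 points left: innermost layer
        depth[r] = layer + 1
    return depth
```

Under this convention, points with depth 1 lie on the outermost hull and are natural outlier candidates, while the deepest layers approximate a multivariate median region; robustness follows because peeling off outer layers leaves the inner level sets unchanged.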