Statistical Models for Data Mining: General Inferences and Class Discovery in Large Databases

Open Access
Author:
Browning, John Duncan
Graduate Program:
Electrical Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
December 06, 2002
Committee Members:
  • David Jonathan Miller, Committee Chair
  • Nirmal Bose, Committee Member
  • John F Doherty, Committee Member
  • George Kesidis, Committee Member
  • C Lee Giles, Committee Member
Keywords:
  • Statistical models
  • EM
  • Class Discovery
  • Data Mining
Abstract:
This thesis applies statistical models to data mining, the search for patterns in large data sets. With cheaper high-capacity storage, faster communication, and increasing computing power, large databases can be searched, or 'mined', for correlations in the data. These databases arise in business, biology, astronomy, weather forecasting, natural language applications, speech recognition, and many other areas. They are typically much larger than traditional pattern recognition databases, so algorithms applied to them must scale with the data. A second identifying trait of data mining applications is missing and erroneous data: errors can occur during data entry, and values can be missing, either randomly or deterministically. One advantage of statistical models is that they rest on a mathematical theory that enables a principled approach to missing or erroneous data. We investigate the application of statistical models to two data mining tasks in which much of the data is missing. The first is collaborative filtering, which involves inference when most of the data is missing. The second is a new problem in which some of the data comes from unknown classes that must be discovered; this problem is related to data clustering.
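As an illustration of what a principled treatment of missing data can look like, the sketch below marginalizes over missing entries in a mixture model. It is not taken from the thesis: it assumes a Gaussian mixture with diagonal covariances, uses None to mark a missing value, and all function names are hypothetical. Under the diagonal assumption, marginalizing a missing dimension amounts to leaving it out of the likelihood, and a missing value can be filled in with its posterior-weighted component mean.

```python
import math


def log_gaussian_observed(x, mean, var):
    """Log density of a diagonal Gaussian over the observed entries of x.

    Entries equal to None are treated as missing and skipped; with a
    diagonal covariance this is exactly marginalization over them.
    """
    total = 0.0
    for xi, m, v in zip(x, mean, var):
        if xi is None:
            continue
        total += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
    return total


def responsibilities(x, weights, means, variances):
    """Posterior probability of each component given the observed entries."""
    logs = [
        math.log(w) + log_gaussian_observed(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(value - peak) for value in logs]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]


def impute(x, weights, means, variances):
    """Replace each missing entry with its conditional mean under the mixture."""
    post = responsibilities(x, weights, means, variances)
    filled = []
    for d, xi in enumerate(x):
        if xi is None:
            filled.append(sum(p * m[d] for p, m in zip(post, means)))
        else:
            filled.append(xi)
    return filled


# Hypothetical two-component model in two dimensions (illustrative values only).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [10.0, 10.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

post = responsibilities([0.0, None], weights, means, variances)
filled = impute([0.0, None], weights, means, variances)
```

Here the observed first coordinate places nearly all posterior mass on the first component, so the missing second coordinate is imputed close to that component's mean. The same posterior computation is the E-step of EM when training data contain missing values.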