INFORMATION-THEORETIC APPROACHES FOR COMBINED MODELING OF QUALITATIVE AND QUANTITATIVE DATA
Open Access
- Author:
- Salaka, Vamsi
- Graduate Program:
- Industrial Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- May 05, 2008
- Committee Members:
- Dennis Kon Jin Lin, Committee Member
Richard Allen Wysk, Committee Member
Vittaldas V Prabhu, Committee Chair/Co-Chair
Meg Leavy Small, Committee Member
- Keywords:
- ontology
Data Mining
Text Mining
Information theory
vector space modeling
- Abstract:
- In the past decade, remarkable progress has been made in developing data mining techniques, which are limited to analyzing quantitative data, and text mining techniques, which are limited to analyzing qualitative data. However, there has been limited effort to develop techniques that can handle combined qualitative and quantitative data. The hypothesis of this research is that analyzing combined qualitative and quantitative data yields richer information by uncovering insights that might otherwise be neglected. The key contributions of this work are: (1) A Methodology for Combined Analysis: a methodology to combine qualitative and quantitative data is proposed. (2) Information-Theoretic Metrics: information-theoretic metrics to measure the value of combining qualitative and quantitative data are proposed. (3) Information Description Framework (IDF): a novel representation for unifying combined qualitative and quantitative data, a conceptual model, and information from the combined analysis is developed. (4) Case Studies: the successful application of the current research to two real-world case studies from diverse domains is presented. The proposed metrics, which measure the value of combined analysis, are adapted from information theory. Specifically, the concept of information gain is used to capture the expected reduction in uncertainty from combining qualitative and quantitative data, the concept of mutual information is used to capture the concurrence of qualitative data with quantitative data, and the concept of conditional entropy is used to capture the reduction in conditional uncertainty, which measures the benefit derived from the individual qualitative or quantitative data in a combined analysis. The IDF is a representation for combined qualitative and quantitative data, a conceptual model, and the information from combined analysis.
IDF enables analysts to share results obtained from various statistical analysis and data mining techniques seamlessly. Semantic Bayesian Networks is an instance of IDF that was developed to represent information from a combined analysis, using Bayesian networks to model quantitative data and vector space models to model qualitative data. In the methodology for combined analysis, “Information Extraction” techniques are used to convert text into structured data by extracting useful patterns. This research deals only with text specific to a domain and thereby takes advantage of the domain's rich terminology and practices in defining patterns. “Vector Space Modeling”, a text mining technique, is used to analyze information from structured textual data. The application of the current research to two case studies is presented. The first case study is a customer survey in a restaurant chain aimed at improving “quality of service”. This case study is used to demonstrate the concepts of combined analysis and the information-theoretic metrics. The second case study involves monitoring the implementation quality of preventive mental health programs in schools. In this case study, different ways are identified to help researchers and school administrators automate some of their Social and Emotional Learning (SEL) programs to improve productivity while maintaining quality of implementation. In addition, a software system was developed in this case study to capture combined qualitative and quantitative data in a more unified manner. This second case study demonstrates the successful application of the proposed combined analysis methodology, information description framework, and vector space models. Specifically, the results from the case studies demonstrate a 20% information gain, the identification of missing variables, triangulated research methods to enhance confidence, and the provision of feedback to improve the quantitative questionnaire.
The current research can find applications in various domains, such as the service industry, manufacturing quality, market research, financial markets, sports, and entertainment, where qualitative and quantitative data are currently analyzed independently.
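
The three information-theoretic quantities named in the abstract (information gain, mutual information, and conditional entropy) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the toy survey records, the variable names (`rating`, `mentions_wait`, `satisfied`), and the helper functions are all hypothetical, and the standard textbook definitions are assumed. The sketch only shows how adding a qualitative feature to a quantitative one can reduce the remaining uncertainty about an outcome.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(target, given):
    """H(target | given) = H(target, given) - H(given)."""
    return entropy(list(zip(target, given))) - entropy(given)


def information_gain(target, given):
    """Expected reduction in uncertainty about `target` once `given` is known."""
    return entropy(target) - conditional_entropy(target, given)


def mutual_information(x, y):
    """I(x; y) = H(x) + H(y) - H(x, y); zero when x and y are independent."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


# Hypothetical toy survey (invented for illustration, not data from the thesis).
# rating: quantitative answer, bucketed; mentions_wait: qualitative feature
# extracted from free-text comments; satisfied: the outcome of interest.
rating = ["low", "low", "low", "low", "high", "high", "high", "high"]
mentions_wait = [True, True, False, False, True, True, False, False]
satisfied = [False, False, False, False, False, False, True, True]

combined = list(zip(rating, mentions_wait))

h_quant_only = conditional_entropy(satisfied, rating)    # quantitative data alone
h_combined = conditional_entropy(satisfied, combined)    # quantitative + qualitative

print(f"H(satisfied)                 = {entropy(satisfied):.3f} bits")
print(f"H(satisfied | rating)        = {h_quant_only:.3f} bits")
print(f"H(satisfied | rating, text)  = {h_combined:.3f} bits")
print(f"IG from rating alone         = {information_gain(satisfied, rating):.3f} bits")
print(f"IG from combined data        = {information_gain(satisfied, combined):.3f} bits")
print(f"I(rating; mentions_wait)     = {mutual_information(rating, mentions_wait):.3f} bits")
```

In this invented data set the rating alone leaves half a bit of uncertainty about satisfaction, while the rating together with the text-derived feature leaves none; the difference between the two conditional entropies is the kind of benefit the abstract attributes to the qualitative data in a combined analysis.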