Study of Interval Data

Open Access
- Author:
- Zhang, Muzi
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 03, 2022
- Committee Members:
- Hui Yang, Outside Unit & Field Member
Zhibiao Zhao, Major Field Member
Matthew Reimherr, Major Field Member
Ephraim Mont Hanks, Program Head/Chair
Lingzhou Xue, Chair of Committee
Dennis Lin, Dissertation Advisor
- Keywords:
- interval data
symbolic data
segment plot
dandelion plot
response surface methodology
visualization
correlation measure
- Abstract:
- Due to the rapid growth of technology, data is being generated faster than ever before. Billions of sensors measure signals from our daily activities and record them as data. As a result of this boom in information, data science has become more popular. In addition to conventional numerical and categorical data, new data types have gained more attention. Symbolic data was introduced to handle data with more complex formats. Compared with the traditional data table, where each cell contains a single-valued data point, symbolic data deals with cases where each cell holds an entry with more internal information, such as a list, histogram, distribution, or interval. Interval data is the quantitative branch of symbolic data. Appearing in a broad range of application fields, interval data has risen in popularity. Unlike single-valued data, interval data has an internal structure that poses a great challenge. In this thesis, we study interval data and how to deal with this internal structure. The study focuses mainly on the relationships between two interval variables, including how to visualize and measure such relationships, and also on the relationships between one interval variable and other single-valued variables. Starting with exploratory data analysis, we propose two visualization methods to help understand the relationships between two interval variables. Analogous to the scatter plot for single-valued data, the rectangle plot and cross plot are the conventional visualization methods, where the horizontal and vertical axes represent the two variables, respectively. However, these methods do not provide sufficient information to assess many complicated relationships. The proposed visualization methods, the Segment and Dandelion Plots, offer much more information than the existing visualization methods and facilitate greater understanding of the relationship between two interval-valued variables. 
A general guide for reading these plots is provided, and relevant theoretical support is developed. Both empirical and real data examples are provided to demonstrate the advantages of the proposed visualization methods. Once a relationship between two interval variables is visualized, quantifying that relationship is the natural next step. Interval correlation measures have been introduced from different perspectives in the literature. Interval data has been viewed as the quantitative branch of symbolic data and as a special case of fuzzy data. Correlation measures have also been introduced from a regression model point of view. Here we propose to view interval data as matrices, so that matrix correlation can be applied. We explore the pros and cons of each point of view and propose new correlation tests based on the proposed visualization methods. In addition to studying the relationship between two interval variables, we extend the topic to the relationship between an interval variable and other single-valued variables. Our main objective is to find an optimal combination of single-valued input variables such that an interval-valued response is optimized. We propose to use Response Surface Methodology (RSM) to solve this problem. We propose using both a bootstrap approach and a desirability approach under different scenarios and demonstrate that each proposed method provides a reliable solution. This is the first time that the RSM approach has been applied to an interval response. The main contributions of this dissertation fall into three parts. First, we provide new and powerful visualization tools for interval data. They are more capable than the conventional methods, including the six pairwise scatter plots and the rectangle and cross plots. Second, we explore correlation measures from four different perspectives and discuss their common challenges and individual advantages and disadvantages. 
Finally, we provide detailed procedures for using Response Surface Methodology with an interval response. This is the first work to use RSM with an interval response.