Applications of Deep Learning on Temporal and Graph Data Mining
Restricted (Penn State Only)
- Author:
- He, Fang
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 09, 2023
- Committee Members:
- Kamesh Madduri, Major Field Member
Ting He, Major Field Member
Wang-Chien Lee, Chair & Dissertation Advisor
Zhen Lei, Outside Unit & Field Member
Chitaranjan Das, Program Head/Chair
- Keywords:
- Temporal Data Mining
Graph Data Mining
Representation Learning
Deep Learning
- Abstract:
- Temporal data analysis and graph data analysis are two important fields of data mining. Both temporal data (including time series data) and graph data are ubiquitous in the real world and arise in various applications, e.g., time series classification, time series regression, time series clustering, time series anomaly detection, path travel time estimation, path destination prediction, path recommendation in road networks, and link prediction in social networks. Recently, representation learning for time series and graph data has attracted significant research interest. In this thesis, we explore various applications and representation learning frameworks related to temporal and graph data. First, we study the problem of time series classification, which maps an unlabeled time series to its label. We propose relationship features among time points (and time intervals) to address the issue of learning global features for time series classification. We propose a Rel-CNN model with local pattern global relationship blocks to capture both local and global (relationship) features. We further discuss the issue of excessive parameters and propose a hierarchical convolution that reduces the number of model parameters without compromising model performance. Second, we study the problem of citation forecasting in the publication citation network. On the one hand, we highlight the importance of both the retrospective and prospective aspects of publications in the publication citation network for capturing the knowledge flow and a publication’s future impact. We distinguish citation relationships at different temporal distances. On the other hand, we highlight the relevance of the citation event sequences of related publications to the (future) citations of the focal publication. 
Exploiting both ideas, we propose a model that predicts the arrival times of future citations one by one for each focal publication to facilitate future citation forecasting. Third, we study the problem of representation learning of paths, capturing the varied traveler behaviors on a path in its representation. We highlight the importance of learning a distributional representation (instead of a latent vector) as the path representation to improve its ability to capture complicated traveler behavior. We propose regarding each sample point from the distribution as a representative of a possible traveler trace on the path, so that we can generate a set of possible traveler behaviors (i.e., traces) from the distribution. By constraining the generated traveler behaviors to follow the same distribution as the historical traveler behaviors on the same path, we expect the path distribution (which covers all possible traveler behaviors on the path) to be a good representation of the path. Finally, we propose a framework for representation learning of time series. We point out a limitation of existing works: they do not explicitly capture the dynamics of a time series in its representation. To address this issue, we propose a recurrence plot recovery task that challenges the time series representations to recover the time series dynamics explicitly, in order to improve representation quality. In addition, we explore multiple source time series datasets to pre-train the time series encoder and propose a neural-network-based source selector to select beneficial source datasets. For source selection, we address the issue of representing a whole time series dataset by using the parameters of a model pre-trained on that dataset as its representative. The experimental results validate the superiority of the proposed source selector. 
We realize all of the above ideas in a framework for time series representation learning. We conduct extensive experiments to evaluate our proposed approaches. The experimental results demonstrate the effectiveness of our approaches and their superiority over the state of the art in the corresponding research domains.
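
For readers unfamiliar with the recurrence plot named in the abstract, the following is a minimal sketch of the standard thresholded definition for a univariate series: entry (i, j) is 1 when the values at times i and j lie within a tolerance of each other. The abstract does not specify how the dissertation constructs its recurrence plots, so the function name `recurrence_plot`, the tolerance parameter `eps`, the absence of time-delay embedding, and the toy series are all assumptions made for illustration only.

```python
from typing import List


def recurrence_plot(series: List[float], eps: float) -> List[List[int]]:
    """Return the binary recurrence matrix R of a univariate series.

    R[i][j] is 1 when |series[i] - series[j]| <= eps and 0 otherwise.
    This is the textbook thresholded form without time-delay embedding;
    it is an illustrative assumption, not the dissertation's construction.
    """
    n = len(series)
    return [
        [1 if abs(series[i] - series[j]) <= eps else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy series whose values repeat with period 2.
toy = [0.0, 1.0, 0.0, 1.0]
rp = recurrence_plot(toy, eps=0.1)
for row in rp:
    print(row)
```

In this toy case the matrix is symmetric and the repeated values show up as ones on every second diagonal, which is the kind of structure a recovery task could ask a representation to reproduce.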