Creating A Syntactic Document Ontology

Open Access
Han, Hui
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Date of Defense:
August 17, 2004
Committee Members:
  • C Lee Giles, Committee Chair/Co-Chair
  • Hongyuan Zha, Committee Member
  • James Z Wang, Committee Member
  • Jia Li, Committee Member
Keywords:
  • document ontology
  • metadata extraction
  • name disambiguation
Abstract:
An ontology is "a formal explicit specification of a shared conceptualization". Ontologies have been widely studied in recent knowledge representation research, as shown by the increasing number of domain-specific ontologies. With the prevalence of digital libraries, academic documents have become an important part of the information on the web. Document ontologies are increasingly important for the interoperability of heterogeneous digital libraries and for the reuse of knowledge embedded in academic documents. Document ontologies have been constructed from two perspectives: the semantic structure and the syntactic structure of documents. The semantic structure specifies what the document is about, i.e., the content of the document; the syntactic structure refers to document structures such as title, author, affiliation, keywords, and citation links between documents. Three aspects are critical to creating a domain-specific ontology: (1) semi-automatic construction of the ontology, where techniques such as information extraction and data mining can be exploited; (2) maintaining an unambiguous specification of concepts and relationships; and (3) establishing inference rules to allow knowledge reasoning.
This thesis investigates the first two aspects through three types of work. First, we proposed a syntactic document ontology based on the DAML (DARPA Agent Markup Language) ontology library to model academic documents. Second, we developed a Support Vector Machine (SVM)-based classification method for automatically extracting document attributes (metadata) from document headers and bibliographic fields. Our method of metadata extraction from document headers achieved better results than hidden Markov models (HMMs) on the CMU datasets. We also developed a novel method of parsing individual author names from a line that lists multiple authors.
Third, we investigated both supervised and unsupervised learning methods for name disambiguation in author citations. We developed two supervised learning methods, one based on a hierarchical naive Bayes model and the other on Support Vector Machines. We also developed two unsupervised learning methods, one based on a hierarchical naive Bayes mixture model and the other on a K-way spectral clustering method with QR decomposition. We applied these methods to 14 name datasets that we constructed from publication lists collected from authors' homepages and from the DBLP computer science bibliography. The K-way spectral clustering method with QR decomposition achieved the best results compared with the K-means clustering algorithm and the hierarchical naive Bayes mixture model. The method based on the hierarchical naive Bayes model achieved better name disambiguation accuracy than the SVM-based classification method when using only coauthor information. The main reason is that our hierarchical naive Bayes model captures author patterns that are not easily incorporated into the feature vector space model used by the SVM-based classification and K-way spectral clustering methods. These author patterns are the prior probability of an author, the probability that an author writes a paper alone, and the probability that an author writes a future paper with previously unseen coauthors. The SVM-based classification method achieved slightly better results than the method based on the hierarchical naive Bayes model when using paper title words, publication venue title words, or the combination of all types of citation features.
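The three author patterns named above (the prior probability of an author, the probability of writing alone, and the probability of writing with a previously unseen coauthor) can be illustrated with a small sketch. This is a simplified, flat naive Bayes over coauthor lists and not the hierarchical model of the thesis. The class name CoauthorNaiveBayes, the unseen_pool parameter, the smoothing constants, the use of once-seen coauthors to estimate the unseen probability, and the toy citations are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class CoauthorNaiveBayes:
    """Simplified naive Bayes over coauthor lists (illustrative only)."""

    def __init__(self, unseen_pool=1000):
        # Hypothetical number of names an unseen coauthor could be.
        self.unseen_pool = unseen_pool
        self.papers = Counter()                # papers per author identity
        self.solo = Counter()                  # single-author papers per identity
        self.coauthors = defaultdict(Counter)  # coauthor counts per identity

    def fit(self, citations):
        # citations: iterable of (author_identity, [coauthor names])
        for author, names in citations:
            self.papers[author] += 1
            if not names:
                self.solo[author] += 1
            self.coauthors[author].update(names)
        return self

    def log_score(self, author, names):
        n = self.papers[author]
        # Pattern 1: prior probability of the author identity.
        log_p = math.log(n / sum(self.papers.values()))
        # Pattern 2: probability of writing alone (add-one smoothing).
        p_alone = (self.solo[author] + 1) / (n + 2)
        if not names:
            return log_p + math.log(p_alone)
        log_p += math.log(1 - p_alone)
        counts = self.coauthors[author]
        total = sum(counts.values())
        # Pattern 3: probability of an unseen coauthor, estimated here
        # (an assumption) from coauthors seen exactly once in training.
        once = sum(1 for c in counts.values() if c == 1)
        p_unseen = (once + 1) / (total + 2)
        for name in names:
            if counts[name] > 0:
                log_p += math.log((1 - p_unseen) * counts[name] / total)
            else:
                log_p += math.log(p_unseen / self.unseen_pool)
        return log_p

    def predict(self, names):
        # Return the identity with the highest log score.
        return max(self.papers, key=lambda a: self.log_score(a, names))


# Made-up toy citations for two people who share the name "J. Smith".
training = [
    ("J. Smith (databases)", ["A. Gupta", "M. Chen"]),
    ("J. Smith (databases)", ["A. Gupta"]),
    ("J. Smith (databases)", ["M. Chen", "L. Rossi"]),
    ("J. Smith (vision)", []),
    ("J. Smith (vision)", ["K. Tanaka"]),
]
model = CoauthorNaiveBayes().fit(training)
print(model.predict(["A. Gupta"]))  # J. Smith (databases)
print(model.predict([]))            # J. Smith (vision)
```

In this toy data the single-author citation is assigned to the identity with the higher solo-paper rate even though the other identity has the larger prior, which is the kind of pattern that is hard to express as a term-count feature vector.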