Subject Category Classification of Scholarly Papers Using Deep Attentive Neural Networks

Open Access
- Author:
- Kandimalla, Bharath
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- March 23, 2020
- Committee Members:
- Clyde Lee Giles, Thesis Advisor/Co-Advisor
Daniel Kifer, Committee Member
Chitaranjan Das, Program Head/Chair
- Keywords:
- text classification
scholarly big data
text mining
attention model
deep neural networks
word embedding
web of science
- Abstract:
- The subject categories often listed at the beginning of a scholarly paper refer to the knowledge domain(s) of the article's content. Search engines can leverage these subject categories to build faceted search features, helping users narrow down search results and significantly improving search precision by boosting the ranks of highly relevant documents. Unfortunately, many academic papers, including conference proceedings and journal articles, do not have this information as part of their metadata. Existing clustering methods for subject classification are usually based on a citation network, which is not always available. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. In contrast with existing work on text classification, our task is challenging because the training data is highly imbalanced, the number of classes is relatively high, and closely related categories (e.g., Biology and Zoology) may overlap. We adopt the Web of Science schema, which includes 104 subject categories. We explore state-of-the-art neural network architectures and strategies to overcome imbalanced samples and overlapping categories. Our method is advantageous compared to previously published unsupervised clustering-based methods for publication-level classification because it uses only article metadata rather than citation relations, increasing its portability and robustness. Our best model achieves micro-F1 measures ranging from 0.50 to 0.95. We apply this model to classify a random sample of 1 million paper abstracts in CiteSeerX.
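
The sketch below is a minimal, hypothetical illustration of an attention-over-word-embeddings classifier for paper abstracts, not the thesis's exact DANN architecture; the vocabulary size, embedding dimension, encoder choice, and sequence length are illustrative assumptions, while the 104 output classes follow the Web of Science schema mentioned in the abstract.

```python
# Minimal sketch of an attentive abstract classifier (assumed architecture,
# not the thesis's DANN). Token IDs come from an assumed tokenizer/vocabulary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveAbstractClassifier(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=300,
                 hidden_dim=128, num_classes=104):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.GRU(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)        # scores each token
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        embedded = self.embedding(token_ids)            # (batch, seq_len, embed_dim)
        states, _ = self.encoder(embedded)              # (batch, seq_len, 2*hidden_dim)
        weights = F.softmax(self.attn(states), dim=1)   # attention over tokens
        context = (weights * states).sum(dim=1)         # weighted summary of abstract
        return self.classifier(context)                 # logits over subject categories

# Class imbalance can be mitigated with per-class loss weights, one of
# several possible strategies; the weights here are placeholders.
model = AttentiveAbstractClassifier()
logits = model(torch.randint(1, 50000, (8, 200)))       # batch of 8 abstracts, 200 tokens
loss_fn = nn.CrossEntropyLoss(weight=torch.ones(104))   # e.g., inverse class frequencies
loss = loss_fn(logits, torch.randint(0, 104, (8,)))
```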