Automatic Summarization and Slide Generation for Scientific Papers

Open Access
- Author:
- Sefid, Athar
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- October 21, 2021
- Committee Members:
- Chris McComb, Outside Unit & Field Member
- C Lee Giles, Chair & Dissertation Advisor
- Jesse Barlow, Major Field Member
- Chitaranjan Das, Program Head/Chair
- Rui Zhang, Major Field Member
- Keywords:
- Automatic Summarization
- Automatic Slide Generation
- Text Mining
- Machine Learning
- Natural Language Processing
- Abstract:
- The growing number of scholarly documents available online poses a substantial challenge for fast retrieval of related research. Consequently, there has been research on improving the search, retrieval, and summarization of scholarly papers. Most existing summarization models are designed for news articles, where a large amount of training data is available by treating the headline of each article as its summary. Scholarly papers are different: only domain experts can write a summary of an article. To overcome the shortage of training data for scientific article summarization, we devised a crawler to collect scientific papers and their corresponding presentation slides from conference websites. Presentation slides can guide automatic summarizers in selecting salient sentences or important phrases. Traditional machine learning models identify important sentences based on a limited number of features extracted from the position and structure of sentences in the paper. Our methods extend previous work by (1) extracting a more comprehensive list of surface features and also considering (2) the semantics of each sentence, as captured by recurrent neural networks or attention mechanisms, and (3) the context around the current sentence when ranking and selecting sentences. We employed several state-of-the-art neural extractive and abstractive summarizers to build document and sentence embeddings. Extractive methods select important parts of the input document as the summary, while abstractive methods try to comprehend the input document and then summarize it using novel words that are not limited to the vocabulary of the source. We extended our work to generate presentation slides for scientific papers. These slides can serve authors as a starting point for the slide generation process.
Our slides have two layers of bullet points: first-level bullets are frequent main phrases in the paper, and second-level bullets are important sentences.
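
The two-layer slide layout described in the abstract can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: frequent word bigrams stand in for the "frequent main phrases", and an average word-frequency score stands in for the neural sentence ranker. The names `build_slide`, `content_words`, `bigrams`, and `STOPWORDS` are hypothetical and chosen only for this illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the source).
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is",
             "of", "on", "the", "to", "we", "with"}


def split_sentences(text):
    """Split text on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]


def bigrams(sentence):
    """Adjacent word pairs in which neither word is a stopword."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [(a, b) for a, b in zip(words, words[1:])
            if a not in STOPWORDS and b not in STOPWORDS]


def build_slide(text, top_phrases=2, sentences_per_phrase=2):
    """Map each frequent phrase (first-level bullet) to the highest-scoring
    sentences that contain it (second-level bullets)."""
    sentences = split_sentences(text)
    phrase_counts = Counter(p for s in sentences for p in bigrams(s))
    word_counts = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        # Stand-in for a learned ranker: mean corpus frequency of the words.
        words = content_words(sentence)
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    slide = {}
    for phrase, _ in phrase_counts.most_common(top_phrases):
        matching = [s for s in sentences if phrase in bigrams(s)]
        matching.sort(key=score, reverse=True)
        slide[" ".join(phrase)] = matching[:sentences_per_phrase]
    return slide


if __name__ == "__main__":
    demo = ("Neural summarizers rank sentences. "
            "Neural summarizers use attention. "
            "Slides list key phrases.")
    for phrase, bullets in build_slide(demo, top_phrases=1).items():
        print("-", phrase)
        for sentence in bullets:
            print("    -", sentence)
```

In a real system, the scoring function would be replaced by the trained extractive model, and the phrase extractor by a proper keyphrase method; the dictionary-of-lists output is just one convenient way to represent two levels of bullets.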