Relevant Information Extraction and Lexical Simplification of Unstructured Clinical Notes

Open Access
- Author:
- Doppalapudi, Shreyesh
- Graduate Program:
- Data Analytics
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- March 17, 2021
- Committee Members:
- Guanghua Qiu, Thesis Advisor/Co-Advisor
- Youakim Badr, Committee Member
- Partha Mukherjee, Committee Member
- Ashkan Negahban, Committee Member
- Colin Neill, Program Head/Chair
- Keywords:
- Natural Language Processing
- Relevant Text Extraction
- Biomedical Lexical Simplification
- Clinical Notes
- Multi-Label Classification
- Disease Code Classification
- Readability Indices
- MIMIC-III
- Transformer Model
- BERT
- Health Literacy
- Word Embedding
- Abstract:
- Health literacy is essential for a person’s health maintenance and overall well-being. According to the U.S. Department of Education, only 14% of the U.S. adult population scored in the highest literacy proficiency level, 10% in the highest numeracy proficiency level, and 6% in the highest digital skill proficiency level, underscoring the need for clear medical communication if consumers are to understand health information. Inconsistent writing structures, styles, and jargon hamper a consumer’s ability to synthesize healthcare information effectively. In this study, we tackle this problem by proposing methods for transforming relevant information into an understandable format, using clinical notes obtained from the MIMIC-III database. We propose an unsupervised, keyword-matching-based method to extract relevant diagnosis information from long, unstructured clinical notes, using word embeddings to create a similar-word vocabulary. A diagnosis code classification model was used to evaluate the results of the extraction: the multi-label classification model yielded accuracies of 71% and 68% on the top-50 and top-100 4-digit International Classification of Diseases (ICD) codes, while achieving F1 scores of 58.5% and 48.2%, respectively. In addition, we developed a transformer-based lexical simplification model to identify and replace complex words with simpler ones, experimenting with different embeddings and ranking mechanisms. The study employs readability indices to test the simplicity of the output text, together with the machine translation metric BLEU and the text simplification metric SARI to measure the degree of change of the text, as evaluation metrics for the lexical simplification model. The best-performing model achieved scores of 8.75, 5.12, and 6.54 on the Gunning Fog, Flesch-Kincaid, and Coleman-Liau readability indices, respectively. The model also achieved a low degree of change, as evidenced by a BLEU score of 79.96 and a SARI score of 27.68. The performance of both models was close to the state of the art in the healthcare analytics and general language modelling domains. These models provide an opportunity to promote better health literacy, helping both healthcare organizations and consumers.
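
The extraction approach summarized in the abstract can be illustrated with a short sketch. The snippet below is not the thesis code; it is a minimal illustration, assuming gensim word embeddings, of how a small set of seed diagnosis keywords might be expanded into a similar-word vocabulary and then used to keep only the relevant sentences of a clinical note. The names `seed_keywords`, `build_similar_word_vocabulary`, and `extract_relevant_sentences`, as well as the toy sentences, are illustrative placeholders.

```python
# Minimal sketch (not the thesis code) of the similar-word-vocabulary idea:
# train word embeddings on clinical-note sentences, expand seed diagnosis
# keywords with their nearest neighbours in embedding space, and keep only
# sentences containing a term from the expanded vocabulary.
from gensim.models import Word2Vec

# tokenised sentences from (de-identified) clinical notes -- placeholder data
sentences = [
    ["patient", "presents", "with", "acute", "renal", "failure"],
    ["history", "of", "hypertension", "and", "diabetes", "mellitus"],
    ["discharge", "medications", "listed", "below"],
]

# train a small skip-gram model on the note corpus
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

def build_similar_word_vocabulary(seed_keywords, topn=10):
    """Expand seed diagnosis keywords with their nearest embedding neighbours."""
    vocab = set(seed_keywords)
    for word in seed_keywords:
        if word in model.wv:
            vocab.update(w for w, _ in model.wv.most_similar(word, topn=topn))
    return vocab

def extract_relevant_sentences(note_sentences, vocab):
    """Keep only sentences that mention at least one vocabulary term."""
    return [s for s in note_sentences if vocab.intersection(s)]

vocab = build_similar_word_vocabulary(["diabetes", "hypertension", "failure"])
relevant = extract_relevant_sentences(sentences, vocab)
```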
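
The evaluation metrics named in the abstract can likewise be computed with off-the-shelf tools. The sketch below assumes the `textstat` package for the readability indices, NLTK for BLEU, and the Hugging Face `evaluate` package for SARI; the example sentences are made up, and the exact evaluation protocol of the thesis is not reproduced here.

```python
# Illustrative evaluation sketch (not the thesis code): readability indices
# gauge how simple the output is, while BLEU (output vs. source) and SARI
# gauge how much the simplification changed the text.
import textstat
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import evaluate  # Hugging Face evaluate package, provides a SARI implementation

source = "The patient exhibited significant dyspnea upon exertion."        # made-up input
simplified = "The patient had serious trouble breathing during activity."  # made-up output
reference = "The patient had severe shortness of breath with activity."    # made-up reference

# Readability of the simplified output (lower grade level = easier to read).
print("Gunning Fog:    ", textstat.gunning_fog(simplified))
print("Flesch-Kincaid: ", textstat.flesch_kincaid_grade(simplified))
print("Coleman-Liau:   ", textstat.coleman_liau_index(simplified))

# BLEU between source and output: a high score indicates a low degree of change.
bleu = sentence_bleu([source.split()], simplified.split(),
                     smoothing_function=SmoothingFunction().method1)
print("BLEU vs. source:", bleu)

# SARI compares the source, the system output, and one or more references.
sari = evaluate.load("sari")
result = sari.compute(sources=[source], predictions=[simplified],
                      references=[[reference]])
print("SARI:", result["sari"])
```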