Multimodal Approaches Towards Computer-Aided Oral-Facial Diagnoses of Cerebrovascular Accidents

Restricted (Penn State Only)
- Author:
- Cai, Tongan
- Graduate Program:
- Informatics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- October 25, 2024
- Committee Members:
- Dongwon Lee, Professor in Charge/Director of Graduate Studies
- James Wang, Chair & Dissertation Advisor
- Sharon Huang, Major Field Member
- Jia Li, Outside Unit & Field Member
- Justin Silverman, Major Field Member
- Keywords:
- Artificial Intelligence
- Deep Learning
- Multimodal
- Stroke
- Computer-Aided Diagnosis
- Abstract:
- Multimodal deep learning stands as one of the cutting-edge fields in artificial intelligence (AI) and machine learning (ML), with recent research demonstrating tremendous success in integrating diverse data modalities, including text, images, audio, and videos. The AI/ML-based detection and evaluation of neurological diseases that cause severe, life-threatening conditions is of significant clinical value due to its accuracy improvements, operational cost reduction, and deployment flexibility, and has attracted recent interest from the multimodal research community. While advancements in computational power have accelerated the development of novel computer vision (CV) methods for medical imaging data and natural language processing (NLP) methods for electronic health records (EHR), the possibility of phenotyping neurological diseases from a patient's face and oral speech, the very cues that physicians typically leverage in clinical triage protocols, remains less explored. The domain suffers from long-standing challenges, including the scarcity of publicly available patient data for modeling, the complexity of real-world noise in deployment environments, and the subtlety of spatiotemporal oral-facial features, all of which limit the feasibility of such a multimodal approach. This dissertation aims to bridge this research gap and leverage advanced multimodal AI approaches in the clinical triage of neurological diseases from the patient's face and oral speech. Specifically, we focus on a complete AI framework for Cerebrovascular Accidents (CVA) in emergency room (ER) environments that is easily transferable to other conditions. The proposed framework contains three major components developed from the latest ideas in pattern recognition and deep learning. First, a multimedia patient pre-filtering approach is introduced that aims to distinguish patients with significant CVA from healthy controls and directly route them to interventions. The pre-filtering is based on rule-based facial video asymmetry analysis and text embedding-based speech deficit evaluation. Second, the evolution of advanced multimodal frameworks for the triage of mild to moderate acute strokes is presented. The latest multimodal sequence representation integrates temporal features through multi-scale information fusion and state-space temporal modeling, together with a powerful audio transformer, to tackle the challenge of feature subtlety. Fairness-aware pre-training and adversarial identity disentanglement are integrated for stronger generalizability. Third, a multimodal framework is further proposed for privacy-preserving synthetic patient generation, in which a patient's facial characteristics and speech are re-targeted to a realistic avatar with a synthetic identity. The generated synthetic videos can be shared within the domain to alleviate data scarcity. The dissertation concludes with a discussion of anticipated future research and the potential generalization of the framework to other neurological diseases.
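
To make the first component concrete, the rule-based facial video asymmetry analysis can be pictured as mirroring landmarks from one side of the face across the facial midline, comparing them with their counterparts on the other side, and thresholding the resulting score. The sketch below is a minimal illustration under that assumption; the 68-point landmark layout, the chosen landmark pairs, and the 0.08 threshold are hypothetical placeholders, not the dissertation's actual rules.

```python
# Illustrative sketch only: a toy rule-based facial asymmetry score.
# Assumes per-frame 2D landmarks from an external detector in a 68-point
# layout; the landmark pairing and the 0.08 threshold are hypothetical.
import numpy as np

# Hypothetical left/right landmark index pairs (mouth corners, eye corners, brows).
MIRROR_PAIRS = [(48, 54), (36, 45), (17, 26)]

def asymmetry_score(landmarks: np.ndarray) -> float:
    """landmarks: (T, 68, 2) array of per-frame (x, y) coordinates."""
    # Use the nose bridge (index 27) as a proxy for the facial midline.
    midline_x = landmarks[:, 27, 0]                                   # (T,)
    # Normalize by face width so the score is scale-invariant.
    scale = landmarks[:, :, 0].max(axis=1) - landmarks[:, :, 0].min(axis=1)
    scale = np.maximum(scale, 1e-6)
    scores = []
    for left, right in MIRROR_PAIRS:
        # Reflect the left landmark across the midline, compare with the right one.
        mirrored_x = 2.0 * midline_x - landmarks[:, left, 0]
        dx = np.abs(mirrored_x - landmarks[:, right, 0]) / scale
        dy = np.abs(landmarks[:, left, 1] - landmarks[:, right, 1]) / scale
        scores.append(dx + dy)
    # Average over frames and landmark pairs.
    return float(np.mean(scores))

def flag_significant_asymmetry(landmarks: np.ndarray, threshold: float = 0.08) -> bool:
    """Flag a video clip whose average mirrored-landmark discrepancy exceeds the threshold."""
    return asymmetry_score(landmarks) > threshold
```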
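
Similarly, the multimodal sequence representation of the second component can be pictured as a state-space temporal model scanning per-frame facial features, fused with a pooled embedding from a pretrained audio transformer before classification. The PyTorch sketch below is an assumed, minimal illustration of that pattern; the diagonal recurrence, feature dimensions, and module names are placeholders rather than the dissertation's architecture, and the multi-scale fusion and fairness-aware components are omitted.

```python
# Illustrative sketch only: late fusion of a diagonal state-space recurrence
# over per-frame facial features with a pooled audio-transformer embedding.
import torch
import torch.nn as nn

class DiagonalSSM(nn.Module):
    """Scans a sequence with a learned diagonal linear state-space update:
    h_t = a * h_{t-1} + b * x_t, with readout y_t = c * h_t."""
    def __init__(self, dim: int):
        super().__init__()
        self.a_logit = nn.Parameter(torch.zeros(dim))  # decay, kept in (0, 1) via sigmoid
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        a = torch.sigmoid(self.a_logit)
        h = torch.zeros(x.size(0), x.size(2), device=x.device)
        outputs = []
        for t in range(x.size(1)):
            h = a * h + self.b * x[:, t]
            outputs.append(self.c * h)
        return torch.stack(outputs, dim=1)               # (B, T, D)

class FaceSpeechFusion(nn.Module):
    """Toy two-stream classifier: facial feature sequence + audio embedding."""
    def __init__(self, face_dim=256, audio_dim=768, hidden=128, n_classes=2):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, hidden)
        self.ssm = DiagonalSSM(hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, face_seq, audio_emb):
        # face_seq: (B, T, face_dim) per-frame facial features
        # audio_emb: (B, audio_dim) pooled embedding from a pretrained audio transformer
        face_states = self.ssm(self.face_proj(face_seq)).mean(dim=1)  # temporal pooling
        fused = torch.cat([face_states, self.audio_proj(audio_emb)], dim=-1)
        return self.classifier(fused)

# Usage with random tensors standing in for real features:
model = FaceSpeechFusion()
logits = model(torch.randn(4, 90, 256), torch.randn(4, 768))  # -> (4, 2)
```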