Improving Usability of Noisy and Unstructured Biomedical Imaging and Healthcare Text Data Modalities
Restricted (Penn State Only)
- Author:
- Subbakrishna Adishesha, Amogh
- Graduate Program:
- Informatics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 09, 2023
- Committee Members:
- Jeffrey Bardzell, Program Head/Chair
James Wang, Major Field Member
Sharon Huang, Chair & Dissertation Advisor
Prasenjit Mitra, Major Field Member
Keith Cheng, Outside Unit & Field Member
Patrick La Riviere, Special Member
- Keywords:
- Data Usability
Noise Removal
CT imaging
Histotomography
Micro CT
Angular Upsampling
Healthcare Text data
Healthcare topic modeling
Language Models
- Abstract:
- Data-driven research and decision making have taken center stage with the availability of large-scale data acquisition strategies, and this is evident in the pursuit of understanding complex zoological anatomies, mutations and even social interactions of users within large healthcare communities. Such ambitious endeavors are hindered when the data are unreliable, primarily due to the presence of noise, artifacts or poor structure. We focus our studies on two modalities, namely Computed Tomography (CT) and healthcare text data. While the former is vulnerable to a gamut of statistical and procedural noise forms, the latter often lacks a usable structure for feature and content extraction. Our motivation to develop data-centric machine learning pipelines to address these concerns in healthcare data arises from the urgency and impact of the downstream research tasks that the data entail, including automated cancer diagnosis, phenotype identification, medical article recommendation, personal healthcare management and many other vital research questions within the realm of healthcare informatics. For our first modality, we utilize micro-CT data acquired for larval zebrafish to understand the fundamental composition of noise and the failure of existing deep-learning models to effectively remove noise in both projection and reconstruction domain images. We present a hierarchical structure-based model with novel loss modifications made specifically for the Poisson-Gaussian noise mixture present in CT and, through it, illustrate the improvement in the quality of the scan. For an unsupervised setting, we emulate the image-stacking denoising technique used in astrophotography and design a training strategy around it.
We improve upon the hierarchical design and propose an advanced vision transformer-based network, "Dense Residual Hierarchical Transformer," with a noise-aware loss function to pay attention to specific regions of the image and, through this, perform angular upsampling artifact removal to improve scan acquisition speed and efficiency. In our final study, we combine the insights from the previous experiments into a unified and modular architecture titled "Pretrained CT Transformer" and apply it to address a variety of noise and artifacts in CT imaging. For the healthcare text modality, we acquire unstructured conversational text from online health communities (HealthUnlocked and Facebook) to recognize time-sensitive informational needs of users and formulate the problem as a text-based topic prediction task. The predicted topics provide clean and structured abstractions of user interests and can then be used for information recommendation. We present the challenges that prevent current models from predicting accurate topics of interest for users. In this direction, we argue for the need to move towards generative models over erroneous classification-based topic prediction models. As a solution, we propose a versatile transformer language model for processing such unstructured text through a sentence generation task and additionally leverage user disease timelines to match similar users and improve topic prediction accuracy. Ultimately, we accurately forecast time-sensitive topics of interest for users within online health communities. We conduct detailed experiments to evaluate the value of information and its impact on the structure and accuracy of the predicted topics. We have designed a detailed survey to evaluate the relevance, accuracy and trustworthiness of our recommendations through clinician validation.
We conclude by emphasizing the need for data-specific models for improving data usability and illustrate the impact of our contributions through efficiency gains and resource conservation metrics. Beyond what we introduce in this dissertation, we list potential avenues for improvement through novel deep learning paradigms like diffusion models and large language models, as well as expert-in-the-loop validation techniques.