Unsupervised Embeddings and LLMs Based Methodology for Hypothesis Generation In Biomechanics
Restricted (Penn State Only)
- Author:
- Mathew, Vinay Saji
- Graduate Program:
- Industrial Engineering
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- March 25, 2024
- Committee Members:
- Soundar Kumara, Thesis Advisor/Co-Advisor
Peter Butler, Committee Member
Steven Landry, Program Head/Chair
- Keywords:
- NLP
Biomechanics
LLMs
Word2Vec
Hypothesis Generation
- Abstract:
- Hypothesis generation is defined as formulating specific, testable predictions or assumptions based on existing knowledge, observations, or theories. Automated hypothesis generation uses computational tools to generate hypotheses or research directions automatically from large datasets or bodies of knowledge. To this end, this study investigates how word embedding models, such as GloVe and Skip-Gram, can be applied to streamline hypothesis generation in biomedical literature. This area is particularly challenging due to the constantly changing terminology used to describe biomedical entities such as proteins and lipids. Our main goal was to take methods from materials science, inspired by a significant discovery reported in Nature in which certain materials' properties were predicted prior to their discovery, and apply them to biomechanics, a field known for its intersections with several disciplines and its complex, evolving nomenclature. We used large language models (LLMs) for two important tasks, namely aggregating relevant literature and Named Entity Recognition (NER). Part of our effort included creating a binary classifier to filter biomedical literature more efficiently, allowing us to build a better dataset for training. Much of our work also went into tuning the models' hyper-parameters and evaluating their performance, especially how well they could recognize and categorize entities based on their functions. Unlike traditional studies that rely on co-occurrence networks, our work adopted weighted network modelling techniques to capture more relevant information. By analyzing query words and adjusting their importance in the network, we saw improvements in the model's ability to identify biological entities based on semantic similarities, even without labelled data. Our findings highlight the power of self-supervised algorithms in organizing biological entities and suggest a new, generalized method for analyzing scientific literature. The results indicate that a collaborative approach could effectively uncover knowledge and relationships hidden within the extensive body of scientific texts. This points to a promising direction for future research in the biomedical field and beyond, using the potential of vast datasets without the need for manual labelling.
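- Illustration (not from the thesis): as a rough sketch of the embedding-based approach the abstract describes, the snippet below trains a skip-gram Word2Vec model with gensim on a few tokenized abstracts and ranks vocabulary terms by semantic similarity to a query entity. The corpus, the query term "actin", and all hyper-parameter values are hypothetical placeholders, not the thesis's actual data or settings.

```python
# Minimal sketch (assumed, not the thesis implementation): train a skip-gram
# Word2Vec model on tokenized biomedical abstracts and query the embedding
# space for terms semantically close to an entity of interest.
from gensim.models import Word2Vec

# Hypothetical pre-tokenized, lower-cased abstracts; in practice these would
# come from the filtered biomedical corpus described in the abstract.
corpus = [
    ["shear", "stress", "modulates", "endothelial", "nitric", "oxide", "synthase"],
    ["actin", "cytoskeleton", "remodelling", "under", "cyclic", "strain"],
    ["lipid", "bilayer", "mechanics", "and", "membrane", "protein", "function"],
]

# Illustrative hyper-parameters only; the thesis tunes these separately.
model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimensionality
    window=5,          # context window size
    min_count=1,       # keep rare terms in this toy corpus
    sg=1,              # 1 = skip-gram (0 would be CBOW)
    workers=4,
    epochs=50,
)

# Rank vocabulary terms by cosine similarity to a query entity, mimicking the
# semantic-similarity queries used to surface candidate relationships.
for term, score in model.wv.most_similar("actin", topn=5):
    print(f"{term}\t{score:.3f}")
```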