Date of Graduation
Spring 2025
Degree
Master of Science in Computer Science
Department
Computer Science
Committee Chair
Tayo Obafemi-Ajayi
Abstract
Ontology normalization is crucial in biomedical text processing because it maps medical expressions to standardized ontology terms and their corresponding identifiers. This thesis explores the feasibility of using large language models (LLMs) for ontology normalization, with a specific focus on the Human Phenotype Ontology and Gene Ontology. Prior studies indicate that LLMs used in a zero-shot setting tend to exhibit low accuracy and are prone to frequent hallucinations. We propose a retrieval-augmented generation (RAG) approach to address these limitations and improve normalization accuracy. We generated synthetic test sets of ontology-derived synonyms to evaluate normalization performance and developed a validation and classification method based on BioBERT embeddings and cosine similarity. The normalization task was to match each generated synonym back to its original seed term. We used three large language models (GPT-4o, Llama 3.3 70B, and Phi-4) in our experiments to ensure that the results are robust and generalize across models. The results provide insights into the strengths and limitations of each model and emphasize how candidate list retrieval quality, embedding strategies, and the inherent capabilities of the language models influence normalization performance. Across all test cases, RAG consistently outperformed zero-shot learning and maintained stable performance regardless of term frequency in both ontologies. The findings of this thesis also highlight the effectiveness of LLM-based ontology normalization and underscore the potential of RAG to enhance accuracy and robustness in biomedical text processing. These insights contribute to advancing automated biomedical knowledge integration and retrieval, thereby improving interoperability and facilitating data-driven decision-making in healthcare and the life sciences.
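As a rough illustration of the candidate list retrieval step mentioned in the abstract, the sketch below ranks ontology terms by cosine similarity to a query embedding and keeps the top k. It is not the thesis implementation: the three-dimensional vectors are toy stand-ins for BioBERT embeddings, the `EX:` identifiers are placeholders rather than real ontology IDs, and the function names `cosine_similarity` and `top_k_candidates` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_candidates(query_vec, term_vecs, k=2):
    """Return the k (term, score) pairs most similar to the query."""
    scored = [
        (term, cosine_similarity(query_vec, vec))
        for term, vec in term_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for BioBERT embeddings (assumption).
# The "EX:" identifiers are placeholders, not real ontology IDs.
TOY_TERM_VECS = {
    "EX:0001 seizure": [0.9, 0.1, 0.0],
    "EX:0002 headache": [0.1, 0.9, 0.1],
    "EX:0003 ataxia": [0.2, 0.1, 0.9],
}

# Toy embedding of a synonym to normalize, e.g. "convulsion".
query = [0.8, 0.2, 0.1]
candidates = top_k_candidates(query, TOY_TERM_VECS, k=2)
```

In a RAG setup of the kind the abstract describes, a candidate list like `candidates` would be placed in the prompt so that the language model chooses among retrieved terms instead of recalling an identifier from memory.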
Keywords
biomedical concept, large language model, retrieval-augmented generation, zero-shot learning, candidate list, cosine similarity, synthetic data
Subject Categories
Artificial Intelligence and Robotics | Biomedical Informatics
Copyright
© Thanh Son Do
Recommended Citation
Do, Thanh Son, "A Retrieval Augmented Approach to Improving Accuracy of Biomedical Term Normalization by Large Language Models" (2025). Graduate Theses/Dissertations. 4045.
https://bearworks.missouristate.edu/theses/4045
Open Access