Date of Graduation

Spring 2025

Degree

Master of Science in Computer Science

Department

Computer Science

Committee Chair

Tayo Obafemi-Ajayi

Abstract

Ontology normalization is crucial in biomedical text processing because it maps medical expressions to standardized ontology terms and their corresponding identifiers. This thesis explores the feasibility of using large language models (LLMs) for ontology normalization, with a specific focus on the Human Phenotype Ontology and the Gene Ontology. Prior studies indicate that LLMs used in a zero-shot setting tend to exhibit low accuracy and hallucinate frequently. We propose a retrieval-augmented generation (RAG) approach to address these limitations and improve normalization accuracy. To evaluate normalization performance, we generated synthetic test sets of ontology-derived synonyms and developed a validation and classification method based on BioBERT embeddings and cosine similarity. The normalization task was to match each generated synonym back to its original seed term. We ran the experiments with three large language models (GPT-4o, Llama 3.3 70B, and Phi-4) to assess the robustness and generalizability of the results. The results provide insights into the strengths and limitations of each model and show that normalization performance depends on the quality of candidate list retrieval, the embedding strategy, and the inherent capabilities of the language model. Across all test cases, RAG consistently outperformed zero-shot learning and maintained stable performance regardless of term frequency in both ontologies. The findings also highlight the effectiveness of LLM-based ontology normalization and underscore the potential of RAG to enhance accuracy and robustness in biomedical text processing. These insights contribute to advancing automated biomedical knowledge integration and retrieval, thereby improving interoperability and facilitating data-driven decision-making in healthcare and the life sciences.
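As an illustration of the cosine-similarity step mentioned in the abstract, the following is a minimal sketch of ranking ontology terms against a query expression to build a candidate list. It is not the thesis implementation: the three-dimensional vectors stand in for BioBERT embeddings, and the term identifiers, the `retrieve_candidates` helper, and the cutoff `k` are hypothetical names chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def retrieve_candidates(query_vec, term_vecs, k=2):
    """Return the k (term_id, score) pairs most similar to the query."""
    scored = [
        (term_id, cosine_similarity(query_vec, vec))
        for term_id, vec in term_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for embeddings (assumption: real vectors would come from
# an embedding model such as BioBERT; the identifiers are placeholders).
TERM_VECS = {
    "TERM:0001": [1.0, 0.0, 0.0],
    "TERM:0002": [0.0, 1.0, 0.0],
    "TERM:0003": [0.7, 0.7, 0.0],
}

query = [0.9, 0.1, 0.0]
candidates = retrieve_candidates(query, TERM_VECS, k=2)
for term_id, score in candidates:
    print(term_id, round(score, 3))
```

In a RAG setup of the kind the abstract describes, a ranked list like `candidates` would be passed to the language model as context, so that the model chooses among retrieved terms instead of generating an identifier from memory.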

Keywords

biomedical concept, large language model, retrieval-augmented generation, zero-shot learning, candidate list, cosine similarity, synthetic data

Subject Categories

Artificial Intelligence and Robotics | Biomedical Informatics

Copyright

© Thanh Son Do

Open Access
