Graduate Theses/Dissertations

A Retrieval Augmented Approach to Improving Accuracy of Biomedical Term Normalization by Large Language Models

Thanh Son Do, Missouri State University

Date of Graduation

Spring 2025

Degree

Master of Science in Computer Science

Department

Computer Science

Committee Chair

Tayo Obafemi-Ajayi

Abstract

Ontology normalization is crucial in biomedical text processing because it maps medical expressions to standardized ontology terms and their corresponding identifiers. This thesis explores the feasibility of using large language models (LLMs) for ontology normalization, with a specific focus on the Human Phenotype Ontology and Gene Ontology. Prior studies indicate that LLMs used in a zero-shot setting tend to have low accuracy and hallucinate frequently. We propose a retrieval-augmented generation (RAG) approach to address these limitations and improve normalization accuracy. We generated synthetic test sets of ontology-derived synonyms to evaluate normalization performance and developed a validation and classification method based on BioBERT embeddings and cosine similarity. The normalization task was to match each generated synonym back to its original seed term. We used three large language models (GPT-4o, Llama 3.3 70B, and Phi-4) in our experiments to assess the robustness and generalizability of the results. The results provide insights into the strengths and limitations of each model and emphasize the importance of candidate list retrieval quality, embedding strategies, and the inherent capabilities of the language models in determining normalization performance. Across all test cases, RAG consistently outperformed zero-shot learning and maintained stable performance regardless of term frequency in both ontologies. The findings of this thesis also highlight the effectiveness of LLM-based ontology normalization and underscore the potential of RAG to enhance accuracy and robustness in biomedical text processing. These insights contribute to advancing automated biomedical knowledge integration and retrieval, thereby improving interoperability and facilitating data-driven decision-making in healthcare and the life sciences.
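The candidate list retrieval step mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract names BioBERT embeddings, whereas the `embed` function below is a toy stand-in based on character trigram counts so that the example runs without any model download. The term identifiers (`EX:0001`, ...) and the function names `embed`, `cosine`, and `top_candidates` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Illustrative mini-ontology. The identifiers are placeholders, not real
# Human Phenotype Ontology or Gene Ontology IDs.
ONTOLOGY = {
    "EX:0001": "Seizure",
    "EX:0002": "Microcephaly",
    "EX:0003": "Atrial septal defect",
    "EX:0004": "Global developmental delay",
}


def embed(text: str) -> Counter:
    """Toy embedding (assumption): sparse vector of character trigram counts.

    A real pipeline would replace this with a dense vector from a
    biomedical language model such as BioBERT.
    """
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_candidates(query: str, ontology: dict, k: int = 3) -> list:
    """Return the k ontology terms most similar to the query.

    Each result is a (score, term_id, label) tuple, best match first.
    In a RAG setup, a list like this would be placed in the LLM prompt
    so the model selects among real terms instead of generating an ID.
    """
    query_vec = embed(query)
    scored = [
        (cosine(query_vec, embed(label)), term_id, label)
        for term_id, label in ontology.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for score, term_id, label in top_candidates("seizures", ONTOLOGY, k=2):
        print(f"{term_id}\t{label}\t{score:.3f}")
```

Restricting the model to a short, retrieved candidate list is the design choice that makes retrieval quality matter: if the correct term is not among the top `k` candidates, the language model has no opportunity to select it.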

Keywords

biomedical concept, large language model, retrieval-augmented generation, zero-shot learning, candidate list, cosine similarity, synthetic data

Subject Categories

Artificial Intelligence and Robotics | Biomedical Informatics

Copyright

Recommended Citation

Do, Thanh Son, "A Retrieval Augmented Approach to Improving Accuracy of Biomedical Term Normalization by Large Language Models" (2025). Graduate Theses/Dissertations. 4045.
https://bearworks.missouristate.edu/theses/4045

Open Access

Included in

Artificial Intelligence and Robotics Commons, Biomedical Informatics Commons
