Evaluating Medical Fine-Tuned Large Language Models in Expert-Level Question Answering // Joyce Jiang

For our Large Language Models: Foundations & Ethics class, we researched the testing of fine-tuned models and domain-specific large language models. This project evaluates how well various open-source NLP models perform within the healthcare domain. The codebase compares each model's outputs to the expert-level answers in ExpertQA using evaluation metrics such as Smooth BLEU, BERTScore, and Cosine Similarity. The primary goal is to assess how well these models can replicate or improve upon expert-level answers to a variety of questions.
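To give a feel for what two of these metrics measure, here is a minimal sketch in plain Python. It is not the project's code: the function names (`tokenize`, `smooth_bleu`, `cosine_similarity`) are hypothetical, the BLEU variant uses simple add-one smoothing, and the cosine similarity is computed on word counts rather than on learned embeddings, which is an assumption made to keep the example dependency-free. BERTScore is left out because it requires a pretrained model.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercased whitespace split; a real pipeline would likely use a proper tokenizer.
    return text.lower().split()


def ngrams(tokens, n):
    # Count every contiguous n-gram in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smooth_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing (illustrative variant)."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often as in the reference.
        overlap = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        # Add-one smoothing keeps a single empty n-gram order from zeroing the score.
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision)


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words count vectors (assumed stand-in for embeddings)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


# Made-up sentences for illustration only; they are not taken from ExpertQA.
reference = "aspirin reduces the risk of heart attack"
close_answer = "aspirin lowers the risk of heart attack"
unrelated_answer = "the patient should rest and drink water"

for answer in (close_answer, unrelated_answer):
    print(answer, smooth_bleu(answer, reference), cosine_similarity(answer, reference))
```

Both scores fall between 0 and 1, and an answer that shares more wording with the reference scores higher on each. Surface-overlap metrics like these cannot tell whether a differently worded answer is still medically correct, which is one reason to pair them with a semantic metric such as BERTScore.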

Check out the GitHub repo here.