CS 6804 Final Project: Evaluating Model Reasoning and Hallucinations in Medical LLMs

This research project studies the challenges posed by hallucinations in medical large language models (LLMs). We evaluate the reasoning abilities of popular open-source medical LLMs on Med-HALT (Medical Domain Hallucination Test), a benchmark and dataset designed specifically to evaluate hallucinations in the medical domain. The benchmark provides a diverse multinational dataset derived from medical examinations across various countries and includes multiple innovative testing modalities.

Dataset

The datasets are hosted on Hugging Face Datasets.
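
As an illustration, below is a minimal sketch of loading the benchmark with the Hugging Face datasets library; the dataset identifier and configuration name are assumptions, so check the hosted dataset card for the exact values.

from datasets import load_dataset

# Hypothetical identifier and config for a Med-HALT reasoning subset;
# replace with the actual values from the Hugging Face dataset card.
med_halt = load_dataset("openlifescienceai/Med-HALT", "reasoning_FCT", split="train")
print(med_halt[0])  # inspect one example (question, options, correct answer, ...)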

Inference Instructions

python bio_llm.py <test_name> <batch> <model_name>
  • test_name selects the test to run: FCT, Nota, or fake
  • model_name selects the model to run inference with: asclepius, alpacare, biomistral, or pmc_llama
  • batch is the batch size; we used batch sizes of 4, 8, and 12 (see the example invocation below)
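
For example, to run the FCT test with biomistral and a batch size of 8 (following the <test_name> <batch> <model_name> order above):

python bio_llm.py FCT 8 biomistral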

Evaluation Instructions

Find the evaluation notebooks here: https://github.com/abhilash-neog/FactCheckingBioLLMs/tree/Evaluation_Codes

Acknowledgements

We sincerely thank the authors of the following open-source projects:
