Medical Chat Performance Evaluation
Evaluation Date: January 24th, 2024
USMLE Sample Exam
The Medical Chat model achieves an accuracy of 98.1% (637/649) on the United States Medical Licensing Examination (USMLE, https://www.usmle.org/) sample exam.
The answers produced by the Medical Chat model are available in this table:
To our knowledge, this is the highest performance reported among question-answering systems evaluated on the USMLE sample exam. The accompanying graph shows how Medical Chat compares with other publicly available models.
The 2022 USMLE sample exam was first used as a benchmark for evaluating the medical question-answering ability of ChatGPT [1]. Performance figures for the other systems, namely OpenEvidence [2], GPT-4 [3], and Claude 2 [4], were taken from their respective publications and reports.
MedQA US Samples Exam
MedQA is a benchmark similar to the USMLE sample exam, built from questions drawn from various medical board examinations. It consists of multiple-choice questions that assess proficiency in subjects such as Internal Medicine, Pediatrics, Psychiatry, and Surgery. Medical Chat was evaluated on MedQA's 4-option English test set, which contains 1,273 questions.
Medical Chat also achieved the highest performance on this benchmark, with an accuracy of 97.8%. This places Medical Chat first on the Official Leaderboard, ahead of Google's Med-PaLM 2 and Google's Flan-PaLM (67.6%). The MedQA results indicate that Medical Chat is the most accurate medical question-answering system available for public use.
The answers produced by the Medical Chat model on MedQA are available in this table:
Open-Source Code for Medical Chat Model Evaluation
The evaluation is run through automated API calls to the Medical Chat model using the Chat Data API infrastructure. The source code is openly available in our GitHub repository, so anyone can clone it and replicate the evaluation procedure. After the model generates its responses, they are manually compared against the correct answers to determine the accuracy percentage.
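The final scoring step amounts to counting matching answers and dividing by the number of questions. The comparison described here is done manually, so the following is only a minimal sketch of how the same arithmetic could be automated. The function name `score_answers`, the question IDs, and the single-letter answer format are all assumptions for illustration and are not taken from the repository.

```python
def score_answers(predicted, answer_key):
    """Count correct multiple-choice answers and compute accuracy.

    Both arguments are dicts mapping a question ID to an option letter.
    This layout is hypothetical, not the format used by the actual evaluation.
    """
    if not answer_key:
        raise ValueError("answer_key must not be empty")

    correct = 0
    for question_id, gold in answer_key.items():
        # A missing prediction counts as wrong; ignore case and stray whitespace.
        guess = predicted.get(question_id, "")
        if guess.strip().upper() == gold.strip().upper():
            correct += 1

    total = len(answer_key)
    return correct, total, correct / total


# Toy data only; these are not real exam questions or model outputs.
answer_key = {"q1": "A", "q2": "B", "q3": "B", "q4": "D"}
predicted = {"q1": "a", "q2": "C", "q3": "B "}

correct, total, accuracy = score_answers(predicted, answer_key)
print(f"{correct}/{total} correct, accuracy {accuracy:.1%}")
```

Unanswered questions are scored as incorrect in this sketch, which keeps the denominator equal to the full question count, as in the 637/649 figure reported for the USMLE sample exam.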