Medical Chat Performance Evaluation

Evaluation Date: January 24th, 2024

USMLE Sample Exam

The Medical Chat model demonstrates exceptional performance, achieving an accuracy rate of 98.1% (637/649) on the United States Medical Licensing Examination (USMLE, https://www.usmle.org/) sample exam.
The answers produced by the Medical Chat model are available in the following table:
To the best of our knowledge, this is the highest accuracy reported for a question-answering system on the USMLE sample exam. The accompanying graph shows how Medical Chat compares with other publicly available models.
The 2022 USMLE sample exam served as the initial benchmark for evaluating the medical question-answering proficiency of ChatGPT [1]. Performance figures for the other systems, namely OpenEvidence [2], GPT-4 [3], and Claude 2 [4], were taken from their respective publications and reports.

MedQA US Sample Exam

MedQA is a benchmark similar to the USMLE sample exam, built from a dataset curated from professional medical board examinations. It consists of multiple-choice questions covering subjects such as Internal Medicine, Pediatrics, Psychiatry, and Surgery. Medical Chat was evaluated on MedQA's 4-option English test set, which contains 1,273 questions.
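For readers who want to inspect the benchmark themselves, the sketch below shows one way the 4-option English test set might be loaded and counted. The file path and the field names ("question", "options", "answer_idx") are assumptions based on the commonly distributed MedQA JSONL schema, not part of our evaluation code.

```python
import json

# Assumed path to the MedQA 4-option English (US) test split; adjust to your copy.
TEST_FILE = "data/medqa_us_test_4_options.jsonl"  # hypothetical file name

questions = []
with open(TEST_FILE, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Field names follow the publicly distributed MedQA JSONL schema
        # and may differ in other releases of the dataset.
        questions.append({
            "question": record["question"],
            "options": record["options"],     # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
            "answer": record["answer_idx"],   # e.g. "C"
        })

print(f"Loaded {len(questions)} questions")   # expected: 1273 for the US test set
```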
Medical Chat also achieved the highest performance here, with an accuracy rate of 97.8%. This result places Medical Chat in first position on the Official Leaderboard, ahead of Google's Med-PaLM 2 and Google's Flan-PaLM (67.6%). The MedQA evaluation indicates that Medical Chat is the most accurate medical question-answering system available for public use.
The answers produced by the Medical Chat model are available in the following table:
Question Set                         | Medical Chat Answer | Correctness
MedQA Correctness Check - US.jsonl   | Medical Chat Answer | 1245/1273 (97.8%)
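To illustrate how a correctness figure such as 1245/1273 (97.8%) can be tallied from a correctness-check file like the one linked above, a minimal sketch follows. The file name and the per-record "correct" flag are assumptions made for the example; the actual file layout is defined in the repository.

```python
import json

# Hypothetical correctness-check file; each line is assumed to hold the model's
# answer plus a boolean flag indicating whether it matched the reference answer.
RESULTS_FILE = "MedQA Correctness Check - US.jsonl"

total = 0
correct = 0
with open(RESULTS_FILE, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        total += 1
        if record.get("correct"):          # assumed field name
            correct += 1

print(f"{correct}/{total} ({correct / total:.1%})")   # e.g. 1245/1273 (97.8%)
```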

Open-Source Code for Medical Chat Model Evaluation

Our evaluation is run through automated API calls to the Medical Chat model using the Chat Data API infrastructure. The source code is openly available in our GitHub repository, so anyone can clone the repository and replicate the evaluation procedure. After the Medical Chat model generates its responses, they are manually compared against the correct answers to determine the accuracy percentage.
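A minimal sketch of how such an automated run might look is shown below. The endpoint URL, authorization header, and request/response fields are illustrative assumptions; the actual Chat Data API parameters and the full evaluation script live in the GitHub repository referenced above.

```python
import json
import requests

# Illustrative values only: the real Chat Data API endpoint and payload shape
# are defined in the open-source evaluation repository, not here.
API_URL = "https://api.chat-data.example/chat"   # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                         # hypothetical credential

def ask_medical_chat(question: str) -> str:
    """Send one benchmark question to the Medical Chat model and return its reply."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"messages": [{"role": "user", "content": question}]},  # assumed schema
        timeout=60,
    )
    response.raise_for_status()
    return response.json().get("text", "")        # assumed response field

if __name__ == "__main__":
    # Run the benchmark questions and save the raw answers for the
    # manual correctness comparison described above.
    questions = ["A 24-year-old woman presents with ..."]  # placeholder question
    with open("medical_chat_answers.jsonl", "w", encoding="utf-8") as f:
        for q in questions:
            row = {"question": q, "answer": ask_medical_chat(q)}
            f.write(json.dumps(row) + "\n")
```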