OpenEvidence Safety and Comparative Efficacy of Four LLM's in Clinical Practice

Summary

OpenEvidence is an online tool that aggregates and synthesizes data from peer-reviewed medical studies, then uses generative AI to produce a response to a user's question. While a number of clinicians (including residents) use it today, there is little to no published data on whether the tool's outputs are accurate or whether this information appropriately informs clinical decision making. Similarly, a number of clinicians are turning to other large language models (LLMs) to assist in decision making when providing clinical care. While a number of studies have examined the accuracy of these LLMs' responses to medical board questions or clinical vignettes, few to date have examined their performance in a real-world clinical setting, and even fewer have compared performance across tools.

In this study, investigators have two goals:

1. To determine whether the use of the AI tool "OpenEvidence" leads to clinically appropriate decisions when utilized by family medicine, internal medicine, and psychiatry residents in the course of clinical practice.
2. To determine how the output of the OpenEvidence tool compares with three other commonly used, publicly available large language models (OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini) in answering common questions that residents have in the course of clinical practice.

To accomplish study goal #1, investigators have enlisted residents in the above specialties to use the OpenEvidence tool in the course of clinical practice. In order to mitigate any safety risks, the residents will also use a typical reference tool for their question, which is referred to as the "Gold Standard" tool. These tools include PubMed and UpToDate. The residents will:

1. State their clinical question.
2. Query OpenEvidence, capturing their prompt and the OpenEvidence output for data analysis. All residents will undergo training in prompt engineering at the start of the study.
3. State their clinical conclusion based on the OpenEvidence data.
4. Query the Gold Standard tool.
5. State their final clinical conclusion.
6. Answer a question on whether their clinical conclusion was modified by the Gold Standard reference.
7. Answer a question on whether they had any clinical safety concerns on the output from OpenEvidence.

Attending physician Subject Matter Experts (SMEs), matched by specialty and with at least 5 years of post-training clinical experience, will then evaluate the residents' responses. The 5-year threshold was chosen based on the book "Outliers" by Malcolm Gladwell, in which he asserts that 10,000 hours of focused practice is needed to achieve expertise in a field.

SMEs will be asked to evaluate the residents' initial clinical questions and their conclusions based only on OpenEvidence. They will be asked to rate the clinical appropriateness of those conclusions on a scale of 1-10. For questions where the SMEs rate the clinical appropriateness of the residents' conclusions poorly (< 5/10), they will be asked to review the OpenEvidence output and answer an additional question as to whether the output was incorrect or the resident misinterpreted the output from the tool.
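
The follow-up rule described above can be illustrated with a minimal Python sketch. It is not the investigators' analysis code: the `SmeRating` record, its field names, and the helper functions are hypothetical, and the only detail taken from the study description is that a rating below 5 on the 1-10 scale triggers a review of the OpenEvidence output.

```python
from dataclasses import dataclass

# Taken from the study description: ratings below 5 on the 1-10 scale
# trigger a follow-up review of the OpenEvidence output.
FOLLOW_UP_THRESHOLD = 5


@dataclass
class SmeRating:
    """Hypothetical record: one SME rating of a resident's OpenEvidence-only conclusion."""
    question_id: str
    appropriateness: int  # clinical appropriateness on the 1-10 scale

    def __post_init__(self):
        # Reject values outside the 1-10 scale so bad entries fail early.
        if not 1 <= self.appropriateness <= 10:
            raise ValueError("appropriateness must be between 1 and 10")


def needs_output_review(rating: SmeRating) -> bool:
    """Return True when the SME should also review the OpenEvidence output."""
    return rating.appropriateness < FOLLOW_UP_THRESHOLD


def questions_for_review(ratings):
    """Return the question IDs whose ratings fall below the threshold."""
    return [r.question_id for r in ratings if needs_output_review(r)]
```

Under these assumptions, a rating of exactly 5 does not trigger the follow-up question, because the stated cutoff is strictly less than 5.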

To accomplish goal #2, the initial prompt entered by the residents into OpenEvidence will be copied by the research team into ChatGPT, Gemini, and Claude. The outputs from each tool (including OpenEvidence) will be surfaced to SMEs, who will be asked to rate each output based on accuracy, completeness, and bias. Likert scales will be used for these ratings. SMEs will also be asked an open-ended question to identify any patient safety issues from any of the outputs.
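
One way the per-tool ratings might be summarized is sketched below in Python. This is an illustration under stated assumptions, not the study's analysis plan: the tuple layout, the `summarize_ratings` helper, and the choice of the median as the summary statistic are all hypothetical. The study description names the four tools and the three rating dimensions but does not specify the number of Likert points or the statistical method.

```python
from collections import defaultdict
from statistics import median

# Tool and dimension names come from the study description.
TOOLS = ("OpenEvidence", "ChatGPT", "Gemini", "Claude")
DIMENSIONS = ("accuracy", "completeness", "bias")


def summarize_ratings(ratings):
    """Summarize Likert scores per tool and dimension.

    ratings: iterable of (tool, dimension, score) tuples (hypothetical layout).
    Returns {tool: {dimension: median score}}. The median is an assumed
    choice that suits ordinal data; the source does not name a statistic.
    """
    grouped = defaultdict(list)
    for tool, dimension, score in ratings:
        if tool not in TOOLS:
            raise ValueError(f"unknown tool: {tool}")
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        grouped[(tool, dimension)].append(score)

    summary = {}
    for (tool, dimension), scores in grouped.items():
        summary.setdefault(tool, {})[dimension] = median(scores)
    return summary
```

Because the same resident prompt is submitted to every tool, each question contributes one score per tool and dimension from each SME, which keeps the per-tool summaries directly comparable.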

Conditions

AI (Artificial Intelligence)
Large Language Model
Generative Artificial Intelligence

Interventions

OTHER

AI clinical reference tool

Residents will use the OpenEvidence clinical reference tool in the course of routine clinical care. They must also use a Gold Standard clinical reference tool (e.g., PubMed, UpToDate) to mitigate risk.

Sponsors & Collaborators

Cambridge Health Alliance
lead OTHER

Principal Investigators

Hannah K Galvin, MD · Cambridge Health Alliance

Eligibility

Sex: ALL
Healthy Volunteers: Yes

Timeline & Regulatory

Start: 2025-10-01
Primary Completion: 2026-05-30
Completion: 2026-09-30

Countries

United States
