OpenEvidence Safety and Comparative Efficacy of Four LLM's in Clinical Practice
NCT07199231 · Status: ENROLLING_BY_INVITATION · Type: OBSERVATIONAL · Enrollment: 20
Last updated 2026-05-22
Summary
OpenEvidence is an online tool that aggregates and synthesizes data from peer-reviewed medical studies, then uses generative AI to produce a response to a user's question. While it is in use by a number of clinicians (including residents) today, there is little to no published data on whether the tool's outputs are accurate and whether this information appropriately informs clinical decision making. Similarly, a number of clinicians are turning to other large language models (LLMs) to assist in decision making when providing clinical care. While a number of published studies have examined the accuracy of these LLMs' responses to medical board questions or clinical vignettes, few studies to date have examined their performance in a real-world clinical setting, and even fewer have compared performance across models.
In this study, investigators have two goals:
1. To determine whether the use of the AI tool "OpenEvidence" leads to clinically appropriate decisions when utilized by family medicine, internal medicine, and psychiatry residents in the course of clinical practice.
2. To determine how the output of the OpenEvidence tool compares with three other commonly-used, publicly-available large language models (OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini) in answering common questions that residents have in the course of clinical practice.
To accomplish study goal #1, investigators have enlisted residents in the above specialties to use the OpenEvidence tool in the course of clinical practice. In order to mitigate any safety risks, the residents will also use a typical reference tool for their question, which is referred to as the "Gold Standard" tool. These tools include PubMed and UpToDate. The residents will:
1. State their clinical question.
2. Query OpenEvidence, capturing their prompt and the OpenEvidence output for data analysis. All residents will undergo training in prompt engineering at the start of the study.
3. State their clinical conclusion based on the OpenEvidence data.
4. Query the Gold Standard Resource.
5. State their final clinical conclusion.
6. Answer a question on whether their clinical conclusion was modified by the Gold Standard reference.
7. Answer a question on whether they had any clinical safety concerns about the output from OpenEvidence.
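The seven steps above could be captured as one structured record per clinical question. The following is a minimal sketch only: the class name, field names, and the completeness check are hypothetical and are not taken from the study's actual data collection instrument.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ResidentQueryRecord:
    """One clinical question as it moves through the seven steps (hypothetical schema)."""
    clinical_question: str        # step 1
    openevidence_prompt: str      # step 2: prompt entered into OpenEvidence
    openevidence_output: str      # step 2: captured tool output
    initial_conclusion: str       # step 3: conclusion based on OpenEvidence only
    gold_standard_resource: str   # step 4: e.g. "PubMed" or "UpToDate"
    final_conclusion: str         # step 5
    conclusion_modified: bool     # step 6: did the Gold Standard change the conclusion?
    safety_concern: bool          # step 7: any safety concern about the output?


def missing_text_fields(record: ResidentQueryRecord) -> List[str]:
    """Return the names of free-text fields that were left blank."""
    return [
        f.name
        for f in fields(record)
        if isinstance(getattr(record, f.name), str)
        and not getattr(record, f.name).strip()
    ]
```

A check like `missing_text_fields` is one way to catch incomplete submissions before they reach the SME reviewers; the study description does not say whether such validation is performed.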
Attending physician Subject Matter Experts (SMEs), matched by specialty and with at least 5 years of post-training clinical experience, will then evaluate the residents' responses. Five years was chosen based on the book "Outliers" by Malcolm Gladwell, in which he asserts that 10,000 hours of focused practice is needed to achieve expertise in a field.
SMEs will be asked to evaluate the residents' initial clinical questions and their conclusions based only on OpenEvidence. They will be asked to rate the clinical appropriateness of those conclusions on a scale of 1-10. For questions where the SMEs rate the clinical appropriateness of the residents' conclusions poorly (< 5/10), they will be asked to review the OpenEvidence output and answer an additional question as to whether the output was incorrect or the resident misinterpreted the output from the tool.
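The follow-up rule described above (a rating below 5 out of 10 triggers a review of the OpenEvidence output) can be expressed as a small check. This is an illustrative sketch; the function, constant, and enum names are hypothetical, and only the 1-10 scale, the < 5 threshold, and the two follow-up answers come from the study description.

```python
from enum import Enum

REVIEW_THRESHOLD = 5  # ratings strictly below 5/10 trigger the follow-up question


class PoorRatingCause(Enum):
    """The two answers an SME can give to the follow-up question."""
    OUTPUT_INCORRECT = "output_incorrect"
    RESIDENT_MISINTERPRETED = "resident_misinterpreted"


def needs_output_review(rating: int) -> bool:
    """Return True when an SME appropriateness rating requires output review."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    return rating < REVIEW_THRESHOLD
```

Under this reading, a rating of exactly 5 does not trigger the follow-up, because the description specifies "< 5/10" rather than "5 or below".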
To accomplish goal #2, the initial prompt entered by the residents into OpenEvidence will be copied by the research team into ChatGPT, Gemini, and Claude. The outputs from each tool (including OpenEvidence) will be surfaced to SMEs, who will be asked to rate each output based on accuracy, completeness, and bias. Likert scales will be used for these ratings. SMEs will also be asked an open-ended question to identify any patient safety issues from any of the outputs.
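One way the SME Likert ratings might be summarized per tool and per dimension is sketched below. The study description does not state the number of Likert points or the planned analysis, so the use of the median and the `(tool, dimension, score)` input format are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple

# Tool and dimension names are taken from the study description.
TOOLS = ("OpenEvidence", "ChatGPT", "Gemini", "Claude")
DIMENSIONS = ("accuracy", "completeness", "bias")


def median_ratings(
    ratings: Iterable[Tuple[str, str, int]]
) -> Dict[Tuple[str, str], float]:
    """Group (tool, dimension, score) ratings and return the median of each group."""
    grouped = defaultdict(list)
    for tool, dimension, score in ratings:
        if tool not in TOOLS:
            raise ValueError(f"unknown tool: {tool}")
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        grouped[(tool, dimension)].append(score)
    return {key: median(scores) for key, scores in grouped.items()}
```

The median is used here because Likert responses are ordinal; a different summary statistic may be more appropriate depending on the investigators' analysis plan.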
Conditions
- AI (Artificial Intelligence)
- Large Language Model
- Generative Artificial Intelligence
Interventions
- OTHER
-
AI clinical reference tool
Residents will use the OpenEvidence clinical reference tool in the course of routine clinical care. They must also use a Gold Standard clinical reference tool (e.g., PubMed, UpToDate) to mitigate risk.
Sponsors & Collaborators
-
Cambridge Health Alliance
lead OTHER
Principal Investigators
-
Hannah K Galvin, MD · Cambridge Health Alliance
Eligibility
- Sex
- ALL
- Healthy Volunteers
- Yes
Timeline & Regulatory
- Start
- 2025-10-01
- Primary Completion
- 2026-05-30
- Completion
- 2026-09-30
Countries
- United States