Mitigating Automation Bias in Physician-LLM Diagnostic Reasoning Using Behavioral Nudges

Summary

The goal of this randomized controlled trial is to evaluate whether behavioral nudges can reduce automation bias, the uncritical acceptance of automated output, in physicians using large language models (LLMs) such as ChatGPT-5.1 for clinical decision-making.

The main question it aims to answer is: Does a dual-mechanism behavioral nudge intervention (baseline accuracy anchoring plus case-specific color-coded confidence signals) reduce physicians' uncritical acceptance of incorrect LLM recommendations?

Researchers will compare physicians who receive LLM recommendations with a behavioral nudge to physicians who receive the same recommendations without it, to assess whether the nudge reduces automation bias.

Participants will:

* Evaluate six clinical vignettes accompanied by LLM-generated recommendations (half containing deliberate, clinically significant errors).
* Control group: Be able to view LLM recommendations in standard format without the nudge.
* Treatment group: Be able to view ChatGPT's diagnostic accuracy on standard medical datasets as an initial anchor, then receive color-coded confidence signals alongside each recommendation (e.g., red for low confidence).
* Have their responses evaluated by blinded reviewers using an expert-developed assessment rubric to detect uncritical acceptance of erroneous information.

Conditions

Diagnosis

Interventions

OTHER

Behavioral Nudge Intervention

Participants in the treatment group will receive a behavioral nudge embedded in the LLM recommendation interface. When the LLM panel is expanded, it presents two synchronized cognitive cues: (1) an anchoring cue at the top of the panel that displays ChatGPT's baseline diagnostic accuracy on standard medical datasets, setting realistic expectations before the participant views the specific recommendation; and (2) a selective attention cue immediately below, which shows the LLM recommendation alongside a case-specific, color-coded confidence signal. The signal takes one of three values:

* Red: the mean ensemble confidence falls below the established baseline accuracy, flagging a high-uncertainty case that demands critical evaluation.
* Orange: the confidence meets or exceeds the baseline but remains below 100%; this is intended to prevent complacency and maintain active clinical scrutiny.
* Green: the ensemble reaches 100% consensus, though standard cautionary warnings still apply.
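The three-way color rule described above can be sketched in a few lines of Python. This is only an illustration of the thresholds as stated in the record, not the study's actual implementation: the function name `confidence_signal`, the `Signal` enum, the 0-to-1 scale, and the example baseline of 0.80 are all assumptions, and the record does not say how the mean ensemble confidence is computed.

```python
from enum import Enum


class Signal(Enum):
    """Hypothetical labels for the color-coded confidence signal."""
    RED = "red"
    ORANGE = "orange"
    GREEN = "green"


def confidence_signal(mean_confidence: float, baseline_accuracy: float) -> Signal:
    """Map a mean ensemble confidence to a color, following the stated rule.

    Both arguments are assumed to be fractions in [0, 1]; the record
    expresses them as percentages.
    """
    for value in (mean_confidence, baseline_accuracy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("confidence and baseline must lie in [0, 1]")

    # Red: confidence falls below the established baseline accuracy.
    if mean_confidence < baseline_accuracy:
        return Signal.RED
    # Green: 100% ensemble consensus.
    if mean_confidence >= 1.0:
        return Signal.GREEN
    # Orange: meets or exceeds the baseline but remains below 100%.
    return Signal.ORANGE


if __name__ == "__main__":
    baseline = 0.80  # illustrative value only, not taken from the study
    for conf in (0.60, 0.80, 0.95, 1.00):
        print(f"{conf:.2f} -> {confidence_signal(conf, baseline).value}")
```

Checking the red condition first keeps the boundary cases unambiguous: a confidence exactly equal to the baseline is orange, and only full consensus is green.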

Sponsors & Collaborators

Lahore University of Management Sciences (lead sponsor; class: OTHER)

Principal Investigators

Ihsan Ayyub Qazi, PhD · Lahore University of Management Sciences (LUMS)
Muhammad Hamad Alizai, PhD · Lahore University of Management Sciences (LUMS)
Muhammad Asadullah Khawaja, MBBS · King Edward Medical University
Ali Zafar Sheikh, MBBS · Lahore General Hospital
Muhammad Junaid Akhtar, MBBS · Children's Hospital, Lahore

Study Design

Allocation: RANDOMIZED
Purpose: DIAGNOSTIC
Masking: SINGLE
Model: PARALLEL

Eligibility

Sex: ALL
Healthy Volunteers: Yes

Timeline & Regulatory

Start: 2026-01-17
Primary Completion: 2026-07-31
Completion: 2026-08-31

Countries

Pakistan

Related Clinical Trials

Automation Bias in Physician-LLM Diagnostic Reasoning

The Impact of Large Language Models on Diagnostic Reasoning Among LLM-Trained Medical Doctors

Physician Reasoning on Management Cases With Large Language Models

Physician Reasoning on Diagnostic Cases With Large Language Models

Testing an AI Large Language Model Tool for Cognitive Debiasing in Musculoskeletal Care
