New system makes diagnostic AI's uncertainty visible (AI systems for medical diagnosis with uncertainty awareness)

2026-03-24 Massachusetts Institute of Technology (MIT)

This research from the Massachusetts Institute of Technology aims to realize "humble AI": systems that recognize their own uncertainty and can make judgments without overconfidence. Conventional AI systems tend to answer with high confidence even when they are wrong; this work introduces a mechanism that appropriately evaluates the confidence of each prediction and can conclude "I don't know" when it is uncertain. This is expected to improve safety in high-risk fields such as medicine and autonomous driving. It is also expected to make human-AI collaboration more trustworthy and to raise the quality of decision support. The work points to a new direction that values not only AI performance but also reliability and transparency.
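
The core idea can be illustrated with a minimal, hypothetical sketch of confidence-gated abstention: a model answers only when its confidence clears a threshold and otherwise defers. The toy probabilities and the 0.85 threshold below are assumptions for illustration, not the mechanism described in the paper.

# Minimal sketch of confidence-gated abstention: answer only when the
# top-class probability clears a threshold, otherwise defer to a human.
# The threshold and the toy inputs are illustrative assumptions.
import numpy as np

def predict_or_abstain(probs: np.ndarray, threshold: float = 0.85) -> str:
    """Return the predicted label, or "I don't know" if confidence is low."""
    top = int(np.argmax(probs))
    if probs[top] < threshold:
        return "I don't know"  # uncertain: defer rather than overcommit
    return f"class {top} (p={probs[top]:.2f})"

# A confident case and an ambiguous case.
print(predict_or_abstain(np.array([0.95, 0.03, 0.02])))  # -> class 0 (p=0.95)
print(predict_or_abstain(np.array([0.40, 0.35, 0.25])))  # -> I don't know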

<Related information>

Engineering framework for curiosity-driven and humble AI in clinical decision support

Janan Arslan, Kurt Benke, Sebastian Andres Cajas Ordones, …

BMJ Health and Care Informatics Published: 23 March 2026

Abstract

We present BODHI (Balanced, Open-minded, Diagnostic, Humble, and Inquisitive), an engineering framework for curiosity-driven and humble clinical decision support artificial intelligence (AI) systems. Despite growing capabilities, large language models (LLMs) often express inappropriate confidence, conflating statistical pattern recognition with genuine medical understanding. BODHI addresses this through a dual reflective architecture that: (1) decomposes epistemic uncertainty into task-specific dimensions, and (2) constrains model responses using virtue-based stance rules derived from a Virtue Activation Matrix. We validate the framework through controlled evaluation on 200 clinical vignettes from HealthBench Hard, assessing GPT-4o-mini and GPT-4.1-mini across 5 random seeds (2000 total observations). Statistical analysis included bootstrap resampling, paired t-tests, and effect size computation. BODHI improved overall clinical response quality (GPT-4.1-mini: +16.6 pp, p<0.0001, Cohen's d=11.56; GPT-4o-mini: +2.2 pp, p<0.0001, Cohen's d=1.56) and achieved very large effect sizes on curiosity (context-seeking rate: Cohen's d=16.38 and 19.54) and humility (hedging: d=5.80 for GPT-4.1-mini) metrics. Crucially, 97.3% of GPT-4.1-mini responses and 73.5% of GPT-4o-mini responses included appropriate clarifying questions, compared with 7.8% and 0.0% at baseline, demonstrating the framework's effectiveness in eliciting information-gathering behaviour. Findings suggest LLMs can be reliably constrained to operate within epistemic boundaries when provided with structured uncertainty decomposition and virtue-aligned response rules, offering a pathway towards safer clinical AI deployment.
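
To make the dual reflective architecture concrete, here is a hedged sketch of its two stages as the abstract describes them: scoring epistemic uncertainty along task-specific dimensions, then constraining the draft response with virtue-based stance rules. The dimension names, thresholds, and rule wording below are illustrative assumptions; the paper's actual Virtue Activation Matrix is not reproduced here.

# Hedged sketch of the dual reflective loop: (1) decompose epistemic
# uncertainty into task-specific dimensions, (2) apply virtue-based
# stance rules before releasing a response. All names and thresholds
# here are assumptions for illustration, not the published rule set.
from dataclasses import dataclass

@dataclass
class UncertaintyProfile:
    # Hypothetical task-specific uncertainty dimensions, each in [0, 1].
    missing_history: float       # key clinical context not yet provided
    differential_breadth: float  # how many diagnoses remain plausible
    evidence_conflict: float     # disagreement across guidelines/literature

def apply_stance_rules(draft: str, u: UncertaintyProfile) -> str:
    """Constrain a draft answer with virtue-aligned stance rules (sketch)."""
    out = draft
    if u.differential_breadth > 0.5 or u.evidence_conflict > 0.5:
        # Humility rule: force hedged phrasing instead of a flat assertion.
        out = "Based on the limited information available, one possibility is: " + out
    if u.missing_history > 0.5:
        # Curiosity rule: append a clarifying, context-seeking question.
        out += " Could you share the patient's medication list and symptom onset?"
    return out

profile = UncertaintyProfile(missing_history=0.8,
                             differential_breadth=0.6,
                             evidence_conflict=0.2)
print(apply_stance_rules("this presentation suggests viral gastroenteritis.", profile))

Keeping the uncertainty scoring and the stance rules as separate stages mirrors the abstract's two-part design: the same rule layer can then be audited or tuned independently of whichever model produced the draft.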
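
The reported analysis (bootstrap resampling, paired t-tests, and Cohen's d) can be reproduced on paired per-vignette quality scores along the following lines. The synthetic scores are placeholders sized to echo the +16.6 pp figure, not the study's data.

# Sketch of the reported statistics on paired per-vignette scores:
# paired t-test, paired Cohen's d, and a bootstrap CI for the mean gain.
# The synthetic scores are illustrative, not the paper's results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
baseline = rng.normal(0.55, 0.05, size=200)          # baseline quality scores
bodhi = baseline + rng.normal(0.166, 0.02, size=200) # illustrative +16.6 pp shift

diff = bodhi - baseline
t, p = stats.ttest_rel(bodhi, baseline)   # paired t-test
d = diff.mean() / diff.std(ddof=1)        # Cohen's d for paired samples

# Bootstrap 95% CI for the mean improvement (10,000 resamples).
boot = rng.choice(diff, size=(10_000, diff.size), replace=True).mean(axis=1)
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"t={t:.2f} p={p:.1e} d={d:.2f} CI=[{lo:.3f}, {hi:.3f}]")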
