New AI model reads human intentions with striking accuracy (Can AI read humans’ minds? A new model shows it’s shockingly good at it)

2025-12-08 Texas A&M University

An international research team including the Texas A&M University College of Engineering and the Korea Advanced Institute of Science and Technology (KAIST) has developed a new AI system, OmniPredict. Built on a multimodal large language model (MLLM), it predicts the next actions of humans, chiefly pedestrians, in real time from multiple sources of information such as images, video, and contextual cues. Going beyond conventional vision models and simple motion forecasting, it shows the potential to infer pedestrians’ intentions and upcoming movements with high accuracy.
In experiments on existing benchmarks (JAAD and WiDEVIEW), the system achieved roughly 67% accuracy without any task-specific training, outperforming existing models. It combines not only visual appearance but also diverse inputs such as gaze, body posture, speed, and the surrounding context to predict, for example, whether a pedestrian will cross, become occluded, or keep walking. The approach is expected to find applications in safer autonomous driving, fewer pedestrian accidents, crowd-dynamics analysis, and behavior prediction in emergencies. For now, however, it remains a research model, and caution is urged before any immediate real-world deployment.

An overview of OmniPredict: a GPT-4o-powered system that blends scene images, close-up views, bounding boxes, and vehicle speed to understand what pedestrians might do next. By analyzing this rich mix of inputs, the model sorts behavior into four key categories (crossing, occlusion, actions, and gaze) to make smarter, safer predictions. Credit: Dr. Srikanth Saripalli/Texas A&M University College of Engineering. https://doi.org/10.1016/j.compeleceng.2025.110741
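
As a concrete illustration of the kind of zero-shot, multimodal query described above, the sketch below assembles a scene frame, a pedestrian close-up, a bounding box, and the ego-vehicle speed into a single GPT-4o request and asks for the four behavior labels. This is only a minimal sketch using the public OpenAI Python SDK under stated assumptions; the prompt wording, JSON schema, label values, and helper names are illustrative, not the prompts actually used in OmniPredict.

```python
# Illustrative sketch only: the actual OmniPredict prompts and output schema
# are not published in this article; the field names below are assumptions.
import base64
import json
from openai import OpenAI  # OpenAI Python SDK v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    """Return a base64 data URL for a local JPEG frame."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def predict_pedestrian_behavior(scene_jpg: str, crop_jpg: str,
                                bbox: tuple[int, int, int, int],
                                vehicle_speed_kmh: float) -> dict:
    """Zero-shot query: classify crossing / occlusion / action / look."""
    prompt = (
        "You are analyzing a driving scene. The pedestrian of interest is at "
        f"bounding box {bbox} (x, y, w, h) in the full frame; a close-up crop "
        f"is also attached. Ego-vehicle speed: {vehicle_speed_kmh:.1f} km/h.\n"
        "Answer as JSON with keys: crossing (yes/no), occlusion (none/partial/full), "
        "action (walking/standing), look (looking/not_looking)."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": encode_image(scene_jpg)}},
                {"type": "image_url", "image_url": {"url": encode_image(crop_jpg)}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```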

<Related information>

Multimodal understanding with GPT-4o to enhance generalizable pedestrian behavior prediction

Je-Seok Ham, Jia Huang, Peng Jiang, Jinyoung Moon, Yongjin Kwon, Srikanth Saripalli, Changick Kim
Computers and Electrical Engineering, available online 18 October 2025
DOI: https://doi.org/10.1016/j.compeleceng.2025.110741

Highlights

  • First study applying GPT-4o in OmniPredict for pedestrian behavior prediction.
  • Achieve 67% prediction accuracy in zero-shot setting without task-specific training.
  • Surpass the latest MLLM baselines by 10% on pedestrian crossing intention prediction.
  • Predict crossing, occlusion, action, and look using multi-contextual modalities.
  • Demonstrate strong generalization across unseen driving scenarios without retraining.

Abstract

Pedestrian behavior prediction is one of the most critical tasks in urban driving scenarios, playing a key role in ensuring road safety. Traditional learning-based methods have relied on vision models for pedestrian behavior prediction. However, fully understanding pedestrians’ behaviors in advance is very challenging due to the complex driving environments and the multifaceted interactions between pedestrians and road elements. Additionally, these methods often show a limited understanding of driving environments not included in the training. The emergence of Multimodal Large Language Models (MLLMs) provides an innovative approach to addressing these challenges through advanced reasoning capabilities. This paper presents OmniPredict, the first study to apply GPT-4o(mni), a state-of-the-art MLLM, for pedestrian behavior prediction in urban driving scenarios. We assessed the model using the JAAD and WiDEVIEW datasets, which are widely used for pedestrian behavior analysis. Our method utilized multiple contextual modalities and achieved 67% accuracy in a zero-shot setting without any task-specific training, surpassing the performance of the latest MLLM baselines by 10%. Furthermore, when incorporating additional contextual information, the experimental results demonstrated a significant increase in prediction accuracy across four behavior types (crossing, occlusion, action, and look). We also validated the model’s generalization ability by comparing its responses across various road environment scenarios. OmniPredict exhibits strong generalization capabilities, demonstrating robust decision-making in diverse, rare, and unseen driving scenarios. These findings highlight the potential of MLLMs to enhance pedestrian behavior prediction, paving the way for safer and more informed decision-making in road environments.
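
The abstract reports overall and per-category accuracy over the four behavior types. As a rough sketch of how such scores could be tallied against JAAD/WiDEVIEW-style annotations, the snippet below compares predicted and annotated labels sample by sample; the dictionary schema reuses the hypothetical format from the earlier example and is not taken from the paper.

```python
# Hypothetical scoring sketch: tallies per-category accuracy for the four
# behavior types reported in the paper (crossing, occlusion, action, look).
from collections import defaultdict

CATEGORIES = ("crossing", "occlusion", "action", "look")

def per_category_accuracy(predictions: list[dict], labels: list[dict]) -> dict:
    """Compare predicted and annotated behavior labels sample by sample."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gold in zip(predictions, labels):
        for cat in CATEGORIES:
            total[cat] += 1
            if pred.get(cat) == gold.get(cat):
                correct[cat] += 1
    return {cat: correct[cat] / total[cat] for cat in CATEGORIES if total[cat]}

# Example with toy data (not real benchmark annotations):
preds = [{"crossing": "yes", "occlusion": "none", "action": "walking", "look": "looking"}]
golds = [{"crossing": "yes", "occlusion": "partial", "action": "walking", "look": "looking"}]
print(per_category_accuracy(preds, golds))  # {'crossing': 1.0, 'occlusion': 0.0, ...}
```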

1603 Information systems and data engineering