2025-12-08 Texas A&M University
In experiments on existing benchmarks (JAAD and WiDEVIEW), the model achieved roughly 67% accuracy without any prior task-specific training, outperforming existing models. Rather than relying on vision alone, it combines diverse inputs such as gaze, body posture, speed, and the surrounding context to predict, for example, whether a pedestrian will cross, become occluded, or keep walking. This is expected to support a wide range of applications, including improving the safety of autonomous vehicles, reducing pedestrian accidents, understanding crowd dynamics, and anticipating behavior in emergencies. The authors caution, however, that this is still a research model and that immediate real-world deployment should be approached carefully.

An overview of OmniPredict: a GPT-4o-powered system that blends scene images, close-up views, bounding boxes, and vehicle speed to understand what pedestrians might do next. By analyzing this rich mix of inputs, the model sorts behavior into four key categories (crossing, occlusion, actions, and gaze) to make smarter, safer predictions. Credit: Dr. Srikanth Saripalli/Texas A&M University College of Engineering. https://doi.org/10.1016/j.compeleceng.2025.110741
<Related information>
- https://stories.tamu.edu/news/2025/12/08/can-ai-read-humans-minds-a-new-model-shows-its-shockingly-good-at-it/
- https://www.sciencedirect.com/science/article/pii/S0045790625006846
Multimodal understanding with GPT-4o to enhance generalizable pedestrian behavior prediction
Je-Seok Ham, Jia Huang, Peng Jiang, Jinyoung Moon, Yongjin Kwon, Srikanth Saripalli, Changick Kim
Computers and Electrical Engineering, Available online: 18 October 2025
DOI: https://doi.org/10.1016/j.compeleceng.2025.110741
Highlights
- First study applying GPT-4o in OmniPredict for pedestrian behavior prediction.
- Achieve 67% prediction accuracy in zero-shot setting without task-specific training.
- Surpass the latest MLLM baselines by 10% on pedestrian crossing intention prediction.
- Predict crossing, occlusion, action, and look using multi-contextual modalities.
- Demonstrate strong generalization across unseen driving scenarios without retraining.
Abstract
Pedestrian behavior prediction is one of the most critical tasks in urban driving scenarios, playing a key role in ensuring road safety. Traditional learning-based methods have relied on vision models for pedestrian behavior prediction. However, fully understanding pedestrians’ behaviors in advance is very challenging due to the complex driving environments and the multifaceted interactions between pedestrians and road elements. Additionally, these methods often show a limited understanding of driving environments not included in the training. The emergence of Multimodal Large Language Models (MLLMs) provides an innovative approach to addressing these challenges through advanced reasoning capabilities. This paper presents OmniPredict, the first study to apply GPT-4o(mni), a state-of-the-art MLLM, for pedestrian behavior prediction in urban driving scenarios. We assessed the model using the JAAD and WiDEVIEW datasets, which are widely used for pedestrian behavior analysis. Our method utilized multiple contextual modalities and achieved 67% accuracy in a zero-shot setting without any task-specific training, surpassing the performance of the latest MLLM baselines by 10%. Furthermore, when incorporating additional contextual information, the experimental results demonstrated a significant increase in prediction accuracy across four behavior types (crossing, occlusion, action, and look). We also validated the model's generalization ability by comparing its responses across various road environment scenarios. OmniPredict exhibits strong generalization capabilities, demonstrating robust decision-making in diverse, unseen, and rare driving scenarios. These findings highlight the potential of MLLMs to enhance pedestrian behavior prediction, paving the way for safer and more informed decision-making in road environments.
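To make the zero-shot, multi-contextual setup described in the abstract more concrete, below is a minimal sketch of how such a query to GPT-4o could be assembled with the standard OpenAI chat-completions vision API. The prompt wording, the helper names (encode_image, ask_pedestrian_behavior), and the example values are illustrative assumptions, not the paper's actual prompts or preprocessing.

```python
# Illustrative sketch only: a zero-shot multimodal query in the spirit of OmniPredict.
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Read an image file and return a base64 data URL accepted by the vision API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def ask_pedestrian_behavior(scene_path: str, crop_path: str,
                            bbox: tuple[int, int, int, int],
                            ego_speed_kmh: float) -> str:
    """Ask GPT-4o to classify one pedestrian's behavior without any fine-tuning."""
    prompt = (
        "You are analyzing an urban driving scene. The pedestrian of interest is inside "
        f"bounding box {bbox} (x, y, w, h) in the full scene image; a close-up crop is "
        f"also provided. The ego vehicle is moving at {ego_speed_kmh} km/h. For each of "
        "the four behaviors, crossing, occlusion, action (walking or standing), and look "
        "(gazing toward the vehicle), answer yes or no and briefly justify each answer."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": encode_image(scene_path)}},
                {"type": "image_url", "image_url": {"url": encode_image(crop_path)}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example call (paths and numbers are placeholders):
# print(ask_pedestrian_behavior("scene.jpg", "pedestrian_crop.jpg", (412, 188, 64, 160), 32.0))
```

The point of the sketch is the structure of the input: a full scene image, a pedestrian close-up, a bounding box, and the ego-vehicle speed are all passed in a single prompt, and the model is asked to reason jointly about the four behavior types rather than being trained on any of them.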


