Humans are better than AI at reading the room


2025-04-24 Johns Hopkins University (JHU)

A research team at Johns Hopkins University has shown that AI models are limited in their ability to understand social context the way humans do. In the study, humans and AI models were asked to evaluate short three-second video clips: humans gave consistent judgments, while the AI models were inconsistent in their understanding of social interactions. The researchers attribute this to the fact that current AI systems are built on networks designed for processing static images, which makes them ill-suited to processing dynamic social scenes. The study suggests that for AI to interact effectively with people, its ability to understand social context must be improved.

<Related information>

Modeling dynamic social vision highlights gaps between deep learning and humans

Kathy Garcia · Emalie McMahon · Colin Conwell · Michael Bonner · Leyla Isik
International Conference on Learning Representations (ICLR), April 24, 2025

Abstract

Deep learning models trained on computer vision tasks are widely considered the most successful models of human vision to date. The majority of work that supports this idea evaluates how accurately these models predict behavior and brain responses to static images of objects and scenes. Real-world vision, however, is highly dynamic, and far less work has evaluated deep learning models on human responses to moving stimuli, especially those that involve more complicated, higher-order phenomena like social interactions. Here, we extend a dataset of natural videos depicting complex multi-agent interactions by collecting human-annotated sentence captions for each video, and we benchmark 350+ image, video, and language models on behavior and neural responses to the videos. As in prior work, we find that many vision models reach the noise ceiling in predicting visual scene features and responses along the ventral visual stream (often considered the primary neural substrate of object and scene recognition). In contrast, vision models poorly predict human action and social interaction ratings and neural responses in the lateral stream (a neural pathway theorized to specialize in dynamic, social vision), though video models show a striking advantage in predicting mid-level lateral stream regions. Language models (given human sentence captions of the videos) predict action and social ratings better than image and video models, but perform poorly at predicting neural responses in the lateral stream. Together, these results identify a major gap in AI’s ability to match human social vision and provide insights to guide future model development for dynamic, natural contexts.
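The abstract describes benchmarking models by how well their representations predict human behavioral ratings of videos. A common way to run such a benchmark is to fit a regularized linear mapping (e.g. ridge regression) from model embeddings to human ratings on held-out stimuli, then score the predictions by correlation. The sketch below illustrates that general approach with synthetic data; it is not the authors' actual pipeline, and all names and data here are illustrative assumptions.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha*I)^-1 X^T y."""
    d = X_train.shape[1]
    w = np.linalg.solve(X_train.T @ X_train + alpha * np.eye(d),
                        X_train.T @ y_train)
    return X_test @ w

def pearson_r(a, b):
    """Pearson correlation between two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins: rows = video clips, columns = model embedding features.
rng = np.random.default_rng(0)
n_videos, n_features = 200, 50
X = rng.standard_normal((n_videos, n_features))        # hypothetical embeddings
w_true = rng.standard_normal(n_features)
ratings = X @ w_true + rng.standard_normal(n_videos)   # hypothetical human ratings

# Train/test split over stimuli; score = correlation on held-out videos.
split = 150
pred = ridge_fit_predict(X[:split], ratings[:split], X[split:])
score = pearson_r(pred, ratings[split:])
```

In the paper's framing, a score like this would be compared against a noise ceiling estimated from inter-rater reliability, so that models are judged relative to how predictable the human data is in principle.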
