2026-05-25 National Institute of Advanced Industrial Science and Technology (AIST)

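Figure 1. Overview of the Multi-Object in 3D Dataset (MO3D). The dataset comprises three types of question-answering tasks and contains a total of 70,000 paired samples of 3D point clouds and question-answer data.
* The figure is quoted and adapted from the original paper.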
<Related information>
- https://www.aist.go.jp/aist_j/press_release/pr2026/pr20260525/pr20260525.html
- https://openaccess.thecvf.com/content/CVPR2026F/papers/Ide_Beyond_Single_Object_Learning_3D_Relations_with_Large_Language_Models_CVPRF_2026_paper.pdf
Beyond Single Object: Learning 3D Relations with Large Language Models
Kohsuke Ide, Ryousuke Yamada, Yue Qiu, Xianzheng Ma, Yoshihiro Fukuhara, Hirokatsu Kataoka, Yutaka Satoh
The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026
Abstract
We address a fundamental gap in 3D-LLMs: existing models focus on single-object/scene description, struggling with detailed, inter-object comparison. We propose a framework for detailed object-level reasoning across multiple objects with three components: (1) MO3D (Multi-Object in 3D), an instruction dataset requiring fine-grained multi-object comparison; (2) Multi-3DLLM, using a minimal Patch-Interaction Transformer (PIT) that models inter/intra-object relationships while preserving local geometry; (3) Mini-apps, two application-driven benchmarks (Shape Mating, Change Captioning) that probe geometric understanding for practical use. Recent 3D-LLMs and 2D-VLMs perform poorly on these tasks, lacking both comparison-centric design and geometric awareness. In contrast, Multi-3DLLM trained on our mixture data learns geometric reasoning, surpasses all baselines on MO3D, and provides positive transfer to single-object classification.

