Why can’t powerful AIs learn basic multiplication?

2025-12-24 University of Chicago

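According to a new study, modern large language models (LLMs) can generate complex text and carry out sophisticated reasoning, yet they almost never get basic arithmetic right, such as multiplying two 4-digit numbers. The research team (University of Chicago, MIT, Harvard University, and others) attributes this to a failure to learn long-range dependencies. Multiplication requires holding on to intermediate partial products and carries and using them later, but under standard fine-tuning (SFT) the model does not spontaneously learn a mechanism for storing and reusing that intermediate information internally, and it settles into a local optimum. In contrast, a model trained with a method called Implicit Chain of Thought (ICoT) learned to store information from intermediate computation steps in its internal state and achieved perfect accuracy. Furthermore, adding an auxiliary loss that predicts the "running sum" allowed the SFT model to reach high accuracy as well, showing that the key is building a mechanism that handles information as memory. These findings have implications beyond multiplication, for the learning and reasoning processes of language models in general.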
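The long-range dependency described above can be made concrete with a small arithmetic sketch. The Python below is an illustration only, not the authors' code: the function names `running_sums` and `digits_to_int` are hypothetical, and the definition of the "running sum" used here (the sum of all partial products feeding one output position, plus the carry from the previous position) is an assumption about what that term refers to. It shows that each output digit depends on digit pairs drawn from across both operands and on a carry that must be remembered from earlier positions.

```python
def running_sums(a: int, b: int) -> list[tuple[int, int, int]]:
    """Return (column_total, output_digit, carry_out) for each output
    position of a * b, least significant position first."""
    a_digits = [int(ch) for ch in reversed(str(a))]
    b_digits = [int(ch) for ch in reversed(str(b))]
    steps = []
    carry = 0
    for k in range(len(a_digits) + len(b_digits)):
        # Every partial product a_i * b_j with i + j == k feeds position k,
        # so one output digit can depend on digits far apart in the input.
        partial = sum(
            a_digits[i] * b_digits[k - i]
            for i in range(len(a_digits))
            if 0 <= k - i < len(b_digits)
        )
        # The carry is the information that must be kept from earlier steps.
        total = partial + carry
        digit, carry = total % 10, total // 10
        steps.append((total, digit, carry))
    return steps


def digits_to_int(steps: list[tuple[int, int, int]]) -> int:
    """Reassemble the product from the per-position output digits."""
    return sum(digit * 10**k for k, (_, digit, _) in enumerate(steps))


if __name__ == "__main__":
    for k, (total, digit, carry) in enumerate(running_sums(1234, 5678)):
        print(f"position {k}: total={total} digit={digit} carry={carry}")
```

A model that only attends to nearby digits cannot produce the middle positions correctly, because those totals combine the largest number of digit pairs and the longest chain of carries.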

<Related Information>

Why Can’t Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls

Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Viégas, Martin Wattenberg, Andrew Lee
arXiv Submitted on 30 Sep 2025
DOI: https://doi.org/10.48550/arXiv.2510.00184

Abstract

Language models are increasingly capable, yet still fail at a seemingly simple task of multi-digit multiplication. In this work, we study why, by reverse-engineering a model that successfully learns multiplication via implicit chain-of-thought, and report three findings: (1) Evidence of long-range structure: Logit attributions and linear probes indicate that the model encodes the necessary long-range dependencies for multi-digit multiplication. (2) Mechanism: the model encodes long-range dependencies using attention to construct a directed acyclic graph to “cache” and “retrieve” pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and digits are represented using a Fourier basis, both of which are intuitive and efficient representations that the standard fine-tuning model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the “running sum” via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the correct inductive bias can address this issue.
