Mathematically modeling how AIs understand words (How AIs understand words)


2025-06-20 Swiss Federal Institute of Technology Lausanne (EPFL)

A research team at EPFL led by Professor Lenka Zdeborová has developed a minimal mathematical model, Bilinear Sequence Regression (BSR), to elucidate how large language models (LLMs) process language as sequences of tokens (words or word fragments). BSR recasts, in an analytically tractable form, the approach of embedding each token as a high-dimensional vector and representing an entire sentence as a table of such vectors. Using this model, the team clarified why the sequence structure is more effective for learning than "flattening" (concatenating all the vectors into one), and identified thresholds on the amount of data required. BSR captures mathematically the moment when learning crosses the threshold beyond which meaning can be extracted, offering theoretical guidance for AI design. The result provides a new perspective that could improve the efficiency and transparency of future LLMs.
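The token-table picture above can be made concrete with a small sketch. Here we assume a rank-1 bilinear teacher of the form y = uᵀXv / √(Ld) (the exact rank and scaling used in the paper may differ): one weight vector mixes positions, one mixes token features. The sketch also shows that "flattening" the same data turns the L + d bilinear parameters into L·d unconstrained ones.

```python
# Minimal sketch of a bilinear sequence regression (BSR) teacher, assuming
# a rank-1 bilinear readout y = u^T X v / sqrt(L * d); the exact scaling
# and rank in the paper may differ.
import numpy as np

rng = np.random.default_rng(0)

L, d, n = 8, 16, 100          # sequence length, token dimension, samples

# Teacher weights: one vector mixing positions (u), one mixing features (v).
u = rng.standard_normal(L)
v = rng.standard_normal(d)

# Each input is a sequence of L tokens, each a d-dimensional vector.
X = rng.standard_normal((n, L, d))

# Bilinear labels: y_i = u^T X_i v / sqrt(L d).
y = np.einsum("l,nld,d->n", u, X, v) / np.sqrt(L * d)

# "Flattening" discards the token structure: the label is still linear in
# the flattened input, but the weight matrix u v^T now has L*d free
# parameters instead of L + d.
W_flat = np.outer(u, v).ravel() / np.sqrt(L * d)
y_flat = X.reshape(n, L * d) @ W_flat

assert np.allclose(y, y_flat)  # same labels, very different parameter counts
```

The point of the comparison is the parameter count: the sequence-aware parameterization needs only L + d numbers, which is why it can learn from less data than the flattened view.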

<Related information>

Bilinear Sequence Regression: A Model for Learning from Long Sequences of High-Dimensional Tokens

Vittorio Erba, Emanuele Troiani, Luca Biggio, Antoine Maillard, and Lenka Zdeborová
Physical Review X, Published: 16 June 2025
DOI: https://doi.org/10.1103/l4p2-vrxt


Abstract

Current progress in artificial intelligence is centered around so-called large language models that consist of neural networks processing long sequences of high-dimensional vectors called tokens. Statistical physics provides powerful tools to study the functioning of learning with neural networks and has played a recognized role in the development of modern machine learning. The statistical physics approach relies on simplified and analytically tractable models of data. However, simple tractable models for long sequences of high-dimensional tokens are largely underexplored. Inspired by the crucial role models such as the single-layer teacher-student perceptron (also known as generalized linear regression) played in the theory of fully connected neural networks, in this paper, we introduce and study the bilinear sequence regression (BSR) as one of the most basic models for sequences of tokens. We note that modern architectures naturally subsume the BSR model due to the skip connections. Building on recent methodological progress, we compute the Bayes-optimal generalization error for the model in the limit of long sequences of high-dimensional tokens and provide a message-passing algorithm that matches this performance. We quantify the improvement that optimal learning brings with respect to vectorizing the sequence of tokens and learning via simple linear regression. We also unveil surprising properties of the gradient descent algorithms in the BSR model.
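The abstract quantifies the gain of optimal learning over vectorizing the sequence and running plain linear regression. A hedged illustration of that gap: below we fit the same rank-1 bilinear teacher (an assumed form, as above) two ways, (a) minimum-norm linear regression on the flattened L·d input and (b) alternating least squares over the L + d bilinear parameters. The alternating scheme is only an illustrative baseline, not the Bayes-optimal message-passing algorithm from the paper.

```python
# Sketch: structure-aware fitting vs. flattened linear regression on a
# rank-1 bilinear teacher (assumed form; not the paper's exact algorithm).
import numpy as np

rng = np.random.default_rng(1)
L, d = 6, 10                       # L + d = 16 parameters vs L * d = 60
n_train, n_test = 40, 500          # L + d < n_train < L * d

u_star = rng.standard_normal(L)
v_star = rng.standard_normal(d)

def labels(X):
    # Teacher: y = u*^T X v* / sqrt(L d)
    return np.einsum("l,nld,d->n", u_star, X, v_star) / np.sqrt(L * d)

X_tr = rng.standard_normal((n_train, L, d))
X_te = rng.standard_normal((n_test, L, d))
y_tr, y_te = labels(X_tr), labels(X_te)

# (a) Flattened: underdetermined (40 samples, 60 weights), so the
# minimum-norm solution interpolates the training set but generalizes poorly.
W = np.linalg.lstsq(X_tr.reshape(n_train, -1), y_tr, rcond=None)[0]
err_flat = np.mean((X_te.reshape(n_test, -1) @ W - y_te) ** 2)

# (b) Bilinear: alternate exact least-squares solves for u and v.
u = rng.standard_normal(L)
v = rng.standard_normal(d)
for _ in range(50):
    A = np.einsum("nld,d->nl", X_tr, v) / np.sqrt(L * d)   # features for u
    u = np.linalg.lstsq(A, y_tr, rcond=None)[0]
    B = np.einsum("nld,l->nd", X_tr, u) / np.sqrt(L * d)   # features for v
    v = np.linalg.lstsq(B, y_tr, rcond=None)[0]
err_bsr = np.mean(
    (np.einsum("l,nld,d->n", u, X_te, v) / np.sqrt(L * d) - y_te) ** 2
)

assert err_bsr < err_flat  # structure-aware fit generalizes better
```

With n_train between L + d and L·d, the flattened regression has more unknowns than equations and overfits, while the bilinear parameterization has enough data to recover the teacher, which is the qualitative regime the paper analyzes exactly.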
