Development of a Dataset Distillation Method Integrating Hyperbolic Geometry and Information Theory: Expected to Contribute to Memory-Efficient, High-Efficiency Large-Scale AI Training

2025-10-28 Hokkaido University

Professor Miki Haseyama and colleagues at the Faculty of Information Science and Technology, Hokkaido University, have developed a new dataset distillation method that integrates hyperbolic geometry and information theory. The technique drastically reduces the amount of training data while maintaining the learning performance of AI models. The team first introduced a hyperbolic geometric space to align the distributions of the original and synthetic data while preserving the hierarchical semantic relationships between classes. In addition, drawing on information theory, they controlled the amount of information in the diffusion-based generation process, suppressing the lack of diversity that arises when only a small number of samples are generated. Together, these techniques achieve both compression that preserves semantic structure and generation of synthetic data with little bias, reducing training time, memory, and storage requirements. This is the first work to apply hyperbolic geometry to dataset distillation, opening a new path toward resource-efficient AI training. The research was accepted at the NeurIPS 2025 main conference and a NeurIPS 2025 workshop.

[Figure: Example images distilled by the proposed method]

<Related Information>

Hyperbolic Dataset Distillation

Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
arXiv, last revised 17 Oct 2025 (this version, v2)
DOI: https://doi.org/10.48550/arXiv.2505.24623

Abstract

To address the computational and storage challenges posed by large-scale datasets in deep learning, dataset distillation has been proposed to synthesize a compact dataset that replaces the original while maintaining comparable model performance. Unlike optimization-based approaches that require costly bi-level optimization, distribution matching (DM) methods improve efficiency by aligning the distributions of synthetic and original data, thereby eliminating nested optimization. DM achieves high computational efficiency and has emerged as a promising solution. However, existing DM methods, constrained to Euclidean space, treat data as independent and identically distributed points, overlooking complex geometric and hierarchical relationships. To overcome this limitation, we propose a novel hyperbolic dataset distillation method, termed HDD. Hyperbolic space, characterized by negative curvature and exponential volume growth with distance, naturally models hierarchical and tree-like structures. HDD embeds features extracted by a shallow network into the Lorentz hyperbolic space, where the discrepancy between synthetic and original data is measured by the hyperbolic (geodesic) distance between their centroids. By optimizing this distance, the hierarchical structure is explicitly integrated into the distillation process, guiding synthetic samples to gravitate towards the root-centric regions of the original data distribution while preserving their underlying geometric characteristics. Furthermore, we find that pruning in hyperbolic space requires only 20% of the distilled core set to retain model performance, while significantly improving training stability. To the best of our knowledge, this is the first work to incorporate the hyperbolic space into the dataset distillation process. The code is available at this https URL.
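To make the distance computation described in the abstract concrete, the following is a minimal PyTorch sketch of distribution matching in the Lorentz model. It is not the authors' released code, and all function names are illustrative: Euclidean features are lifted onto the hyperboloid via the exponential map at the origin, per-class centroids are formed with the standard closed-form Lorentzian centroid, and the loss is the geodesic distance arccosh(-⟨x, y⟩_L) between the real and synthetic centroids.

```python
import torch

def lorentz_inner(x, y):
    # Lorentzian inner product <x, y>_L = -x0*y0 + sum_i xi*yi.
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def exp_map_origin(v, eps=1e-6):
    # Lift Euclidean features v (treated as tangent vectors at the
    # hyperboloid origin) onto the Lorentz model: <x, x>_L = -1.
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.cat([torch.cosh(norm), torch.sinh(norm) * v / norm], dim=-1)

def lorentz_centroid(x, eps=1e-6):
    # Closed-form Lorentzian centroid: rescale the Euclidean mean so
    # that it lands back on the hyperboloid.
    m = x.mean(dim=0)
    denom = torch.sqrt((-lorentz_inner(m, m)).clamp_min(eps))
    return m / denom

def geodesic_distance(x, y, eps=1e-6):
    # d(x, y) = arccosh(-<x, y>_L); clamp keeps acosh in its domain.
    return torch.acosh((-lorentz_inner(x, y)).clamp_min(1.0 + eps))

def hdd_style_loss(real_feats, syn_feats):
    # Per-class distribution-matching loss: geodesic distance between
    # the hyperbolic centroids of real and synthetic embeddings.
    mu_real = lorentz_centroid(exp_map_origin(real_feats))
    mu_syn = lorentz_centroid(exp_map_origin(syn_feats))
    return geodesic_distance(mu_real, mu_syn)
```

In a full pipeline this per-class loss would be summed over all classes and minimized with respect to the synthetic images, with the feature extractor being the shallow network mentioned in the abstract.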


Information-Guided Diffusion Sampling for Dataset Distillation

Linfeng Ye, Shayan Mohajer Hamidi, Guang Li, Takahiro Ogawa, Miki Haseyama, Konstantinos N. Plataniotis
arXiv, submitted on 7 Jul 2025
DOI: https://doi.org/10.48550/arXiv.2507.04619

Abstract

Dataset distillation aims to create a compact dataset that retains essential information while maintaining model performance. Diffusion models (DMs) have shown promise for this task but struggle in low images-per-class (IPC) settings, where generated samples lack diversity. In this paper, we address this issue from an information-theoretic perspective by identifying two key types of information that a distilled dataset must preserve: (i) prototype information I(X; Y), which captures label-relevant features; and (ii) contextual information H(X|Y), which preserves intra-class variability. Here, (X, Y) represents the pair of random variables corresponding to the input data and its ground-truth label, respectively. Observing that the required contextual information scales with IPC, we propose maximizing I(X; Y) + βH(X|Y) during the DM sampling process, where β is IPC-dependent. Since directly computing I(X; Y) and H(X|Y) is intractable, we develop variational estimations to tightly lower-bound these quantities via a data-driven approach. Our approach, information-guided diffusion sampling (IGDS), seamlessly integrates with diffusion models and improves dataset distillation across all IPC settings. Experiments on Tiny ImageNet and ImageNet subsets show that IGDS significantly outperforms existing methods, particularly in low-IPC regimes. The code will be released upon acceptance.
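As a schematic illustration of the sampling-time objective I(X; Y) + βH(X|Y), here is a classifier-guidance-style PyTorch sketch. It uses the log-likelihood of a pretrained classifier as a simple variational stand-in for the label-relevant term and the classifier's predictive entropy as a stand-in for the diversity term; the paper's actual variational estimators differ, so every function name, signature, and surrogate here is an assumption for illustration only.

```python
import torch
import torch.nn.functional as F

def information_guidance(x_t, y, classifier, beta):
    # Gradient of a surrogate objective: log q(y | x_t) stands in for
    # the label-relevant term I(X; Y) (up to constants), and the
    # predictive entropy of q(. | x_t) stands in for the intra-class
    # diversity term H(X|Y). Both surrogates are illustrative only.
    x_t = x_t.detach().requires_grad_(True)
    log_prob = F.log_softmax(classifier(x_t), dim=-1)
    label_term = log_prob.gather(1, y[:, None]).sum()
    entropy_term = -(log_prob.exp() * log_prob).sum()
    objective = label_term + beta * entropy_term
    return torch.autograd.grad(objective, x_t)[0]

def guided_denoise_step(x_t, t, y, eps_model, classifier, beta, scale, alpha_bar):
    # One classifier-guided DDPM-style step: shift the predicted noise
    # by the guidance gradient, as in standard classifier guidance.
    # eps_model(x_t, t) and the 1-D tensor alpha_bar of cumulative
    # alphas are assumed interfaces, not a specific library's API.
    eps = eps_model(x_t, t)
    g = information_guidance(x_t, y, classifier, beta)
    return eps - scale * (1 - alpha_bar[t]).sqrt() * g
```

Following the abstract, β would be set as a function of the target IPC: small when few images per class are needed (favoring prototypical samples) and larger as IPC grows (admitting more intra-class variability).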
