New AI defense method shields models from adversarial attacks

2025-03-05 Los Alamos National Laboratory (LANL)

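Researchers at Los Alamos National Laboratory (LANL) have developed the Latent Feature Attack (LaFA), a new attack on non-negative matrix factorization (NMF) algorithms. NMF had previously been regarded as robust to attacks, but LaFA targets the latent features produced by the NMF process: by adding a small perturbation to the original data, it can substantially alter the latent features that are extracted. To reduce the memory cost of gradient backpropagation, the method uses implicit differentiation, which allows it to scale to large datasets. The work exposes a vulnerability in NMF and shows that, like other machine learning techniques, NMF is at risk from adversarial attacks.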
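To make the idea of "latent feature sensitivity" concrete, the following is a minimal sketch, not the LaFA method itself. It assumes a plain NumPy implementation of NMF with Lee-Seung multiplicative updates and uses small random noise as a stand-in perturbation; LaFA instead optimizes the perturbation with gradients obtained through implicit differentiation. The function names `nmf` and `relative_change`, the data sizes, and the noise scale are all illustrative assumptions.

```python
import numpy as np

def nmf(X, rank, n_iter=200, seed=0, eps=1e-9):
    """NMF via multiplicative updates, minimizing ||X - W H||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W and H stay >= 0.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def relative_change(A, B):
    """Frobenius-norm distance between A and B, relative to the norm of A."""
    return np.linalg.norm(A - B) / np.linalg.norm(A)

rng = np.random.default_rng(42)
X = rng.random((30, 4)) @ rng.random((4, 20))  # synthetic nonnegative rank-4 data
W, H = nmf(X, rank=4)

# Assumption: random noise stands in for an attack; it is NOT an optimized
# adversarial perturbation. Clipping keeps the perturbed data nonnegative.
delta = 0.01 * X.mean() * rng.standard_normal(X.shape)
X_adv = np.clip(X + delta, 0.0, None)

# Same seed, so both runs start from the same initial factors and the
# resulting latent features can be compared directly.
W_adv, H_adv = nmf(X_adv, rank=4)

print("relative input change:  ", relative_change(X, X_adv))
print("relative feature change:", relative_change(W, W_adv))
```

Comparing the two printed ratios gives a rough sense of how much a given input change moves the extracted features. An actual attack would search for the perturbation that maximizes the second quantity under a bound on the first.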

＜Related information＞

LoRID: Low-Rank Iterative Diffusion for Adversarial Purification

Geigh Zollicoffer, Minh Vu, Ben Nebgen, Juan Castorena, Boian Alexandrov, Manish Bhattarai
arXiv Submitted on 12 Sep 2024
DOI:https://doi.org/10.48550/arXiv.2409.08255

Abstract

This work presents an information-theoretic examination of diffusion-based purification methods, the state-of-the-art adversarial defenses that utilize diffusion models to remove malicious perturbations in adversarial examples. By theoretically characterizing the inherent purification errors associated with the Markov-based diffusion purifications, we introduce LoRID, a novel Low-Rank Iterative Diffusion purification method designed to remove adversarial perturbation with low intrinsic purification errors. LoRID centers around a multi-stage purification process that leverages multiple rounds of diffusion-denoising loops at the early time-steps of the diffusion models, and the integration of Tucker decomposition, an extension of matrix factorization, to remove adversarial noise at high-noise regimes. Consequently, LoRID increases the effective diffusion time-steps and overcomes strong adversarial attacks, achieving superior robustness performance in CIFAR-10/100, CelebA-HQ, and ImageNet datasets under both white-box and black-box settings.
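The abstract describes Tucker decomposition as an extension of matrix factorization used to remove noise. As a rough illustration of that one ingredient only, the sketch below implements a truncated higher-order SVD (HOSVD), one common way to compute a Tucker approximation, in NumPy and applies it to a synthetic low-rank tensor with additive Gaussian noise. It is not the LoRID pipeline: there is no diffusion model, the noise is not adversarial, and the helper names, tensor shape, ranks, and noise level are assumptions chosen for the example.

```python
import numpy as np

def unfold(T, mode):
    """Matricize tensor T so that axis `mode` becomes the rows."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T, M, mode):
    """Multiply tensor T by matrix M along axis `mode`."""
    out = np.tensordot(M, T, axes=(1, mode))  # contracted axis comes out first
    return np.moveaxis(out, 0, mode)

def truncated_hosvd(T, ranks):
    """Tucker approximation: keep the top-r left singular vectors per mode."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        core = mode_dot(core, U.T, mode)  # project onto each mode subspace
    return core, factors

def reconstruct(core, factors):
    """Expand a Tucker core and factor matrices back into a full tensor."""
    T = core
    for mode, U in enumerate(factors):
        T = mode_dot(T, U, mode)
    return T

rng = np.random.default_rng(0)
ranks = (3, 3, 2)
true_core = rng.standard_normal(ranks)
true_factors = [rng.standard_normal((dim, r)) for dim, r in zip((20, 20, 6), ranks)]
clean = reconstruct(true_core, true_factors)

# Assumption: Gaussian noise stands in for a perturbation to be removed.
noisy = clean + 0.5 * rng.standard_normal(clean.shape)
denoised = reconstruct(*truncated_hosvd(noisy, ranks))

err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
print("error before:", err_noisy, "error after:", err_denoised)
```

The low-rank projection discards the part of the noise that lies outside the retained mode subspaces, which is why a structured signal can survive while unstructured noise is reduced.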
