Study finds filtered data stops openly-available AI models from performing dangerous tasks

2025-08-12 University of Oxford

A joint study by the University of Oxford, EleutherAI, and the UK AI Security Institute has shown that removing high-risk information, such as content related to biological weapons and bioterrorism, from the training data of open-weight AI models can improve safety while preserving performance. Models at the Pythia-6.9B scale pretrained on data filtered of dangerous knowledge showed a sharply reduced ability to respond to biorisk-related tasks and scored highly on safety evaluations, while performance on general tasks was largely maintained, with only a slight drop in accuracy on advanced biology problems. The additional compute cost was under 1%, making the approach efficient. The study challenges the conventional assumption that all available data must be used, and points to a new direction for developing safe, openly releasable AI.
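The paper describes a multi-stage pipeline for filtering dual-use text out of a pretraining corpus. The snippet below is a minimal illustrative sketch of that idea, not the authors' implementation: the blocklist terms and the `classifier` callable (standing in for a learned biothreat-topic classifier) are hypothetical placeholders.

```python
# Minimal sketch of a two-stage pretraining-data filter (illustrative only;
# blocklist terms and the classifier are placeholders, not from the paper).
from typing import Callable, Iterable, Iterator

# Stage 1: cheap keyword screen that flags candidate documents.
BLOCKLIST = {"select agent", "aerosolization", "toxin synthesis"}  # placeholder terms

def keyword_flagged(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

# Stage 2: a more expensive classifier applied only to flagged documents.
# `classifier` is assumed to return True when a document is biothreat-related.
def filter_corpus(
    docs: Iterable[str],
    classifier: Callable[[str], bool],
) -> Iterator[str]:
    for doc in docs:
        if keyword_flagged(doc) and classifier(doc):
            continue  # drop documents judged to be high-risk
        yield doc  # everything else stays in the pretraining corpus

if __name__ == "__main__":
    corpus = [
        "A recipe for sourdough bread.",
        "Notes on aerosolization of a select agent.",  # would be screened out
    ]
    kept = list(filter_corpus(corpus, classifier=lambda d: True))
    print(len(kept), "documents kept")
```

The two-stage structure keeps the expensive classifier off the vast majority of documents, which is what makes this kind of filtering tractable at pretraining scale.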

<Related information>

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, Stella Biderman
arXiv, submitted on 8 Aug 2025
DOI:https://doi.org/10.48550/arXiv.2508.06601


Abstract

Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks which can efficiently elicit harmful behaviors by modifying weights or activations. Currently, there is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models from scratch and find that they exhibit substantial resistance to adversarial fine-tuning attacks on up to 10,000 steps and 300M tokens of biothreat-related text — outperforming existing post-training baselines by over an order of magnitude — with no observed degradation to unrelated capabilities. However, while filtered models lack internalized dangerous knowledge, we find that they can still leverage such information when it is provided in context (e.g., via search tool augmentation), demonstrating a need for a defense-in-depth approach. Overall, these findings help to establish pretraining data curation as a promising layer of defense for open-weight AI systems.
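The abstract's tamper-resistance claim is tested by adversarially fine-tuning the filtered models on biothreat-related text for up to 10,000 steps. A rough sketch of such a probe is shown below; the model name, attack corpus, hyperparameters, and downstream evaluation are placeholders and do not reproduce the paper's actual protocol.

```python
# Sketch of an adversarial fine-tuning (tamper-resistance) probe for a
# Hugging Face causal LM. All names and hyperparameters are illustrative.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

def adversarial_finetune(model_name: str, attack_texts: list[str],
                         max_steps: int = 10_000, lr: float = 1e-5):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # Pythia-style tokenizers lack a pad token
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    enc = tokenizer(attack_texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=512)
    loader = DataLoader(list(zip(enc["input_ids"], enc["attention_mask"])),
                        batch_size=4, shuffle=True)

    step = 0
    while step < max_steps:
        for input_ids, attention_mask in loader:
            # Standard causal-LM loss on the attack corpus: the attacker tries
            # to reintroduce the filtered knowledge through further training.
            labels = input_ids.clone()
            labels[attention_mask == 0] = -100  # ignore padding in the loss
            out = model(input_ids=input_ids, attention_mask=attention_mask,
                        labels=labels)
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step >= max_steps:
                break
    return model

# After the attack, a biothreat proxy-knowledge benchmark would be re-run on the
# returned model to check whether the filtered capability has been recovered.
```

In the paper's framing, a safeguard is tamper-resistant if proxy-knowledge scores stay low even after this kind of attack; the filtered models reportedly withstand attacks an order of magnitude longer than post-training baselines.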
