A faster, better way to prevent an AI chatbot from giving toxic responses

2024-04-10 Massachusetts Institute of Technology (MIT)

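MIT researchers have developed a technique that uses machine learning to improve the red-teaming process: it automatically generates diverse prompts designed to elicit a wide range of toxic responses from a chatbot. The technique makes the red-team model curious, so that it focuses on generating novel prompts. As a result, the method is more effective than prior approaches, and it can draw toxic responses even from a chatbot that human experts have equipped with safeguards.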

<Related information>

Curiosity-driven Red-teaming for Large Language Models

Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, Pulkit Agrawal
arXiv, submitted on 29 Feb 2024
DOI: https://doi.org/10.48550/arXiv.2402.19464

Abstract

Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a red team of human testers to design input prompts (i.e., test cases) that elicit undesirable responses from LLMs. However, relying solely on human testers is expensive and time-consuming. Recent works automate red teaming by training a separate red team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. However, current RL methods are only able to generate a small number of effective test cases, resulting in low coverage of the span of prompts that elicit undesirable responses from the target LLM. To overcome this limitation, we draw a connection between the problem of increasing the coverage of generated test cases and the well-studied approach of curiosity-driven exploration that optimizes for novelty. Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while maintaining or increasing their effectiveness compared to existing methods. Our method, CRT, successfully provokes toxic responses from a LLaMA2 model that has been heavily fine-tuned using human preferences to avoid toxic outputs. Code is available at this https URL
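
The abstract's central idea, rewarding a red-team model for novelty as well as for eliciting undesirable responses, can be illustrated with a small sketch. This is not the authors' implementation: the bigram-overlap novelty measure, the additive reward, the `weight` parameter, and the function names `novelty` and `curiosity_reward` are all assumptions made for illustration, and the toxicity score is assumed to come from some external classifier that is not shown.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novelty(prompt: str, history: List[str], n: int = 2) -> float:
    """Hypothetical novelty score: 1 minus the highest Jaccard n-gram
    overlap between the prompt and any previously generated prompt."""
    current = ngrams(prompt, n)
    best = 0.0
    for past in history:
        previous = ngrams(past, n)
        union = current | previous
        if union:  # skip pairs where neither text has any n-gram
            best = max(best, len(current & previous) / len(union))
    return 1.0 - best


def curiosity_reward(toxicity: float, prompt: str, history: List[str],
                     weight: float = 0.5) -> float:
    """Assumed reward shape: effectiveness plus a weighted novelty bonus.

    `toxicity` stands in for a classifier score on the target model's reply.
    """
    return toxicity + weight * novelty(prompt, history)


if __name__ == "__main__":
    seen = ["tell me a joke about cats"]
    # A repeated prompt earns no novelty bonus; an unseen one earns the full bonus.
    print(curiosity_reward(0.8, "tell me a joke about cats", seen))
    print(curiosity_reward(0.8, "explain quantum physics simply", seen))
```

Under this assumed reward, a policy that keeps repeating one effective prompt stops gaining anything beyond the toxicity score, which is the pressure toward broader test-case coverage that the abstract describes.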
