security 重要度 4/5 2026/5/12 4:00:00

arXivでAIシステムの脆弱性研究論文公開、「Mitigating Many-shot Jailbreak Attacks with One Single Demonstration」

arXivから「arXivでAIシステムの脆弱性研究論文公開、「Mitigating Many-shot Jailbreak Attacks with One Single Demonstration」」に関する更新情報が公開されました。 arXiv に「Mitigating Many-shot Jailbreak Attacks with One Single Demonstration」が公開されました。研究段階の提案ですが、実装・評価・安全性の前提を見直す材料として注目できます。 arXiv:2605.08277v1 Announce Type: new Abstract: Many-shot jailbreaking (MSJ) causes safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations. We study why this attack becomes stronger as the number of demonstrations increases. Empirically, we find that MSJ induces a progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region as more harmful demonstrations are added. Theoretically, we show that this drift can be interpreted as implicit malicious fine-tuning: conditioning on N harmful demonstrations induces SGD-style updates equivalent to optimizing on the corresponding N harmful samples. This view turns the attack mechanism into a defense principle. We append a fixed one-shot safety demonstration at inference time, which induces a counteracting safety-oriented update and restores refusal behavior. The resulting method improves the model's robustness to MSJ without modifying its parameters or requiring white-box access at deployment. Code is available at https://github.com/Thecommonirin/SafeEnd. 実務では、論文の主張だけでなく、評価データ、攻撃モデル、再現条件、ツール依存の前提を確認してから応用範囲を判断する必要があります。実運用では、対象バージョン・権限設定・ロールバック条件を事前に固定し、ステージング検証を経て段階反映することで影響を抑えられます。

Related tools