devops Importance 4/5 2026/4/30 4:00:00

AI evaluation and reliability research paper published on arXiv: "Operating-Layer Controls for Onchain Language-Model Agents Under Real C…"

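"Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital" has been published on arXiv. The proposal is still at the research stage, but it is worth attention as material for revisiting assumptions about implementation, evaluation, and safety.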

arXiv:2604.26091v1

Abstract: We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test populati…
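To make the "typed controls" and "policy validation" ideas in the abstract more concrete, here is a minimal Python sketch of how a proposed trade might be checked against user-set limits before submission. It is an illustration under stated assumptions, not the paper's or DX Terminal Pro's implementation: the names VaultControls, ProposedAction, and validate, the fields they carry, and the violation labels are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Side(Enum):
    BUY = "buy"
    SELL = "sell"


@dataclass(frozen=True)
class VaultControls:
    """Hypothetical typed controls a user might set on a vault."""
    max_trade_eth: float       # cap on the size of any single trade
    allowed_tokens: frozenset  # tokens the agent is allowed to trade
    min_reserve_eth: float     # ETH that must remain after a buy


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical trade parsed from the model's output."""
    side: Side
    token: str
    amount_eth: float


def validate(action, controls, eth_balance):
    """Return a list of policy violations; an empty list means the action passes."""
    violations = []
    if action.token not in controls.allowed_tokens:
        violations.append("token_not_allowed")
    if action.amount_eth <= 0:
        violations.append("non_positive_amount")
    if action.amount_eth > controls.max_trade_eth:
        violations.append("exceeds_max_trade")
    # Reserve check applies to buys only (assumption: sells do not spend ETH).
    if (action.side is Side.BUY
            and eth_balance - action.amount_eth < controls.min_reserve_eth):
        violations.append("breaches_reserve")
    return violations


if __name__ == "__main__":
    controls = VaultControls(
        max_trade_eth=1.0,
        allowed_tokens=frozenset({"AAA"}),
        min_reserve_eth=0.5,
    )
    action = ProposedAction(Side.BUY, "BBB", 2.0)
    print(validate(action, controls, eth_balance=2.2))
```

Returning every violation rather than stopping at the first one is a design choice made for this sketch: it keeps each rejected action fully explained, which fits the abstract's emphasis on trace-level observability.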

Source: arXiv

This article is a brief summary of the key points. See the source for details.
