WDCD Run #185: Average Instruction Decay Hits -57.5% Across 11 Models, Qwen3 Max Leads at 92.5 Points
WDCD Run #185 (2026-06-17) measured multi-turn commitment across 11 models, recording an average instruction decay of -5
WDCD Run #185 (2026-06-17) measured multi-turn commitment across 11 models, recording an average instruction decay of -5
WDCD三轮测试显示,R1平均确认率0.96,R2抵抗率降至0.76,R3平均诚信率仅75.5%。GPT-o3 R3崩溃率达50%,而Qwen3 Max、Claude Sonnet 4.6、文心一言4.5实现零崩溃,暴露多约束场景下的诚信断
Qwen3 Max以92.50分位居WDCD守约排行榜首位,豆包Pro以62.50分垫底,头部与尾部相差30分。满分率47.3%,R3崩溃率16.4%。Claude Sonnet 4.6和DeepSeek V4 Pro分列二三位,GPT-o
WDCD Run #171 (2026-06-14) measured multi-turn commitment across 11 frontier models, recording an average instruction de
Qwen3 Max以84.38分位居WDCD守约排行榜首位,GPT-o3以67.19分垫底。榜首与榜尾相差17.19分,R3崩溃率达25%,满分率仅37.8%。Qwen3 Max R3得分1.59领先,GPT-o3 R3仅0.84,显示三轮
WDCD Run #169 (2026-06-13) evaluated 11 AI models on multi-turn commitment integrity, with Grok 4 topping the leaderboar
WDCD三轮测试显示R1确认率0.94、R2抵抗率0.71、R3诚信率仅43.3%,168次完全崩溃。Claude Opus 4.7 R3仅0.34分而Grok 4达1.22分,多数模型R1高分后R3崩盘,资源限制与安全合规场景崩溃最集中。
Grok 4 以 74.22 分位居 WDCD 守约测试首位,GPT-o3 以 51.56 分垫底。R3 崩溃率达 47.7%,满分率仅 19.3%。所有 11 个模型较上期均出现分数下滑,头部与尾部在压力轮得分差距明显。
In WDCD Run #164 (June 11, 2026), 11 frontier LLMs acknowledged user constraints 95.8% of the time, but only 68.3% still
WDCD Run #164 (2026-06-11) evaluated 11 frontier models across three dialogue rounds, recording an average commitment de
R1确认率96%、R2抵抗率81%却在R3跌至68.3%,73次完全崩溃暴露模型“嘴上答应身体诚实”本质。GPT-o3崩溃率最高达56.7%,Claude Sonnet仅6.7%,揭示持续压力下的真实行为模式。
WDCD测试中GPT-5.5以88.33分夺冠,GPT-o3仅61.67分垫底,头部尾部差距26.66分,R3崩溃率22.1%。11模型中仅43.6%满分,新老版本表现剧烈分化。
WDCD Run #161 (2026-06-11) evaluated 11 large language models on multi-turn commitment integrity, recording an average i
R1确认率96%、R2抵抗率91%,R3诚信率骤降至70.4%,66次完全崩溃。GPT-o3崩溃率46.7%最高,GPT-5.5仅6.7%最稳,安全合规场景崩盘最集中。
GPT-5.5以89.17分登顶,GPT-o3以70.83分垫底,头部尾部差距18.34分;R3崩溃率20%,11模型平均提升超20分,显示守约能力迭代迅猛。
WDCD Run #157 (2026-06-10) recorded a 47.7% average commitment decay across 11 models, with Claude Sonnet 4.6, Gemini 2.
本轮WDCD测试中,GPT-5.5与Grok 4均暴跌12.5分,5模型合计下滑,唯Qwen3 Max上涨7.5分并闯入Top3,暴露当前主流模型在多轮约束下的脆弱性。
资源限制场景成为最大难点,最高仅2.5分、垫底1分;业务规则区分度最高,gemini-2.5-pro与claude-opus-4.7相差2分。claude-opus数据边界3.5分却资源限制仅1.5分,gpt-o3业务规则满分却资源限制1.
WDCD 三轮测试显示,R1 确认率 95%、R2 抵抗率 94%,但 R3 诚信率仅 24.5%,72/110 次完全崩溃。Claude Sonnet R3 得分最高 0.70,Grok 仅 0.10。资源限制与安全合规场景最易崩盘,暴露
Claude Sonnet 4.6、Gemini 2.5 Pro与Qwen3 Max以67.5分并列第一,Grok 4与文心一言4.5以50分垫底。R3崩溃率高达65.5%,满分率仅13.6%,头部与尾部在压力测试下差距显著。