WDCD Run #164: Average Instruction Decay Hits -44.3% Across 11 Frontier Models

2026年6月11日 356 约6分钟 Winzheng Research Lab

WDCD AI benchmark instruction decay multi-turn commitment test

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' commitment to user instructions degrades over multi-turn dialogue. In Run #164, completed on 2026-06-11, 11 models were evaluated across three rounds, producing an average instruction decay of -44.3% from Round 1 to Round 3.

WDCD's three-round protocol tests instruction acknowledgment (R1), distractor resistance after injecting 2,000–5,000-word professional documents (R2), and final constraint integrity (R3). All scoring is 100% rule-based, with zero AI judges in the loop. The 30-question set spans five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering.

Leaderboard — Top 3:

GPT-5.5 — 88.3 pts (decay: -67%)
Gemini 3.1 Pro — 87.5 pts (decay: -60%)
Claude Sonnet 4.6 — 83.3 pts (decay: -57.7%)

The top three models posted the highest absolute scores but, notably, all three also exhibited substantial decay between rounds — confirming a pattern seen in prior runs: peak capability does not guarantee multi-turn commitment. Higher initial compliance often leaves more room to fall once distractor documents are introduced in R2.

Decay extremes:

Worst decay: GPT-o3 at -24.7%, indicating significant erosion of instruction adherence by R3 despite a stable R1 baseline.
Best decay resistance: 豆包 Pro at -110%, the strongest resistance metric recorded in this run, meaning its R3 constraint integrity actually exceeded its R1 baseline under the WDCD scoring formula.

The -44.3% run-wide average reinforces a structural finding from previous WDCD runs: instruction decay is not a tail-risk phenomenon but a baseline characteristic of current frontier models. When long professional documents are introduced mid-conversation, the majority of tested models partially or fully release earlier constraints — even when those constraints were explicitly acknowledged in R1.

Compared to prior runs, the relative ordering at the top remains consistent, but the gap between raw capability scores and decay-resistance scores continues to widen. This suggests that multi-turn commitment is emerging as a distinct evaluation axis, not a derivative of single-turn capability. Models optimized for benchmark accuracy do not automatically inherit constraint persistence.

For scenario-level breakdowns, scoring rubrics, and the full R1/R2/R3 protocol specification, see the WDCD methodology document: https://www.winzheng.com/yz-index/methodology

Raw run data is available via the WDCD data API: https://www.winzheng.com/yz-index/api/v1/dcd

WDCD Run #164: Average Instruction Decay Hits -44.3% Across 11 Frontier Models

相关文章