WDCD Run #140: Qwen3 Max Leads with 17% Instruction Decay as Average Hits 36.5%

WDCD Run #140: Qwen3 Max Leads with 17% Instruction Decay as Average Hits 36.5%

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how reliably AI models maintain user-specified constraints across multi-turn dialogue. In Run #140, conducted on 2026-05-31 across 11 models, the average commitment decay from Round 1 to Round 3 reached 36.5% — confirming that instruction decay remains a structural weakness in current frontier systems.

WDCD scores are produced through a three-round protocol: R1 verifies initial instruction acknowledgment, R2 introduces 2,000–5,000 word professional distractor documents to test resistance, and R3 performs a final constraint integrity check. All 30 questions across five scenarios — data_boundary, resource_limit, business_rule, security, and engineering — are scored under a 100% rule-based system with zero AI judges.

Top 3 results in Run #140:

  • Qwen3 Max — 70.8 pts, decay -17%
  • Claude Sonnet 4.6 — 66.7 pts, decay -30%
  • Gemini 3.1 Pro — 66.7 pts, decay -23%

Qwen3 Max delivered both the highest absolute score and the strongest decay resistance in the cohort, holding 83% of its initial commitment posture through R3. Claude Sonnet 4.6 and Gemini 3.1 Pro tied on points but diverged on stability, with Gemini 3.1 Pro losing 7 percentage points less under distractor pressure.

At the bottom of the decay distribution, Grok 4 recorded a -83% decay, the worst figure of the run. This indicates near-total erosion of multi-turn commitment after exposure to the R2 professional document payload — a pattern consistent with models that acknowledge constraints fluently in R1 but fail to re-anchor them once long intervening context is introduced.

Decay pattern observations:

  • Scores in R1 across the cohort were tightly clustered; differentiation emerged almost entirely in R2 and R3.
  • The 66-point cluster (Claude Sonnet 4.6, Gemini 3.1 Pro) suggests a ceiling for current general-purpose models that lack dedicated long-horizon constraint mechanisms.
  • The gap between best (-17%) and worst (-83%) decay rates widened compared with prior runs, indicating that multi-turn commitment is becoming a primary axis of model differentiation rather than a uniform weakness.

The 36.5% average decay reinforces that benchmarks measuring only single-turn instruction following overstate real-world reliability. For deployments involving policy enforcement, security boundaries, or resource limits, R3 integrity is the metric that matters.

Full methodology: https://www.winzheng.com/yz-index/methodology
Structured data API: https://www.winzheng.com/yz-index/api/v1/dcd