The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' adherence to user instructions deteriorates across multi-turn dialogue. In Run #202, completed on 2026-06-28, the 11 evaluated models exhibited an average instruction decay of -73.2% between Round 1 and Round 3.
WDCD uses a three-round protocol: R1 captures initial instruction acknowledgment, R2 tests distractor resistance after the model processes professional documents of 2,000–5,000 words, and R3 performs a final constraint integrity check. Scoring is 100% rule-based with zero AI judges, applied across 30 questions spanning five real-world scenario classes: data_boundary, resource_limit, business_rule, security, and engineering.
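To make "rule-based with zero AI judges" concrete, here is a minimal sketch of what a deterministic check for one question could look like. Everything in it is an assumption made for illustration: the `Question` structure, the `max_rows_rule` helper, the sample instruction, and the sample replies are hypothetical and are not taken from the WDCD rule set. Only the five scenario class names come from the description above.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Scenario class names as listed in the benchmark description.
SCENARIO_CLASSES = (
    "data_boundary", "resource_limit", "business_rule", "security", "engineering",
)


@dataclass
class Question:
    """Hypothetical container for one benchmark question (not the real schema)."""
    scenario: str                 # one of SCENARIO_CLASSES
    instruction: str              # constraint the user states in R1
    rule: Callable[[str], bool]   # deterministic pass/fail check, no AI judge


def max_rows_rule(limit: int) -> Callable[[str], bool]:
    """Hypothetical rule: the reply must contain at least one 'LIMIT n'
    clause, and every such n must be <= limit."""
    pattern = re.compile(r"\bLIMIT\s+(\d+)\b", re.IGNORECASE)

    def check(reply: str) -> bool:
        found = [int(n) for n in pattern.findall(reply)]
        return bool(found) and all(n <= limit for n in found)

    return check


question = Question(
    scenario="resource_limit",
    instruction="Never return more than 100 rows from any query.",
    rule=max_rows_rule(100),
)

# Invented replies: R1 acknowledges the constraint, R3 violates it
# after the long distractor document in R2.
r1_reply = "Understood. Every query will be capped: SELECT * FROM orders LIMIT 100"
r3_reply = "Here is the full export: SELECT * FROM orders LIMIT 5000"

print(question.rule(r1_reply))  # True
print(question.rule(r3_reply))  # False
```

Because the check is a plain function of the reply text, the same reply always receives the same verdict, which is the property a rule-based protocol relies on.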
Top 3 Results (Run #202):
- Gemini 3.1 Pro — 93.6 pts, -77% decay
- Grok 4 — 92.9 pts, -83% decay
- Claude Opus 4.7 — 89.3 pts, -69% decay
Among the leaders, Claude Opus 4.7 posted the lowest decay rate of the top three at -69%, indicating stronger multi-turn commitment retention relative to peers at similar score levels. Grok 4 achieved a near-tie on raw points despite the highest decay of the top group, suggesting a strong R1 baseline that absorbed sharper R3 erosion.
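The comparison above can be reproduced from the three reported figures. The sketch below uses only the scores and decay percentages listed for the top three; the tuple layout and variable names are illustrative choices, not part of any WDCD tooling.

```python
# (model, score in points, decay in percent) as reported for Run #202.
top_three = [
    ("Gemini 3.1 Pro", 93.6, -77),
    ("Grok 4", 92.9, -83),
    ("Claude Opus 4.7", 89.3, -69),
]

# Rank by absolute score, highest first.
by_score = sorted(top_three, key=lambda row: row[1], reverse=True)

# Rank by decay magnitude, smallest drop first.
by_decay = sorted(top_three, key=lambda row: abs(row[2]))

# Point gap between first and second place on raw score.
score_gap = round(by_score[0][1] - by_score[1][1], 1)

print([name for name, _, _ in by_score])
# ['Gemini 3.1 Pro', 'Grok 4', 'Claude Opus 4.7']
print([name for name, _, _ in by_decay])
# ['Claude Opus 4.7', 'Gemini 3.1 Pro', 'Grok 4']
print(score_gap)
# 0.7
```

The two orderings differ: Claude Opus 4.7 is third on points but first on decay, and Grok 4 is within 0.7 points of the lead while showing the largest decay of the three.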
Decay Pattern Highlights:
- Highest decay magnitude: 豆包 Pro (Doubao Pro) at -147%. The figure, which exceeds 100% in magnitude, indicates that the model's R3 compliance dropped substantially further than the cohort average of -73.2%, making it the weakest model in this run for retaining constraints under distractor pressure.
- Lowest decay magnitude: GPT-o3 at -34%, the smallest drop in the cohort, indicating the most stable multi-turn commitment profile this run despite not ranking in the top three on absolute score.
The cohort-wide -73.2% average is consistent with a pattern observed across recent WDCD runs: most frontier models acknowledge constraints clearly in R1, but commitment integrity erodes substantially once long professional documents are inserted between the instruction and the final check. The gap between the absolute score leaders (Gemini 3.1 Pro, Grok 4) and the decay-resistance leader (GPT-o3) reinforces that high R1 scores do not guarantee durable multi-turn commitment.
Full methodology, scenario definitions, and rule-based scoring logic: https://www.winzheng.com/yz-index/methodology
Structured run data is available via the public API: https://www.winzheng.com/yz-index/api/v1/dcd
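A minimal sketch of retrieving the run data with the Python standard library follows. The response schema is not described here, so the sketch assumes only that the endpoint returns UTF-8 JSON and does not rely on any field names. The function names `fetch_json` and `main` are hypothetical helpers, not part of a published client.

```python
import json
from urllib.request import urlopen

# Endpoint as given above.
WDCD_API_URL = "https://www.winzheng.com/yz-index/api/v1/dcd"


def fetch_json(url: str = WDCD_API_URL, timeout: float = 10.0):
    """Fetch a URL and parse the body as JSON.

    Assumption: the server responds with a UTF-8 encoded JSON document.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def main() -> None:
    """Download the run data and print a short preview of its structure."""
    data = fetch_json()
    print(type(data).__name__)
    # Preview only the first 500 characters, since the payload size is unknown.
    print(json.dumps(data, ensure_ascii=False, indent=2)[:500])
```

Calling `main()` performs a live network request. Inspecting the printed preview first is a reasonable way to learn the actual field names before writing any code that depends on them.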
© 2026 Winzheng.com 赢政天下 | Reprints must credit the source and include a link to the original article.