The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' commitment to user instructions decays over multi-turn dialogue. In Run #196, conducted on 2026-06-24, the average instruction decay across 11 evaluated models reached -39.9% between Round 1 and Round 3.
WDCD uses a three-round structure: R1 verifies initial instruction acknowledgment, R2 tests distractor resistance by injecting professional documents of 2,000–5,000 words, and R3 performs a final constraint integrity check. Scoring is 100% rule-based, with no AI judges, and covers 30 questions across five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering.
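To make the idea of rule-based, judge-free scoring concrete, here is a minimal sketch of how one constraint could be checked deterministically in each of the three rounds. It is not WDCD's actual harness: the `max_rows_rule` constraint, the `RoundResult` type, the `score_dialogue` helper, and the sample replies are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoundResult:
    round_name: str
    passed: bool


def max_rows_rule(limit: int) -> Callable[[str], bool]:
    """Hypothetical resource_limit rule: every SQL LIMIT in the reply must be <= limit."""
    def check(reply: str) -> bool:
        values = [int(v) for v in re.findall(r"\bLIMIT\s+(\d+)", reply, flags=re.IGNORECASE)]
        # A reply with no LIMIT clause at all also violates the constraint.
        return bool(values) and all(v <= limit for v in values)
    return check


def score_dialogue(replies: dict[str, str], rule: Callable[[str], bool]) -> list[RoundResult]:
    """Apply the same deterministic rule to the R1, R2, and R3 replies (no model-based judging)."""
    return [RoundResult(name, rule(replies[name])) for name in ("R1", "R2", "R3")]


# Hypothetical replies: the constraint holds in R1 and R2 but is dropped in R3.
replies = {
    "R1": "Understood. SELECT id FROM orders LIMIT 100",
    "R2": "Here is the document summary. SELECT id FROM orders LIMIT 100",
    "R3": "SELECT * FROM orders LIMIT 5000",
}
results = score_dialogue(replies, max_rows_rule(100))
for r in results:
    print(r.round_name, "pass" if r.passed else "fail")
```

Because the check is a plain function of the reply text, the same reply always receives the same verdict, which is the property a rule-based scorer relies on.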
Run #196 Top 3:
- Qwen3 Max — 92.5 pts (decay: -90%)
- Gemini 3.1 Pro — 87.5 pts (decay: -70%)
- Grok 4 — 82.5 pts (decay: -30%)
The leaderboard reveals a notable tension between absolute score and decay resistance. Qwen3 Max retained the top position on cumulative points, but its -90% decay curve indicates that most of its scoring strength was concentrated in Round 1, with sharp degradation under distractor pressure and final constraint checks. Gemini 3.1 Pro followed a similar but less severe trajectory at -70%, while Grok 4 demonstrated a flatter multi-turn commitment profile at -30%.
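WDCD's exact decay formula is defined in the methodology and is not reproduced here. As a rough sketch, assuming decay is the relative change in score from R1 to R3 (so a negative value means compliance was lost), the calculation could look like the following. The function name `decay_pct` and the round scores are hypothetical; they are picked only to reproduce readings of -90% and -30%, and are not the models' published per-round scores.

```python
def decay_pct(r1_score: float, r3_score: float) -> float:
    """Relative change from the R1 baseline, in percent (assumed convention: negative = decline)."""
    if r1_score == 0:
        raise ValueError("R1 baseline must be non-zero")
    return (r3_score - r1_score) / r1_score * 100.0


# Hypothetical round scores, not taken from Run #196.
steep = round(decay_pct(10, 1), 1)  # most of the R1 score is lost by R3
flat = round(decay_pct(10, 7), 1)   # a flatter multi-turn profile

print(steep)  # -90.0
print(flat)   # -30.0
```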
Decay extremes: Within the lower-scoring cohort of the 11 models tested, GPT-o3 registered the worst decay at -30% relative to its own R1 baseline, meaning it had lost nearly a third of its R1 instruction compliance by R3. In contrast, 豆包 Pro (Doubao Pro) recorded the best decay resistance in this run at -166.7%. Under WDCD's formula, this negative-decay reading indicates that the model improved its constraint adherence between R1 and R3, a rare pattern typically associated with models that under-commit in early rounds but stabilize once context is fully loaded.
The gap between top-ranked Qwen3 Max and third-placed Grok 4 (10 points) is narrower than the gap in decay magnitude (60 percentage points), reinforcing a consistent WDCD finding: high R1 scores do not predict R3 integrity. Models optimized for single-turn instruction following continue to show vulnerability once professional-length distractor documents are introduced in R2.
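The comparison above can be reproduced directly from the published Top 3 figures. The short sketch below uses only the scores and decay values from the Run #196 leaderboard; the variable names are illustrative.

```python
# (model, points, decay %) as listed in the Run #196 Top 3.
top3 = [
    ("Qwen3 Max", 92.5, -90.0),
    ("Gemini 3.1 Pro", 87.5, -70.0),
    ("Grok 4", 82.5, -30.0),
]

# Gap between first and third place, in points and in decay magnitude.
score_gap = top3[0][1] - top3[2][1]
decay_gap = abs(top3[0][2]) - abs(top3[2][2])

# Rank once by points (highest first) and once by decay magnitude (smallest first).
by_score = [name for name, _, _ in sorted(top3, key=lambda row: row[1], reverse=True)]
by_decay = [name for name, _, _ in sorted(top3, key=lambda row: abs(row[2]))]

print(score_gap)  # 10.0
print(decay_gap)  # 60.0
print(by_score)   # ['Qwen3 Max', 'Gemini 3.1 Pro', 'Grok 4']
print(by_decay)   # ['Grok 4', 'Gemini 3.1 Pro', 'Qwen3 Max']
```

Sorting by decay magnitude reverses the points ranking exactly for these three models, which is the tension between absolute score and decay resistance in its simplest form.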
Full methodology: https://www.winzheng.com/yz-index/methodology
Raw data API: https://www.winzheng.com/yz-index/api/v1/dcd
© 2026 Winzheng.com 赢政天下 | When republishing, please credit the source and include a link to the original article.