WDCD Run #202: Average Instruction Decay Hits -73.2% Across 11 Models, Gemini 3.1 Pro Leads

2026年6月28日 25 约5分钟 Winzheng Research Lab

WDCD AI benchmark instruction decay multi-turn commitment test

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' adherence to user instructions deteriorates across multi-turn dialogue. In Run #202, completed on 2026-06-28, the 11 evaluated models exhibited an average instruction decay of -73.2% between Round 1 and Round 3.

WDCD uses a three-round protocol: R1 captures initial instruction acknowledgment, R2 tests distractor resistance after the model processes professional documents of 2,000–5,000 words, and R3 performs a final constraint integrity check. Scoring is 100% rule-based with zero AI judges, applied across 30 questions spanning five real-world scenario classes: data_boundary, resource_limit, business_rule, security, and engineering.

Top 3 Results (Run #202):

Gemini 3.1 Pro — 93.6 pts, -77% decay
Grok 4 — 92.9 pts, -83% decay
Claude Opus 4.7 — 89.3 pts, -69% decay

Among the leaders, Claude Opus 4.7 posted the lowest decay rate of the top three at -69%, indicating stronger multi-turn commitment retention relative to peers at similar score levels. Grok 4 achieved a near-tie on raw points despite the highest decay of the top group, suggesting a strong R1 baseline that absorbed sharper R3 erosion.

Decay Pattern Highlights:

Best decay resistance: 豆包 Pro at -147%. The negative-beyond-baseline figure indicates the model's R3 compliance dropped substantially further than the cohort average — making it the weakest model in this run for retaining constraints under distractor pressure.
Lowest decay magnitude: GPT-o3 at -34%, the smallest drop in the cohort, indicating the most stable multi-turn commitment profile this run despite not ranking in the top three on absolute score.

The cohort-wide -73.2% average confirms a persistent pattern observed across recent WDCD runs: most frontier models acknowledge constraints clearly in R1, but commitment integrity erodes substantially once long professional documents are inserted between the instruction and the final check. The gap between absolute score leaders (Gemini 3.1 Pro, Grok 4) and decay-resistance leaders (GPT-o3) reinforces that high R1 scores do not guarantee durable multi-turn commitment.

Full methodology, scenario definitions, and rule-based scoring logic: https://www.winzheng.com/yz-index/methodology

Structured run data is available via the public API: https://www.winzheng.com/yz-index/api/v1/dcd

WDCD Run #202: Average Instruction Decay Hits -73.2% Across 11 Models, Gemini 3.1 Pro Leads

相关文章