WDCD Run #171: Average Instruction Decay Hits -37.9% Across 11 Models, Qwen3 Max Leads Despite Steep Drop

June 14, 2026 · approx. 5 min read · Winzheng Research Lab

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how an AI model's commitment to user instructions degrades over a multi-turn dialogue. Scoring is 100% rule-based, with no AI judges. Run #171, dated 2026-06-14, evaluated eleven models on 30 questions spanning five real-world scenarios (data_boundary, resource_limit, business_rule, security, and engineering) and produced an average instruction decay of -37.9% from Round 1 to Round 3.

Each WDCD run follows a fixed three-round protocol: R1 captures initial instruction acknowledgment, R2 tests distractor resistance after the model processes 2,000–5,000 word professional documents, and R3 performs a final constraint integrity check. The gap between R1 and R3 quantifies multi-turn commitment loss.
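As a rough illustration of how the R1-to-R3 gap might be expressed as a percentage, the sketch below assumes decay is the relative change from the R1 score to the R3 score. That formula, the function name `decay_percent`, and the sample scores are assumptions made for illustration; the authoritative definition is the one in the WDCD methodology.

```python
def decay_percent(r1_score: float, r3_score: float) -> float:
    """Relative change from R1 to R3, in percent (assumed definition)."""
    if r1_score == 0:
        raise ValueError("R1 score must be non-zero to compute a relative change")
    return (r3_score - r1_score) / r1_score * 100.0


# Hypothetical scores, not taken from Run #171.
print(round(decay_percent(90.0, 45.0), 1))  # -50.0
```

Under this assumed definition, a model that keeps its R1 score through R3 has 0% decay, and more negative values mean more commitment was lost.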

Top performers in Run #171:

Qwen3 Max — 84.4 points, -59% decay
Grok 4 — 82.0 points, -44% decay
Gemini 3.1 Pro — 79.7 points, -47% decay

Qwen3 Max retains the top score despite one of the steeper decay curves in the cohort, indicating that its R1 baseline was high enough to absorb significant constraint loss in later rounds. Grok 4 and Gemini 3.1 Pro both showed milder decay than the leader (-44% and -47% versus -59%) while finishing 2.4 and 4.7 points behind the top score, respectively.

Decay distribution: GPT-o3 showed the smallest decay at -16%, the strongest decay resistance of any model in this run. At the opposite end, Doubao Pro (豆包 Pro) recorded the weakest decay resistance at -112.7%. A decay beyond -100% means its commitment score dropped between R1 and R3 by more than its entire R1 baseline, a pattern typically associated with constraint inversion after distractor exposure rather than partial forgetting.
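To make the two extremes concrete, the sketch below converts a reported decay percentage into the share of the R1 score still present at R3. It assumes decay is reported as a change relative to the R1 score, which the article does not state explicitly, and `retained_share` is a hypothetical helper name.

```python
def retained_share(decay_pct: float) -> float:
    """Share of the R1 score left at R3, assuming decay is relative to R1."""
    return round(1.0 + decay_pct / 100.0, 3)


# Decay figures as reported for Run #171.
for model, decay in [("GPT-o3", -16.0), ("Doubao Pro", -112.7)]:
    print(model, retained_share(decay))
# GPT-o3 0.84
# Doubao Pro -0.127
```

Under this assumption, any decay below -100% maps to a negative share, which is consistent with reading the Doubao Pro result as inversion rather than partial forgetting.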

The -37.9% cohort average confirms a recurring WDCD finding: instruction decay is not primarily a function of model size or headline score, but of how constraint state is preserved when long professional documents are injected mid-dialogue. Models with strong R1 acknowledgment can still lose more than half of their adherence by R3, while models with modest R1 scores sometimes hold constraints more reliably.

Run #171 reinforces that ranking by raw score and ranking by multi-turn commitment stability produce different leaderboards. GPT-o3's -16% decay makes it the most stable model in this run, even though it does not appear in the top three by total score.
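The divergence between the two rankings is visible even within the three top scorers. The sketch below sorts them once by total score and once by decay, using only the figures reported above; GPT-o3 is left out because its total score is not given here, and the field names are illustrative rather than taken from the WDCD data format.

```python
# Scores and decay percentages as reported for the top three in Run #171.
results = [
    {"model": "Qwen3 Max", "score": 84.4, "decay_pct": -59.0},
    {"model": "Grok 4", "score": 82.0, "decay_pct": -44.0},
    {"model": "Gemini 3.1 Pro", "score": 79.7, "decay_pct": -47.0},
]

# Higher score ranks first.
by_score = [r["model"] for r in sorted(results, key=lambda r: r["score"], reverse=True)]

# Decay closest to zero (least commitment lost) ranks first.
by_stability = [r["model"] for r in sorted(results, key=lambda r: r["decay_pct"], reverse=True)]

print(by_score)      # ['Qwen3 Max', 'Grok 4', 'Gemini 3.1 Pro']
print(by_stability)  # ['Grok 4', 'Gemini 3.1 Pro', 'Qwen3 Max']
```

The score leader moves to last place once the same three models are ordered by stability.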

Full methodology: https://www.winzheng.com/yz-index/methodology
Machine-readable data: https://www.winzheng.com/yz-index/api/v1/dcd
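A minimal sketch for pulling the machine-readable data with the Python standard library follows. It assumes the endpoint answers a plain GET with a UTF-8 JSON body; the response format and schema are not described in this article, so the function returns the parsed payload as-is, and `fetch_dcd` is a hypothetical helper name.

```python
import json
import urllib.request

DCD_URL = "https://www.winzheng.com/yz-index/api/v1/dcd"


def fetch_dcd(url: str = DCD_URL, opener=urllib.request.urlopen, timeout: float = 10.0):
    """Fetch the WDCD data and parse it as JSON (assumed response format).

    `opener` is injectable so the function can be exercised without network access.
    """
    with opener(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (requires network access):
# data = fetch_dcd()
```

Inspect the returned structure before relying on any particular field, since the schema is not documented here.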
