WDCD Run #169: Grok 4 Leads Multi-Turn Commitment Test as Average Instruction Decay Drops to 4.5%

June 14, 2026 | approx. 6 min read | Winzheng Research Lab


The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' commitment to user instructions decays over multi-turn dialogue, using 100% rule-based scoring and zero AI judges across 30 questions in five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering. In Run #169, conducted on 2026-06-13, the average instruction decay across 11 tested models from Round 1 to Round 3 settled at 4.5%.
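To make "rule-based scoring with zero AI judges" concrete, here is a minimal sketch of what a deterministic check could look like. The `Rule` type, the `check` function, and the two example rules are hypothetical illustrations; the article does not publish WDCD's actual rules, so none of this should be read as the benchmark's implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical deterministic rule applied to a model response."""
    name: str
    pattern: str      # regular expression searched for in the response
    must_match: bool  # True: pattern is required; False: pattern is forbidden


def check(response: str, rules: list[Rule]) -> dict[str, bool]:
    """Return pass/fail per rule using only regex matching (no AI judge)."""
    results = {}
    for rule in rules:
        found = re.search(rule.pattern, response) is not None
        results[rule.name] = found == rule.must_match
    return results


# Assumed resource_limit-style constraint: queries must cap rows at 100
# and must not select every column.
rules = [
    Rule("limit_clause_present", r"\bLIMIT\s+100\b", True),
    Rule("no_select_star", r"SELECT\s+\*", False),
]

result = check("SELECT id, name FROM users LIMIT 100", rules)
violating = check("SELECT * FROM users", rules)
```

Because each rule reduces to a pattern match, the same response always receives the same verdict, which is the property a deterministic scorer needs.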

Each run executes three structured rounds: R1 establishes instruction acknowledgment, R2 inserts professional documents of 2,000 to 5,000 words as distractors to test resistance, and R3 performs a final constraint integrity check. The gap between R1 and R3 quantifies multi-turn commitment stability.
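The R1-to-R3 gap can be sketched as a relative change. The formula below is an assumption, since the article does not state how WDCD converts round scores into a decay percentage; the function name `decay_percent` and the sample scores are hypothetical.

```python
def decay_percent(r1_score: float, r3_score: float) -> float:
    """Relative change from R1 to R3 in percent (assumed formula).

    A negative value means the model kept fewer constraints in R3
    than it acknowledged in R1.
    """
    if r1_score <= 0:
        raise ValueError("r1_score must be positive")
    return (r3_score - r1_score) / r1_score * 100.0


# Hypothetical scores, not taken from Run #169.
dropped = decay_percent(80.0, 60.0)  # model lost a quarter of its R1 score
stable = decay_percent(50.0, 50.0)   # no change between rounds
```

Under this reading, a model with a high R1 score can still post a large negative decay, which matches the pattern of a top-scoring model with a steep decay figure.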

Top 3 Models — Run #169:

Grok 4 — 74.2 pts (decay: -25.8%)
Qwen3 Max — 67.2 pts (decay: -12.4%)
Gemini 2.5 Pro — 66.4 pts (decay: -3%)

Grok 4 took the top score despite registering a notable -25.8% decay curve, indicating strong R1/R2 performance that partially eroded by R3. Gemini 2.5 Pro recorded the most stable trajectory among the top three at just -3%, while Qwen3 Max balanced raw score and decay resistance to take second place.

Decay pattern observations: The 4.5% cross-model average masks substantial dispersion between individual models. GPT-o3 exhibited the most severe instruction decay in this run at -75%, indicating heavy erosion of the original constraints after distractor exposure in R2 and by the R3 integrity check. Doubao Pro (豆包 Pro) also fell into the worst-decay cohort at -58%, though it retained constraints better than the rest of that cohort.
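The dispersion can be illustrated with the five decay figures quoted in this article. This is only a sketch: the remaining six models' figures are not listed here, so the 4.5% cross-model average cannot be reproduced from this subset, and the variable names are illustrative.

```python
# Decay figures (percent) as quoted in this article for Run #169.
# Only 5 of the 11 tested models are listed, so this is a partial view.
reported_decay = {
    "Grok 4": -25.8,
    "Qwen3 Max": -12.4,
    "Gemini 2.5 Pro": -3.0,
    "GPT-o3": -75.0,
    "Doubao Pro": -58.0,
}

most_stable = max(reported_decay, key=reported_decay.get)
least_stable = min(reported_decay, key=reported_decay.get)

# Spread between the smallest and largest reported decay, in percentage points.
spread = reported_decay[most_stable] - reported_decay[least_stable]

print(f"most stable: {most_stable}, least stable: {least_stable}, spread: {spread} pts")
```

A spread this wide among just the reported models is why a single average says little about any individual model's behavior.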

The spread between best and worst decay outcomes in Run #169 reinforces a recurring WDCD finding: headline single-turn capability scores do not reliably predict multi-turn commitment behavior. Models scoring competitively in R1 can still lose substantial constraint fidelity once professional-document distractors are introduced.

Scenario coverage remained consistent with prior runs, spanning data boundary enforcement, resource limit adherence, business rule compliance, security constraints, and engineering specifications, all scored deterministically against rule-based criteria.

Full methodology and round-by-round scoring rules are documented at https://www.winzheng.com/yz-index/methodology.

Structured Run #169 data, including per-model R1/R2/R3 breakdowns, is available via the WDCD data API: https://www.winzheng.com/yz-index/api/v1/dcd.

