WDCD Run #161: Average Instruction Decay Hits -48.6% Across 11 Models, GPT-5.5 Leads at 89.2 Points

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how reliably large language models retain user-specified constraints across multi-turn dialogue, using a fully rule-based scoring pipeline with no AI judges. In Run #161 (2026-06-11), 11 models were evaluated on 30 questions spanning five real-world scenarios (data_boundary, resource_limit, business_rule, security, and engineering), producing an average commitment decay of -48.6% from Round 1 to Round 3.

Each evaluation cycle follows a fixed three-round structure: R1 verifies that the model acknowledges the instruction, R2 inserts professional distractor documents of 2,000 to 5,000 words to test constraint robustness, and R3 performs a final constraint integrity check. This design isolates instruction decay as a measurable degradation signal rather than a subjective quality judgment.
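The three-round structure can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the row-limit constraint, the `violates_limit` rule, the `run_cycle` helper, and the stub `forgetful_model` are hypothetical and are not taken from the WDCD pipeline. The point is only that each round's reply is checked by a deterministic rule rather than by an AI judge.

```python
import re
from typing import Callable, Dict, List

# Hypothetical interface: a model maps the full message history to a reply.
Model = Callable[[List[str]], str]


def violates_limit(reply: str, max_rows: int) -> bool:
    """Deterministic rule: flag any 'LIMIT n' in the reply where n > max_rows."""
    found = re.findall(r"LIMIT\s+(\d+)", reply, flags=re.IGNORECASE)
    return any(int(n) > max_rows for n in found)


def run_cycle(model: Model, prompts: List[str], max_rows: int) -> Dict[str, bool]:
    """Send three prompts in order (R1, R2, R3) and record rule compliance."""
    history: List[str] = []
    results: Dict[str, bool] = {}
    for round_name, prompt in zip(("R1", "R2", "R3"), prompts):
        history.append(prompt)
        reply = model(history)
        history.append(reply)
        results[round_name] = not violates_limit(reply, max_rows)
    return results


def forgetful_model(history: List[str]) -> str:
    """Stub model (illustrative only): drops the constraint once context is long."""
    total_words = sum(len(message.split()) for message in history)
    if total_words < 500:
        return "SELECT * FROM orders LIMIT 100"
    return "SELECT * FROM orders LIMIT 5000"


def example_prompts() -> List[str]:
    """Three prompts mirroring R1 (constraint), R2 (distractor), R3 (final check)."""
    distractor = " ".join(["filler"] * 2000)  # stands in for a 2,000-word document
    return [
        "Never return more than 100 rows. Show recent orders.",
        distractor + "\n\nNow show last month's orders.",
        "Show all orders for the audit.",
    ]
```

With the stub above, `run_cycle(forgetful_model, example_prompts(), max_rows=100)` passes R1 and fails R2 and R3, which is the kind of degradation the benchmark is designed to expose.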

Top performers in Run #161:

  • GPT-5.5 — 89.2 points, -67% decay
  • Grok 4 — 85.8 points, -53% decay
  • Qwen3 Max — 85.8 points, -53% decay

Although GPT-5.5 took the highest overall score, its -67% decay indicates that even the leading model loses a substantial portion of its Round 1 commitments by Round 3. Grok 4 and Qwen3 Max tied on both total score and decay rate, suggesting comparable multi-turn commitment profiles despite different architectures.
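The decay formula itself is not given in this summary; the methodology page is the authority on it. As a rough sketch only, assuming decay is the relative change in a model's rule-based score from Round 1 to Round 3, expressed as a percentage, the per-model and cohort-average figures could be computed as follows. The function names are hypothetical.

```python
from typing import Iterable, Tuple


def decay_percent(r1_score: float, r3_score: float) -> float:
    """Relative change from Round 1 to Round 3, in percent (assumed definition)."""
    if r1_score == 0:
        raise ValueError("Round 1 score must be non-zero to compute relative decay")
    return (r3_score - r1_score) * 100.0 / r1_score


def mean_decay(score_pairs: Iterable[Tuple[float, float]]) -> float:
    """Unweighted average of per-model decay over (r1_score, r3_score) pairs."""
    values = [decay_percent(r1, r3) for r1, r3 in score_pairs]
    if not values:
        raise ValueError("at least one (r1, r3) pair is required")
    return sum(values) / len(values)
```

Under this assumed definition, a model scoring 90 in Round 1 and 45 in Round 3 would show a decay of -50%. Whether the cohort average is weighted by question or by model is also an assumption here.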

Decay-resistance results diverged sharply from raw score rankings. Doubao Pro (豆包 Pro) recorded the strongest decay resistance at -107.8%, meaning its measured constraint adherence in later rounds exceeded its Round 1 baseline. This unusual pattern could reflect either late-round constraint reinforcement or conservative early-round behavior. At the opposite end, GPT-o3 posted the worst figure on the resistance metric, -7.2%, making it the model most vulnerable in this run to constraint erosion under distractor pressure.

The -48.6% average across the cohort reinforces a pattern observed in prior WDCD cycles: multi-turn commitment is not a solved capability, even for frontier systems. Long-context distractors in R2 remain the primary stress point, and decay magnitude does not correlate cleanly with overall score. High-scoring models can still exhibit steep instruction decay, while mid-ranked models occasionally show stronger constraint persistence.

Full methodology, scenario definitions, and scoring rules are documented at https://www.winzheng.com/yz-index/methodology. Structured run data, including per-model and per-scenario breakdowns, is available via the public data API at https://www.winzheng.com/yz-index/api/v1/dcd.
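For readers who want to pull the run data programmatically, a minimal sketch using only the Python standard library follows. The endpoint URL is the one given above; the assumptions are that it answers a plain GET without authentication and returns JSON. No response fields are assumed, so the payload is returned as parsed and left for the caller to inspect. The helper names are hypothetical.

```python
import json
import urllib.request

DCD_API_URL = "https://www.winzheng.com/yz-index/api/v1/dcd"


def build_request(url: str = DCD_API_URL) -> urllib.request.Request:
    """Build a GET request that asks for JSON (the response format is assumed)."""
    return urllib.request.Request(url, headers={"Accept": "application/json"})


def fetch_run_data(timeout: float = 10.0):
    """Fetch and parse the payload; raises a JSON error if the body is not JSON."""
    with urllib.request.urlopen(build_request(), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_run_data()` and printing the top-level keys or item count is a reasonable first step before relying on any particular field, since the schema is documented on the site rather than here.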