WDCD Run #185: Average Instruction Decay Hits -57.5% Across 11 Models, Qwen3 Max Leads at 92.5 Points
WDCD Run #185 (2026-06-17) measured multi-turn commitment across 11 models, recording an average instruction decay of -5
WDCD Run #171 (2026-06-14) measured multi-turn commitment across 11 frontier models, recording an average instruction de
WDCD Run #169 (2026-06-13) evaluated 11 AI models on multi-turn commitment integrity, with Grok 4 topping the leaderboar
WDCD Run #164 (2026-06-11) evaluated 11 frontier models across three dialogue rounds, recording an average commitment de
WDCD Run #161 (2026-06-11) evaluated 11 large language models on multi-turn commitment integrity, recording an average i
WDCD Run #157 (2026-06-10) recorded a 47.7% average commitment decay across 11 models, with Claude Sonnet 4.6, Gemini 2.
WDCD Run #146 (2026-06-03) tested 11 frontier models on multi-turn commitment integrity, recording an average instructio
WDCD Run #140 (2026-05-31) evaluated 11 frontier models on multi-turn commitment integrity, finding an average instructi
WDCD Run #135 (2026-05-27) evaluated 11 large language models across three dialogue rounds, finding an average commitmen
WDCD Run #125 (2026-05-20) tested 11 large language models on multi-turn commitment integrity, with average instruction
WDCD Run #120 (2026-05-17) measured multi-turn commitment across 11 frontier models, recording an average instruction de
WDCD Run #115 evaluated 11 frontier models on multi-turn commitment integrity, recording a 49.2% average instruction dec
WDCD Run #100 (2026-05-03) tested 11 frontier models on multi-turn commitment integrity, recording an average instructio