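守约测试 (Constraint-Keeping Test): Multi-Turn Constraint Retention Leaderboard
Each model is given a constraint, and a three-round conversation measures whether it keeps obeying that constraint. The less a model's score decays across rounds, the better it ranks.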
Overall Score Leaderboard
| # | Model | Overall | R1 avg | R2 avg | R3 avg | Decay rate |
|---|---|---|---|---|---|---|
| 1 | Qwen3 Max | 65 | 1 | 0.9 | 0.7 | 30% |
| 2 | Claude Sonnet 4.6 | 62.5 | 1 | 1 | 0.5 | 50% |
| 3 | DeepSeek V4 Pro | 62.5 | 1 | 0.8 | 0.7 | 30% |
| 4 | 文心一言 4.5 | 62.5 | 0.8 | 0.9 | 0.8 | 0% |
| 5 | GPT-o3 | 62.5 | 1 | 0.9 | 0.6 | 40% |
| 6 | Claude Opus 4.7 | 60 | 1 | 0.8 | 0.6 | 40% |
| 7 | Gemini 2.5 Pro | 60 | 1 | 0.9 | 0.5 | 50% |
| 8 | Gemini 3.1 Pro | 60 | 1 | 1 | 0.4 | 60% |
| 9 | 豆包 Pro | 55 | 0.7 | 1 | 0.5 | 28.6% |
| 10 | GPT-5.5 | 55 | 1 | 0.8 | 0.4 | 60% |
| 11 | Grok 4 | 50 | 1 | 0.8 | 0.2 | 80% |
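The page does not say how the Overall and Decay rate columns are derived. Every row in the table is, however, consistent with two simple formulas: decay rate = (R1 − R3) / R1, and overall = 25 × (R1 + R2 + R3). The sketch below treats both formulas as assumptions inferred from the data, not as the benchmark's documented method; the function names `decay_rate` and `overall_score` are hypothetical.

```python
# Round averages (R1, R2, R3) copied from a few rows of the leaderboard table.
ROUND_AVERAGES = {
    "Qwen3 Max": (1.0, 0.9, 0.7),
    "文心一言 4.5": (0.8, 0.9, 0.8),
    "豆包 Pro": (0.7, 1.0, 0.5),
    "Grok 4": (1.0, 0.8, 0.2),
}


def decay_rate(r1: float, r3: float) -> float:
    """Assumed formula: relative drop from round 1 to round 3, in percent."""
    return round((r1 - r3) / r1 * 100, 1)


def overall_score(r1: float, r2: float, r3: float) -> float:
    """Assumed formula: 25 times the sum of the three round averages."""
    return round(25 * (r1 + r2 + r3), 1)


for model, (r1, r2, r3) in ROUND_AVERAGES.items():
    print(f"{model}: overall={overall_score(r1, r2, r3)}, "
          f"decay={decay_rate(r1, r3)}%")
```

Because the assumed decay rate compares only R1 and R3, a model that starts lower can show little or no decay: 文心一言 4.5 scores 0.8 in both rounds and therefore reports 0%, even though its R1 average trails most other models.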
Decay curves (top 5): Qwen3 Max, Claude Sonnet 4.6, DeepSeek V4 Pro, 文心一言 4.5, GPT-o3
Scores by Scenario Category
| Model | business_rule | data_boundary | engineering | resource_limit | security |
|---|---|---|---|---|---|
| Qwen3 Max | 3 | 3.5 | 2 | 2 | 2.5 |
| Claude Sonnet 4.6 | 3 | 2.5 | 2 | 2.5 | 2.5 |
| DeepSeek V4 Pro | 3.5 | 2.5 | 2 | 1.5 | 3 |
| 文心一言 4.5 | 2.5 | 2.5 | 2.5 | 2.5 | 2.5 |
| GPT-o3 | 2.5 | 2.5 | 2 | 2 | 3.5 |
| Claude Opus 4.7 | 2.5 | 2.5 | 2.5 | 3 | 1.5 |
| Gemini 2.5 Pro | 3 | 3 | 1.5 | 2 | 2.5 |
| Gemini 3.1 Pro | 3 | 2.5 | 2 | 2 | 2.5 |
| 豆包 Pro | 3 | 2 | 1.5 | 2 | 2.5 |
| GPT-5.5 | 1.5 | 2.5 | 2 | 2 | 3 |
| Grok 4 | 2.5 | 2 | 1.5 | 2 | 2 |
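The category scores appear to tie back to the overall score: for every model, 5 × (sum of the five category scores) equals the Overall value on the leaderboard. This relationship is inferred from the numbers and is not stated on the page. The following sketch checks it for a few rows and also reports each model's weakest category; the helper names are hypothetical.

```python
CATEGORIES = ["business_rule", "data_boundary", "engineering",
              "resource_limit", "security"]

# Category scores copied from a few rows of the matrix above.
CATEGORY_SCORES = {
    "Qwen3 Max": [3, 3.5, 2, 2, 2.5],
    "Claude Opus 4.7": [2.5, 2.5, 2.5, 3, 1.5],
    "Grok 4": [2.5, 2, 1.5, 2, 2],
}

# Overall scores copied from the leaderboard table.
LEADERBOARD_OVERALL = {"Qwen3 Max": 65, "Claude Opus 4.7": 60, "Grok 4": 50}


def overall_from_categories(scores: list[float]) -> float:
    """Assumed relationship: overall = 5 x sum of the category scores."""
    return 5 * sum(scores)


def weakest_category(scores: list[float]) -> str:
    """Return the first category with the lowest score."""
    return CATEGORIES[scores.index(min(scores))]


for model, scores in CATEGORY_SCORES.items():
    matches = overall_from_categories(scores) == LEADERBOARD_OVERALL[model]
    print(f"{model}: overall matches={matches}, "
          f"weakest={weakest_category(scores)}")
```

When several categories tie for the lowest score, `weakest_category` returns the one listed first.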