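Constraint-Keeping Test (守约测试): Multi-Turn Constraint Retention Leaderboard
Each AI model is given a constraint, and a 3-round conversation measures whether it keeps honoring that constraint. The less the score decays across rounds, the better.
Overall Score Leaderboard
| # | Model | Overall | R1 avg | R2 avg | R3 avg | Decay rate |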
|---|---|---|---|---|---|---|
| 1 | Qwen3 Max | 92.5 | 1 | 0.8 | 1.9 | -90% |
| 2 | Claude Sonnet 4.6 | 90 | 1 | 0.8 | 1.8 | -80% |
| 3 | DeepSeek V4 Pro | 87.5 | 1 | 0.8 | 1.7 | -70% |
| 4 | Claude Opus 4.7 | 85 | 1 | 0.8 | 1.6 | -60% |
| 5 | 文心一言 4.5 | 82.5 | 0.9 | 0.5 | 1.9 | -111.1% |
| 6 | Grok 4 | 82.5 | 1 | 0.8 | 1.5 | -50% |
| 7 | Gemini 2.5 Pro | 80 | 1 | 0.9 | 1.3 | -30% |
| 8 | Gemini 3.1 Pro | 80 | 1 | 0.7 | 1.5 | -50% |
| 9 | GPT-5.5 | 77.5 | 1 | 0.8 | 1.3 | -30% |
| 10 | GPT-o3 | 70 | 1 | 0.9 | 0.9 | 10% |
| 11 | 豆包 Pro | 62.5 | 0.7 | 0.6 | 1.2 | -71.4% |
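The leaderboard does not say how the decay rate is computed. Every row above is, however, consistent with the relative change from the round-1 average to the round-3 average, (R1 − R3) / R1, expressed as a percentage. The sketch below assumes that formula and checks it against a few rows; the function name `decay_rate` is hypothetical, not part of the benchmark.

```python
def decay_rate(r1: float, r3: float) -> float:
    """Percentage drop from the R1 average to the R3 average.

    Assumed formula (inferred from the table, not documented by the source):
    (R1 - R3) / R1 * 100. A negative value means the score rose by round 3.
    """
    return round((r1 - r3) / r1 * 100, 1)


# (R1 avg, R3 avg, published decay rate) copied from the leaderboard rows.
rows = {
    "Qwen3 Max": (1.0, 1.9, -90.0),
    "文心一言 4.5": (0.9, 1.9, -111.1),
    "GPT-o3": (1.0, 0.9, 10.0),
    "豆包 Pro": (0.7, 1.2, -71.4),
}

for model, (r1, r3, published) in rows.items():
    computed = decay_rate(r1, r3)
    print(f"{model}: computed {computed}%, published {published}%")
```

Under this reading, GPT-o3 is the only model with a positive decay rate, meaning it is the only one whose round-3 average is lower than its round-1 average.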
Decay Curves (Top 5)
[Chart: decay curves for Qwen3 Max, Claude Sonnet 4.6, DeepSeek V4 Pro, Claude Opus 4.7, and 文心一言 4.5]
Per-Scenario Score Matrix
| Model | business_rule | data_boundary | engineering | resource_limit | security |
|---|---|---|---|---|---|
| Qwen3 Max | 3.5 | 3.5 | 4 | 4 | 3.5 |
| Claude Sonnet 4.6 | 3.5 | 4 | 4 | 3.5 | 3 |
| DeepSeek V4 Pro | 3 | 3.5 | 4 | 3.5 | 3.5 |
| Claude Opus 4.7 | 3 | 3.5 | 3.5 | 4 | 3 |
| 文心一言 4.5 | 3.5 | 3.5 | 3.5 | 3 | 3 |
| Grok 4 | 3 | 3.5 | 2.5 | 4 | 3.5 |
| Gemini 2.5 Pro | 3.5 | 2 | 4 | 3 | 3.5 |
| Gemini 3.1 Pro | 2.5 | 3 | 4 | 3 | 3.5 |
| GPT-5.5 | 2.5 | 3 | 3.5 | 4 | 2.5 |
| GPT-o3 | 4 | 2.5 | 3 | 2.5 | 2 |
| 豆包 Pro | 1.5 | 3.5 | 2 | 2.5 | 3 |
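The overall scores in the leaderboard appear to follow from this matrix. For every model, the sum of the five scenario scores divided by 20 and multiplied by 100 matches the published overall score. The sketch below assumes each scenario is scored out of 4 (the highest value in the matrix); neither that maximum nor the helper name `overall_from_scenarios` comes from the source.

```python
def overall_from_scenarios(scores: list[float], max_per_scenario: float = 4.0) -> float:
    """Scale the summed scenario scores to a 0-100 overall score.

    Assumption: each scenario has a maximum of 4 points. This is inferred
    from the matrix and is not stated by the leaderboard.
    """
    return round(sum(scores) / (max_per_scenario * len(scores)) * 100, 1)


# Scenario order: business_rule, data_boundary, engineering,
# resource_limit, security. Values are copied from the matrix rows.
matrix = {
    "Qwen3 Max": [3.5, 3.5, 4, 4, 3.5],   # published overall: 92.5
    "GPT-o3": [4, 2.5, 3, 2.5, 2],        # published overall: 70
    "豆包 Pro": [1.5, 3.5, 2, 2.5, 3],     # published overall: 62.5
}

for model, scores in matrix.items():
    print(model, overall_from_scenarios(scores))
```

Read this way, the overall score weights all five scenarios equally, so a single weak category (for example GPT-o3 on security) costs as much as a weak score anywhere else.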