January 24, 2025
R1+Sonnet set SOTA on aider’s polyglot benchmark
Aider supports using a pair of models for coding:
- An Architect model is asked to describe how to solve the coding problem. Thinking/reasoning models often work well in this role.
- An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.
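The two roles above can be pictured as a simple two-stage pipeline. The following is a minimal sketch, not aider's actual implementation: `ModelFn`, `architect_then_edit`, the prompt wording, and the stand-in models are all hypothetical names made up for illustration.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a model reply.
ModelFn = Callable[[str], str]


def architect_then_edit(problem: str, source: str,
                        architect: ModelFn, editor: ModelFn) -> str:
    """Two-stage flow: the architect plans, the editor writes edit instructions."""
    # Stage 1: the architect only describes how to solve the problem.
    plan = architect(
        f"Describe how to solve this coding problem:\n{problem}\n\nCode:\n{source}"
    )
    # Stage 2: the editor turns that description into concrete edits to the file.
    return editor(
        f"Apply this solution as specific code edits:\n{plan}\n\nCode:\n{source}"
    )


# Stand-in models (assumptions) so the sketch runs without any API calls.
def fake_architect(prompt: str) -> str:
    return "Rename function foo to bar."


def fake_editor(prompt: str) -> str:
    # Echo the plan line back to show that the editor received it.
    return "EDITS BASED ON: " + prompt.splitlines()[1]


result = architect_then_edit("rename foo", "def foo(): pass",
                             fake_architect, fake_editor)
print(result)
```

The point of the split is that each model only has to do one job: the architect never needs to produce well-formed edits, and the editor never needs to solve the problem from scratch.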
R1 as architect with Sonnet as editor has set a new SOTA of 64.0% on the aider polyglot benchmark. The pair achieves this at 14X lower cost than the previous o1 SOTA result.
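As a quick check of the 14X figure, the arithmetic can be reproduced from the costs and scores in the results table in this post. The variable names are only illustrative.

```python
# Figures taken from the results table in this post.
o1_cost = 186.5          # total cost in dollars for o1 alone
r1_sonnet_cost = 13.29   # total cost in dollars for R1+Sonnet
o1_score = 61.7          # percent completed correctly, o1 alone
r1_sonnet_score = 64.0   # percent completed correctly, R1+Sonnet

cost_ratio = o1_cost / r1_sonnet_cost
score_gain = r1_sonnet_score - o1_score

print(f"cost ratio: {cost_ratio:.1f}x")
print(f"score gain: {score_gain:.1f} points")
```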
o1 paired with Sonnet didn’t produce better results than just using o1 alone. Using various other models as editor didn’t seem to improve o1 or R1 versus their solo scores. This is in contrast to the first wave of thinking models like o1-preview and o1-mini, which improved when paired with many different editor models.
Try it
Once you install aider, you can use it with R1 and Sonnet like this:
export DEEPSEEK_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>
aider --architect --model r1 --editor-model sonnet
Or if you have an OpenRouter account:
export OPENROUTER_API_KEY=<your-key>
aider --architect --model openrouter/deepseek/deepseek-r1 --editor-model openrouter/anthropic/claude-3.5-sonnet
Thinking output
There has been some recent discussion about extracting the <think> tokens from R1’s responses and feeding them to Sonnet.
That was an interesting experiment, for sure.
To be clear, the results above are not using R1’s thinking tokens, just the normal final output. R1 is configured in aider’s standard architect role with Sonnet as editor. The benchmark results that used the thinking tokens appear to be worse than the architect/editor results shared here.
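To make the distinction concrete, here is a minimal sketch of separating thinking tokens from the final output. It assumes the raw response wraps its reasoning in literal `<think>...</think>` tags; `split_thinking` and the sample text are hypothetical, and this is not how aider itself processes R1's responses.

```python
import re

# Assumption: reasoning is delimited by literal <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(response: str) -> tuple[str, str]:
    """Return (thinking_text, final_output) from a raw model response."""
    # Collect the contents of every <think> block.
    thinking = "\n".join(m.strip() for m in THINK_BLOCK.findall(response))
    # Remove the blocks to leave only the normal final output.
    final = THINK_BLOCK.sub("", response).strip()
    return thinking, final


sample = "<think>Consider edge cases first.</think>\nUse a dict keyed by id."
thinking, final = split_thinking(sample)
print(final)
```

In these terms, the results in this post correspond to passing only `final` to the editor, while the experiment under discussion passed `thinking` instead.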
Results
Model | Percent completed correctly | Percent using correct edit format | Command | Edit format | Total Cost |
---|---|---|---|---|---|
R1+Sonnet | 64.0% | 100.0% | aider --architect --model r1 --editor-model sonnet | architect | $13.29 |
o1 | 61.7% | 91.5% | aider --model o1 | diff | $186.5 |
R1 | 56.9% | 96.9% | aider --model r1 | diff | $5.42 |
Sonnet | 51.6% | 99.6% | aider --model sonnet | diff | $14.41 |
DeepSeek V3 | 48.4% | 98.7% | aider --model deepseek | diff | $0.34 |