R1+Sonnet set SOTA on aider’s polyglot benchmark

Aider supports using a pair of models for coding:

  • An Architect model is asked to describe how to solve the coding problem. Thinking/reasoning models often work well in this role.
  • An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.
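
The division of labor above can be sketched as a two-step pipeline. This is only an illustration, not aider’s actual implementation: `run_architect_editor`, the `Model` type, and the prompt wording are hypothetical, and the two stub functions stand in for real model calls.

```python
from typing import Callable

# Hypothetical type: a "model" is any function that maps a prompt to a reply.
Model = Callable[[str], str]


def run_architect_editor(task: str, architect: Model, editor: Model) -> str:
    """Ask the architect for a plan, then ask the editor to turn it into edits."""
    # Step 1: the architect describes how to solve the problem.
    plan = architect(f"Describe how to solve this coding problem:\n{task}")
    # Step 2: the editor converts that description into concrete edit instructions.
    return editor(f"Produce specific code edits that apply this solution:\n{plan}")


# Stub models so the sketch runs without any API access.
def stub_architect(prompt: str) -> str:
    return "PLAN: rename function foo to bar"


def stub_editor(prompt: str) -> str:
    # Echo the last prompt line to show the editor sees the architect's output.
    return "EDITS for -> " + prompt.splitlines()[-1]


result = run_architect_editor("Rename foo to bar", stub_architect, stub_editor)
print(result)
```

The point of the split is that the editor never sees the original task in this sketch, only the architect’s description, so each model can be chosen for what it does best.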

R1 as architect with Sonnet as editor has set a new SOTA of 64.0% on the aider polyglot benchmark. The pairing achieves this at 14X lower cost than the previous o1 SOTA result.
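
The 14X figure follows from the total costs reported in the Results table at the end of this post: $186.5 for o1 alone and $13.29 for R1+Sonnet. A quick check of the arithmetic (the variable names are just for illustration):

```python
# Total benchmark costs in USD, taken from the Results table.
o1_cost = 186.5
r1_sonnet_cost = 13.29

# How many times more expensive the o1 run was than the R1+Sonnet run.
ratio = o1_cost / r1_sonnet_cost
print(f"o1 cost / R1+Sonnet cost = {ratio:.1f}x")
```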

o1 paired with Sonnet didn’t produce better results than just using o1 alone. Using various other models as editor didn’t seem to improve o1 or R1 versus their solo scores. This is in contrast to the first wave of thinking models like o1-preview and o1-mini, which improved when paired with many different editor models.

Try it

Once you install aider, you can pair R1 and Sonnet like this:

export DEEPSEEK_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>

aider --architect --model r1 --editor-model sonnet

Or if you have an OpenRouter account:

export OPENROUTER_API_KEY=<your-key>

aider --architect --model openrouter/deepseek/deepseek-r1 --editor-model openrouter/anthropic/claude-3.5-sonnet

Thinking output

There has been some recent discussion about extracting the <think> tokens from R1’s responses and feeding them to Sonnet. That was an interesting experiment, for sure.

To be clear, the results above are not using R1’s thinking tokens, just the normal final output. R1 is configured in aider’s standard architect role with Sonnet as editor. The benchmark results that used the thinking tokens appear to be worse than the architect/editor results shared here.
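
To make the distinction concrete, separating the thinking from the normal final output could look like the sketch below. It assumes the raw response wraps its reasoning in a `<think>...</think>` block, as in the discussion mentioned above. `split_thinking` is a hypothetical helper, not part of aider, and it was not used to produce the benchmark results.

```python
import re

# Matches a <think>...</think> block, including any newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(response: str) -> tuple[str, str]:
    """Return (thinking, final_output) from a raw model response."""
    match = THINK_RE.search(response)
    thinking = match.group(1).strip() if match else ""
    # Drop the thinking block; what remains is the final output.
    final_output = THINK_RE.sub("", response).strip()
    return thinking, final_output


raw = "<think>\nMaybe use a dict.\n</think>\nUse a dict keyed by user id."
thinking, final_output = split_thinking(raw)
print(final_output)
```

In terms of this sketch, the architect/editor results here pass only `final_output` to the editor, while the experiment described above passed along the `thinking` part.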

Results

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format | Total cost |
|---|---|---|---|---|---|
| R1+Sonnet | 64.0% | 100.0% | aider --architect --model r1 --editor-model sonnet | architect | $13.29 |
| o1 | 61.7% | 91.5% | aider --model o1 | diff | $186.5 |
| R1 | 56.9% | 96.9% | aider --model r1 | diff | $5.42 |
| Sonnet | 51.6% | 99.6% | aider --model sonnet | diff | $14.41 |
| DeepSeek V3 | 48.4% | 98.7% | aider --model deepseek | diff | $0.34 |