January 24, 2025

R1+Sonnet set SOTA on aider’s polyglot benchmark

Aider supports using a pair of models for coding:

An Architect model is asked to describe how to solve the coding problem. Thinking/reasoning models often work well in this role.
An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.
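The two roles above can be sketched as a simple two-step pipeline. This is only an illustration of the idea, not aider's actual implementation: the `Model` type, `architect_then_edit`, the prompt wording, and the stub models are all hypothetical names invented for this example.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a reply string.
Model = Callable[[str], str]


def architect_then_edit(task: str, source: str, architect: Model, editor: Model) -> str:
    """Two-step flow: the architect proposes a solution, the editor turns it into edits."""
    # Step 1: the architect describes how to solve the coding problem.
    plan = architect(f"Describe how to solve this coding task:\n{task}\n\nSource:\n{source}")
    # Step 2: the editor receives the architect's solution and emits edit instructions.
    return editor(f"Turn this solution into specific edits to the source:\n{plan}\n\nSource:\n{source}")


# Stub models (assumptions for illustration) so the sketch runs without API keys.
def fake_architect(prompt: str) -> str:
    return "Rename function foo to bar."


def fake_editor(prompt: str) -> str:
    # The second line of the editor prompt is the architect's plan.
    plan = prompt.splitlines()[1]
    return f"EDIT: {plan}"


edits = architect_then_edit("rename foo", "def foo(): pass", fake_architect, fake_editor)
print(edits)
```

Keeping the two roles behind the same callable type makes it easy to swap in a different architect or editor, which is the kind of pairing experiment described in this post.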

R1 as architect with Sonnet as editor has set a new SOTA of 64.0% on the aider polyglot benchmark. The pair achieves this at roughly 14X lower cost than the previous o1 SOTA result ($13.29 versus $186.5).
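As a quick check on the 14X figure, the ratio can be computed from the total costs reported in the results table of this post. The variable names are just for illustration.

```python
# Total benchmark costs in USD, as reported in the results table of this post.
o1_cost = 186.5
r1_sonnet_cost = 13.29

# How many times more the o1 run cost than the R1+Sonnet run.
ratio = o1_cost / r1_sonnet_cost
print(f"o1 cost about {ratio:.1f}x as much as R1+Sonnet")
```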

o1 paired with Sonnet didn’t produce better results than just using o1 alone. Using various other models as editor didn’t seem to improve o1 or R1 versus their solo scores. This is in contrast to the first wave of thinking models like o1-preview and o1-mini, which improved when paired with many different editor models.

o1 was run with reasoning effort set to high for these tests.

Try it

Once you install aider, you can pair R1 and Sonnet like this:

export DEEPSEEK_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>

aider --architect --model r1 --editor-model sonnet

Or if you have an OpenRouter account:

export OPENROUTER_API_KEY=<your-key>

aider --architect --model openrouter/deepseek/deepseek-r1 --editor-model openrouter/anthropic/claude-3.5-sonnet

Thinking output

There has been some recent discussion about extracting the <think> tokens from R1’s responses and feeding them to Sonnet. That was an interesting experiment, for sure.

To be clear, the results above are not using R1’s thinking tokens, just the normal final output. R1 is configured in aider’s standard architect role with Sonnet as editor. The benchmark results that used the thinking tokens appear to be worse than the architect/editor results shared here.
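To make the distinction concrete, here is a minimal sketch of separating thinking text from the final output. It assumes the raw response wraps its reasoning in literal `<think>...</think>` tags; `split_thinking` and the sample response are hypothetical and are not part of aider.

```python
import re

# Assumption: reasoning appears inside literal <think>...</think> tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(response: str) -> tuple[str, str]:
    """Return (thinking, final_output) from a raw model response."""
    # Collect the text of every <think> block.
    thinking = "\n".join(m.strip() for m in THINK_RE.findall(response))
    # Whatever remains after removing those blocks is the normal final output.
    final = THINK_RE.sub("", response).strip()
    return thinking, final


raw = "<think>Maybe use a dict here.</think>\nUse a dict keyed by user id."
thinking, final = split_thinking(raw)
print(final)
```

In these terms, the benchmark results in this post correspond to passing only `final` to the editor, while the experiments under discussion passed along `thinking`.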

Results

Model	Percent completed correctly	Percent using correct edit format	Command	Edit format	Total Cost
R1+Sonnet	64.0%	100.0%	`aider --architect --model r1 --editor-model sonnet`	architect	$13.29
o1	61.7%	91.5%	`aider --model o1`	diff	$186.5
R1	56.9%	96.9%	`aider --model r1`	diff	$5.42
Sonnet	51.6%	99.6%	`aider --model sonnet`	diff	$14.41
DeepSeek V3	48.4%	98.7%	`aider --model deepseek`	diff	$0.34