Aider LLM Leaderboards

Aider works best with LLMs that are good at *editing* code, not just good at writing code. To evaluate an LLM's editing skill, aider uses benchmarks that assess how reliably a model can follow the system prompt and successfully apply edits to source code.

The leaderboards report the results from a number of popular LLMs. While aider can connect to almost any LLM, it works best with models that score well on the benchmarks.
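As a minimal sketch of trying one of the well-scoring models yourself (this assumes Python and pip are available; the API key value is a placeholder, not a real credential):

```shell
# Install aider, then launch it against one of the leaderboard models.
# The --model flag values come from the Command column in the table below.
python -m pip install aider-chat
export DEEPSEEK_API_KEY=your-key-here   # placeholder; use your own key
aider --model deepseek/deepseek-reasoner
```

The same pattern applies to the other rows: set the provider's API key environment variable, then pass the row's `--model` (and any extra flags) to `aider`.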

Polyglot leaderboard

Aider’s polyglot benchmark asks the LLM to edit source files to complete 225 coding exercises from Exercism. It contains exercises in six popular programming languages: C++, Go, Java, JavaScript, Python and Rust. The 225 exercises were purposely selected to be the hardest that Exercism offered in those languages, to provide a strong coding challenge to LLMs.

This benchmark measures the LLM’s coding ability in popular languages, and whether it can write new code that integrates into existing code. The model also has to successfully apply all its changes to the source file without human intervention.

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format | Total cost |
|---|---|---|---|---|---|
| DeepSeek R1 + claude-3-5-sonnet-20241022 | 64.0% | 100.0% | `aider --architect --model r1 --editor-model sonnet` | architect | $13.29 |
| o1-2024-12-17 (high) | 61.7% | 91.5% | `aider --model openrouter/openai/o1` | diff | $186.5 |
| o3-mini (high) | 60.4% | 93.3% | `aider --model o3-mini --reasoning-effort high` | diff | $18.16 |
| DeepSeek R1 | 56.9% | 96.9% | `aider --model deepseek/deepseek-reasoner` | diff | $5.42 |
| o3-mini (medium) | 53.8% | 95.1% | `aider --model o3-mini` | diff | $8.86 |
| claude-3-5-sonnet-20241022 | 51.6% | 99.6% | `aider --model claude-3-5-sonnet-20241022` | diff | $14.41 |
| DeepSeek Chat V3 | 48.4% | 98.7% | `aider --model deepseek/deepseek-chat` | diff | $0.34 |
| gemini-exp-1206 | 38.2% | 98.2% | `aider --model gemini/gemini-exp-1206` | whole | ? |
| o1-mini-2024-09-12 | 32.9% | 96.9% | `aider --model o1-mini` | whole | $18.58 |
| claude-3-5-haiku-20241022 | 28.0% | 91.1% | `aider --model claude-3-5-haiku-20241022` | diff | $6.06 |
| gpt-4o-2024-08-06 | 23.1% | 94.2% | `aider --model gpt-4o-2024-08-06` | diff | $7.03 |
| gemini-2.0-flash-exp | 22.2% | 100.0% | `aider --model gemini/gemini-2.0-flash-exp` | whole | ? |
| qwen-max-2025-01-25 | 21.8% | 90.2% | `OPENAI_API_BASE=https://dashscope-intl.aliyuncs.com/compatible-mode/v1 aider --model openai/qwen-max-2025-01-25` | diff | $0.0 |
| gemini-2.0-flash-thinking-exp-01-21 | 18.2% | 77.8% | `aider --model gemini/gemini-2.0-flash-thinking-exp-01-21` | diff | ? |
| gpt-4o-2024-11-20 | 18.2% | 95.1% | `aider --model gpt-4o-2024-11-20` | diff | $6.74 |
| DeepSeek Chat V2.5 | 17.8% | 92.9% | `aider --model deepseek/deepseek-chat` | diff | $0.51 |
| Qwen2.5-Coder-32B-Instruct | 16.4% | 99.6% | `aider --model openai/Qwen2.5-Coder-32B-Instruct` | whole | ? |
| yi-lightning | 12.9% | 92.9% | `aider --model openai/yi-lightning` | whole | ? |
| Codestral 25.01 | 11.1% | 100.0% | `aider --model mistral/codestral-latest` | whole | $1.98 |
| Qwen2.5-Coder-32B-Instruct | 8.0% | 71.6% | `aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct` (via hyperbolic) | diff | ? |
| gpt-4o-mini-2024-07-18 | 3.6% | 100.0% | `aider --model gpt-4o-mini-2024-07-18` | whole | $0.32 |

Aider polyglot benchmark results
