Aider LLM Leaderboards
Aider works best with LLMs that are good at editing code, not just good at writing code. To evaluate an LLM's editing skill, aider uses benchmarks that assess a model's ability to consistently follow the system prompt and successfully edit code.
The leaderboards report the results from a number of popular LLMs. While aider can connect to almost any LLM, it works best with models that score well on the benchmarks.
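Any model on the leaderboard can be tried with the command shown in its table row. For example, a minimal session with one of the top-scoring models might look like this (this assumes aider is installed via its `aider-chat` pip package and that the relevant provider API key is exported in your environment):

```shell
# Install aider (assumes Python and pip are available)
pip install aider-chat

# An API key for the chosen provider must be set, e.g.:
#   export ANTHROPIC_API_KEY=<your key>

# Launch aider in a git repo with a leaderboard model
aider --model claude-3-5-sonnet-20241022
```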
The original aider code editing leaderboard has been replaced by this new, much more challenging polyglot leaderboard.
Polyglot leaderboard
Aider’s polyglot benchmark asks the LLM to edit source files to complete 225 coding exercises from Exercism. It contains exercises in six popular programming languages: C++, Go, Java, JavaScript, Python and Rust. The 225 exercises were purposely selected to be the hardest that Exercism offered in those languages, to provide a strong coding challenge to LLMs.
This benchmark measures the LLM’s coding ability in popular languages, and whether it can write new code that integrates into existing code. The model also has to successfully apply all its changes to the source file without human intervention.
| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
|---|---|---|---|---|
| o1-2024-12-17 (high) | 61.7% | 91.5% | `aider --model openrouter/openai/o1` | diff |
| DeepSeek Chat V3 | 48.4% | 98.7% | `aider --model deepseek/deepseek-chat` | diff |
| claude-3-5-sonnet-20241022 | 45.3% | 100.0% | `aider --model claude-3-5-sonnet-20241022` | diff |
| gemini-exp-1206 | 38.2% | 98.2% | `aider --model gemini/gemini-exp-1206` | whole |
| o1-mini-2024-09-12 | 32.9% | 96.9% | `aider --model o1-mini` | whole |
| claude-3-5-haiku-20241022 | 28.0% | 91.1% | `aider --model claude-3-5-haiku-20241022` | diff |
| gemini-2.0-flash-exp | 22.2% | 100.0% | `aider --model gemini/gemini-2.0-flash-exp` | whole |
| DeepSeek Chat V2.5 | 17.8% | 92.9% | `aider --model deepseek/deepseek-chat` | diff |
| Qwen2.5-Coder-32B-Instruct | 16.4% | 99.6% | `aider --model openai/Qwen2.5-Coder-32B-Instruct` | whole |
| gpt-4o-2024-11-20 | 15.1% | 96.0% | `aider --model gpt-4o-2024-11-20` | diff |
| yi-lightning | 12.9% | 92.9% | `aider --model openai/yi-lightning` | whole |
| Codestral 25.01 | 11.1% | 100.0% | `aider --model mistral/codestral-latest` | whole |
| Qwen2.5-Coder-32B-Instruct | 8.0% | 71.6% | `aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct # via hyperbolic` | diff |
| gpt-4o-mini-2024-07-18 | 3.6% | 100.0% | `aider --model gpt-4o-mini-2024-07-18` | whole |
Aider polyglot benchmark results
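The "Edit format" column describes how each model returns its changes: with the whole format the model resends the entire updated file, while with the diff format it sends targeted search/replace blocks, which must match the file text exactly to be applied. The "Percent using correct edit format" column reflects how reliably the model produces edits that can be applied this way. A minimal sketch of the two approaches (not aider's actual implementation; function names here are illustrative):

```python
def apply_whole(original: str, reply: str) -> str:
    """'whole' format: the model's reply replaces the file contents."""
    return reply


def apply_diff(original: str, search: str, replace: str) -> str:
    """'diff' format: apply one search/replace block. The search
    text must appear verbatim in the file, or the edit fails."""
    if search not in original:
        raise ValueError("edit failed: search text not found in file")
    return original.replace(search, replace, 1)


src = "def add(a, b):\n    return a - b\n"
fixed = apply_diff(src, "return a - b", "return a + b")
print(fixed)
```

The diff format saves tokens on large files but demands precision: if the model misquotes the existing code, the edit cannot be applied, which is exactly what the format-compliance column measures.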
By Paul Gauthier, last updated January 13, 2025.