Code editing leaderboard
This old aider code editing leaderboard has been replaced by the new, much more challenging polyglot leaderboard.
Aider’s code editing benchmark asks the LLM to edit python source files to complete 133 small coding exercises from Exercism. This measures the LLM’s coding ability, and whether it can write new code that integrates into existing code. The model also has to successfully apply all its changes to the source file without human intervention.
Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
---|---|---|---|---|
o1 | 84.2% | 99.2% | aider --model openrouter/openai/o1 |
diff |
claude-3-5-sonnet-20241022 | 84.2% | 99.2% | aider --model anthropic/claude-3-5-sonnet-20241022 |
diff |
gemini-exp-1206 (whole) | 80.5% | 100.0% | aider --model gemini/gemini-exp-1206 |
whole |
o1-preview | 79.7% | 93.2% | aider --model o1-preview |
diff |
claude-3.5-sonnet-20240620 | 77.4% | 99.2% | aider --model claude-3.5-sonnet-20240620 |
diff |
claude-3-5-haiku-20241022 | 75.2% | 95.5% | aider --model anthropic/claude-3-5-haiku-20241022 |
diff |
ollama/qwen2.5-coder:32b | 72.9% | 100.0% | aider --model ollama/qwen2.5-coder:32b |
whole |
DeepSeek Coder V2 0724 | 72.9% | 97.7% | aider --model deepseek/deepseek-coder |
diff |
gpt-4o-2024-05-13 | 72.9% | 96.2% | aider |
diff |
DeepSeek-V2.5-1210 | 72.2% | 99.2% | aider --model deepseek/deepseek-chat |
diff |
openai/chatgpt-4o-latest | 72.2% | 97.0% | aider --model openai/chatgpt-4o-latest |
diff |
DeepSeek V2.5 | 72.2% | 96.2% | aider --deepseek |
diff |
gpt-4o-2024-11-20 | 71.4% | 99.2% | aider --model openai/gpt-4o-2024-11-20 |
diff |
Qwen2.5-Coder-32B-Instruct | 71.4% | 94.7% | aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1 |
diff |
gpt-4o-2024-08-06 | 71.4% | 98.5% | aider --model openai/gpt-4o-2024-08-06 |
diff |
o1-mini (whole) | 70.7% | 90.0% | aider --model o1-mini |
whole |
gemini-2.0-flash-exp | 69.9% | 97.0% | aider --model gemini/gemini-2.0-flash-exp |
diff |
DeepSeek Chat V2 0628 | 69.9% | 97.7% | aider --model deepseek/deepseek-chat |
diff |
gemini-exp-1206 (diff) | 69.2% | 84.2% | aider --model gemini/gemini-exp-1206 |
diff |
Qwen2.5-Coder-14B-Instruct | 69.2% | 100.0% | aider --model openai/Qwen2.5-Coder-14B-Instruct |
whole |
claude-3-opus-20240229 | 68.4% | 100.0% | aider --opus |
diff |
gpt-4-0613 | 67.7% | 100.0% | aider -4 |
diff |
Dracarys2-72B-Instruct | 66.9% | 100.0% | (via glhf.chat) |
whole |
gemini-1.5-pro-exp-0827 | 66.9% | 94.7% | aider --model gemini/gemini-1.5-pro-exp-0827 |
diff-fenced |
llama-3.1-405b-instruct (whole) | 66.2% | 100.0% | aider --model openrouter/meta-llama/llama-3.1-405b-instruct |
whole |
gpt-4-0314 | 66.2% | 93.2% | aider --model gpt-4-0314 |
diff |
gpt-4-0125-preview | 66.2% | 97.7% | aider --model gpt-4-0125-preview |
udiff |
yi-lightning | 65.4% | 97.0% | aider --model openai/yi-lightning |
whole |
openrouter/qwen/qwen-2.5-coder-32b-instruct | 65.4% | 84.2% | aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct |
diff |
Mistral Large (2411) | 65.4% | 96.2% | aider --model mistral/mistral-large-latest |
diff |
gemini-1.5-pro-002 | 65.4% | 96.2% | aider --model gemini/gemini-1.5-pro-002 |
diff-fenced |
qwen-2.5-72b-instruct (bf16) | 65.4% | 96.2% | aider --model openrouter/qwen/qwen-2.5-72b-instruct |
diff |
gpt-4-1106-preview | 65.4% | 92.5% | aider --model gpt-4-1106-preview |
udiff |
ollama/Qwen2.5.1-Coder-7B-Instruct-GGUF:Q8_0-32k | 63.9% | 100.0% | aider --model ollama/Qwen2.5.1-Coder-7B-Instruct-GGUF:Q8_0-32k |
whole |
nousresearch/hermes-3-llama-3.1-405b | 63.9% | 100.0% | aider --model openrouter/nousresearch/hermes-3-llama-3.1-405b |
whole |
llama-3.1-405b-instruct (diff) | 63.9% | 92.5% | aider --model openrouter/meta-llama/llama-3.1-405b-instruct |
diff |
gpt-4-turbo-2024-04-09 (udiff) | 63.9% | 97.0% | aider --gpt-4-turbo |
udiff |
ollama/qwen2.5-coder:14b | 61.7% | 98.5% | aider --model ollama/qwen2.5-coder:14b |
whole |
o1-mini | 61.1% | 100.0% | aider --model o1-mini |
diff |
gemini-exp-1114 | 60.9% | 85.7% | aider --model gemini/gemini-exp-1114 |
diff |
Mistral Large 2 (2407) | 60.2% | 100.0% | aider --model mistral/mistral-large-2407 |
whole |
llama-3.3-70b-instruct | 59.4% | 88.7% | aider --model openrouter/meta-llama/llama-3.3-70b-instruct |
diff |
ollama/qwen2.5:32b-instruct-q8_0 | 58.6% | 100.0% | aider --model ollama/qwen2.5:32b-instruct-q8_0 |
whole |
Grok-2 | 58.6% | 98.5% | aider --model openrouter/x-ai/grok-2 |
whole |
llama-3.1-70b-instruct | 58.6% | 100.0% | aider --model fireworks_ai/accounts/fireworks/models/llama-v3p1-70b-instruct |
whole |
gemini-exp-1121 | 57.9% | 83.5% | aider --model gemini/gemini-exp-1121 |
diff |
Qwen2.5-Coder-7B-Instruct | 57.9% | 100.0% | aider --model openai/Qwen2.5-Coder-7B-Instruct |
whole |
gpt-3.5-turbo-0301 | 57.9% | 100.0% | aider --model gpt-3.5-turbo-0301 |
whole |
gpt-4-turbo-2024-04-09 (diff) | 57.6% | 100.0% | aider --model gpt-4-turbo-2024-04-09 |
diff |
gemini-1.5-pro-001 | 57.1% | 87.2% | aider --model gemini/gemini-1.5-pro-latest |
diff-fenced |
gpt-3.5-turbo-1106 | 56.1% | 100.0% | aider --model gpt-3.5-turbo-1106 |
whole |
gpt-4o-mini | 55.6% | 100.0% | aider --model gpt-4o-mini |
whole |
Qwen2 72B Instruct | 55.6% | 100.0% | aider --model together_ai/qwen/Qwen2-72B-Instruct |
whole |
Llama-3.1-Nemotron-70B-Instruct-HF | 54.9% | 99.2% | (via glhf.chat) |
whole |
Grok-2-mini | 54.9% | 100.0% | aider --model openrouter/x-ai/grok-2-mini |
whole |
claude-3-sonnet-20240229 | 54.9% | 100.0% | aider --sonnet |
whole |
Nova Pro | 54.1% | 100.0% | aider --model bedrock/us.amazon.nova-pro-v1:0 |
whole |
ollama/qwen2.5:32b | 54.1% | 100.0% | aider --model ollama/qwen2.5:32b |
whole |
Yi Coder 9B Chat | 54.1% | 100.0% | aider --model openai/hf:01-ai/Yi-Coder-9B-Chat --openai-api-base https://glhf.chat/api/openai/v1 |
whole |
gemini-1.5-flash-exp-0827 | 52.6% | 100.0% | aider --model gemini/gemini-1.5-flash-exp-0827 |
whole |
qwen2.5-coder:7b-instruct-q8_0 | 51.9% | 100.0% | aider --model ollama/qwen2.5-coder:7b-instruct-q8_0 |
whole |
gemini-1.5-flash-002 (0924) | 51.1% | 100.0% | aider --model gemini/gemini-1.5-flash-002 |
whole |
codestral-2405 | 51.1% | 100.0% | aider --model mistral/codestral-2405 |
whole |
gpt-3.5-turbo-0613 | 50.4% | 100.0% | aider --model gpt-3.5-turbo-0613 |
whole |
gpt-3.5-turbo-0125 | 50.4% | 100.0% | aider -3 |
whole |
qwen2:72b-instruct-q8_0 | 49.6% | 100.0% | aider --model ollama/qwen2:72b-instruct-q8_0 |
whole |
llama3-70b-8192 | 49.2% | 73.5% | aider --model groq/llama3-70b-8192 |
diff |
Codestral-22B-v0.1-Q4_K_M | 48.1% | 100.0% | aider --model Codestral-22B-v0.1-Q4_K_M |
whole |
codestral:22b-v0.1-q8_0 | 48.1% | 100.0% | aider --model ollama/codestral:22b-v0.1-q8_0 |
whole |
claude-3-haiku-20240307 | 47.4% | 100.0% | aider --model claude-3-haiku-20240307 |
whole |
ollama/codestral | 45.9% | 98.5% | aider --model ollama/codestral |
whole |
yi-coder:9b-chat-q4_0 | 45.1% | 100.0% | aider --model ollama/yi-coder:9b-chat-q4_0 |
whole |
gemini-1.5-flash-latest | 44.4% | 100.0% | aider --model gemini/gemini-1.5-flash-latest |
whole |
WizardLM-2 8x22B | 44.4% | 100.0% | aider --model openrouter/microsoft/wizardlm-2-8x22b |
whole |
ollama/yi-coder:9b-chat-fp16 | 43.6% | 99.2% | aider --model ollama/yi-coder:9b-chat-fp16 |
whole |
Reflection-70B | 42.1% | 100.0% | (not currently supported) |
whole |
Qwen2.5-Coder-3B-Instruct | 39.1% | 100.0% | aider --model openai/Qwen2.5-Coder-3B-Instruct |
whole |
ollama/mistral-small | 38.3% | 99.2% | aider --model ollama/mistral-small |
whole |
gemini-1.5-flash-8b-exp-0924 | 38.3% | 100.0% | aider --model gemini/gemini-1.5-flash-8b-exp-0924 |
whole |
Command R (08-24) | 38.3% | 100.0% | aider --model command-r-08-2024 |
whole |
Command R+ (08-24) | 38.3% | 100.0% | aider --model command-r-plus-08-2024 |
whole |
gemini-1.5-flash-8b-exp-0827 | 38.3% | 100.0% | aider --model gemini/gemini-1.5-flash-8b-exp-0827 |
whole |
llama-3.1-8b-instruct | 37.6% | 100.0% | aider --model fireworks_ai/accounts/fireworks/models/llama-v3p1-8b-instruct |
whole |
qwen1.5-110b-chat | 37.6% | 100.0% | aider --model together_ai/qwen/qwen1.5-110b-chat |
whole |
gemma2:27b-instruct-q8_0 | 36.1% | 100.0% | aider --model ollama/gemma2:27b-instruct-q8_0 |
whole |
codeqwen:7b-chat-v1.5-q8_0 | 34.6% | 100.0% | aider --model ollama/codeqwen:7b-chat-v1.5-q8_0 |
whole |
ollama/mistral-nemo:12b-instruct-2407-q4_K_M | 33.1% | 100.0% | aider --model ollama/mistral-nemo:12b-instruct-2407-q4_K_M |
whole |
ollama/codegeex4 | 32.3% | 97.0% | aider --model ollama/codegeex4 |
whole |
Qwen2.5-Coder-1.5B-Instruct | 31.6% | 100.0% | aider --model openai/Qwen2.5-Coder-1.5B-Instruct |
whole |
command-r-plus | 31.6% | 100.0% | aider --model command-r-plus |
whole |
ollama/hermes3:8b-llama3.1-fp16 | 30.1% | 98.5% | aider --model ollama/hermes3:8b-llama3.1-fp16 |
whole |
ollama/wojtek/opencodeinterpreter:6.7b | 30.1% | 91.0% | aider --model ollama/wojtek/opencodeinterpreter:6.7b |
whole |
o1-mini-2024-09-12 | 27.1% | 95.6% | aider --model o1-mini |
whole |
ollama/tulu3 | 26.3% | 100.0% | aider --model ollama/tulu3 |
whole |
ollama/llama3.2:3b-instruct-fp16 | 26.3% | 97.0% | aider --model ollama/llama3.2:3b-instruct-fp16 |
whole |
ollama/hermes3 | 22.6% | 98.5% | aider --model ollama/hermes3 |
whole |
ollama/granite3-dense:8b | 20.3% | 78.9% | aider --model ollama/granite3-dense:8b |
whole |
Qwen2.5-Coder-0.5B-Instruct | 14.3% | 100.0% | aider --model openai/Qwen2.5-Coder-0.5B-Instruct |
whole |
Notes on benchmarking results
The key benchmarking results are:
- Percent completed correctly - Measures what percentage of the coding tasks that the LLM completed successfully. To complete a task, the LLM must solve the programming assignment and edit the code to implement that solution.
- Percent using correct edit format - Measures the percent of coding tasks where the LLM complied with the edit format specified in the system prompt. If the LLM makes edit mistakes, aider will give it feedback and ask for a fixed copy of the edit. The best models can reliably conform to the edit format, without making errors.
Notes on the edit format
Aider uses different “edit formats” to collect code edits from different LLMs. The “whole” format is the easiest for an LLM to use, but it uses a lot of tokens and may limit how large a file can be edited. Models which can use one of the diff formats are much more efficient, using far fewer tokens. Models that use a diff-like format are able to edit larger files with less cost and without hitting token limits.
Aider is configured to use the best edit format for the popular OpenAI and Anthropic models and the other models recommended on the LLM page. For lesser known models aider will default to using the “whole” editing format since it is the easiest format for an LLM to use.
Contributing benchmark results
Contributions of benchmark results are welcome! See the benchmark README for information on running aider’s code editing benchmarks. Submit results by opening a PR with edits to the benchmark results data files.
By Paul Gauthier, last updated December 16, 2024.