Aider LLM Leaderboards

Aider works best with LLMs which are good at editing code, not just good at writing code. To evaluate an LLM’s editing skill, aider uses a pair of benchmarks that assess a model’s ability to consistently follow the system prompt to successfully edit code.

The leaderboards below report the results from a number of popular LLMs. While aider can connect to almost any LLM, it works best with models that score well on the benchmarks.

See the following sections for benchmark results and additional information:

Code editing leaderboard

Aider’s code editing benchmark asks the LLM to edit python source files to complete 133 small coding exercises from Exercism. This measures the LLM’s coding ability, and whether it can write new code that integrates into existing code. The model also has to successfully apply all its changes to the source file without human intervention.

Model Percent completed correctly Percent using correct edit format Command Edit format
claude-3-5-sonnet-20241022 84.2% 99.2% aider --model anthropic/claude-3-5-sonnet-20241022 diff
o1-preview 79.7% 93.2% aider --model o1-preview diff
claude-3.5-sonnet-20240620 77.4% 99.2% aider --model claude-3.5-sonnet-20240620 diff
claude-3-5-haiku-20241022 75.2% 95.5% aider --model anthropic/claude-3-5-haiku-20241022 diff
DeepSeek Coder V2 0724 (deprecated) 72.9% 97.7% aider --model deepseek/deepseek-coder diff
gpt-4o-2024-05-13 72.9% 96.2% aider diff
openai/chatgpt-4o-latest 72.2% 97.0% aider --model openai/chatgpt-4o-latest diff
DeepSeek V2.5 72.2% 96.2% aider --deepseek diff
gpt-4o-2024-11-20 71.4% 99.2% aider --model openai/gpt-4o-2024-11-20 diff
Qwen2.5-Coder-32B-Instruct 71.4% 94.7% aider --model openai/Qwen2.5-Coder-32B-Instruct diff
gpt-4o-2024-08-06 71.4% 98.5% aider --model openai/gpt-4o-2024-08-06 diff
o1-mini (whole) 70.7% 90.0% aider --model o1-mini whole
DeepSeek Chat V2 0628 (deprecated) 69.9% 97.7% aider --model deepseek/deepseek-chat diff
Qwen2.5-Coder-14B-Instruct 69.2% 100.0% aider --model openai/Qwen2.5-Coder-14B-Instruct whole
claude-3-opus-20240229 68.4% 100.0% aider --opus diff
gpt-4-0613 67.7% 100.0% aider -4 diff
Dracarys2-72B-Instruct 66.9% 100.0% (via glhf.chat) whole
gemini-1.5-pro-exp-0827 66.9% 94.7% aider --model gemini/gemini-1.5-pro-exp-0827 diff-fenced
llama-3.1-405b-instruct (whole) 66.2% 100.0% aider --model openrouter/meta-llama/llama-3.1-405b-instruct whole
gpt-4-0314 66.2% 93.2% aider --model gpt-4-0314 diff
gpt-4-0125-preview 66.2% 97.7% aider --model gpt-4-0125-preview udiff
openrouter/qwen/qwen-2.5-coder-32b-instruct 65.4% 84.2% aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct diff
Mistral Large (2411) 65.4% 96.2% aider --model mistral/mistral-large-latest diff
gemini-1.5-pro-002 65.4% 96.2% aider --model gemini/gemini-1.5-pro-002 diff-fenced
qwen-2.5-72b-instruct (bf16) 65.4% 96.2% aider --model openrouter/qwen/qwen-2.5-72b-instruct diff
gpt-4-1106-preview 65.4% 92.5% aider --model gpt-4-1106-preview udiff
ollama/Qwen2.5.1-Coder-7B-Instruct-GGUF:Q8_0-32k 63.9% 100.0% aider --model ollama/Qwen2.5.1-Coder-7B-Instruct-GGUF:Q8_0-32k whole
nousresearch/hermes-3-llama-3.1-405b 63.9% 100.0% aider --model openrouter/nousresearch/hermes-3-llama-3.1-405b whole
llama-3.1-405b-instruct (diff) 63.9% 92.5% aider --model openrouter/meta-llama/llama-3.1-405b-instruct diff
gpt-4-turbo-2024-04-09 (udiff) 63.9% 97.0% aider --gpt-4-turbo udiff
o1-mini 61.1% 100.0% aider --model o1-mini diff
Mistral Large 2 (2407) 60.2% 100.0% aider --model mistral/mistral-large-2407 whole
ollama/qwen2.5:32b-instruct-q8_0 58.6% 100.0% aider --model ollama/qwen2.5:32b-instruct-q8_0 whole
Grok-2 58.6% 98.5% aider --model openrouter/x-ai/grok-2 whole
llama-3.1-70b-instruct 58.6% 100.0% aider --model fireworks_ai/accounts/fireworks/models/llama-v3p1-70b-instruct whole
Qwen2.5-Coder-7B-Instruct 57.9% 100.0% aider --model openai/Qwen2.5-Coder-7B-Instruct whole
gpt-3.5-turbo-0301 57.9% 100.0% aider --model gpt-3.5-turbo-0301 whole
gpt-4-turbo-2024-04-09 (diff) 57.6% 100.0% aider --model gpt-4-turbo-2024-04-09 diff
gemini-1.5-pro-001 57.1% 87.2% aider --model gemini/gemini-1.5-pro-latest diff-fenced
gpt-3.5-turbo-1106 56.1% 100.0% aider --model gpt-3.5-turbo-1106 whole
gpt-4o-mini 55.6% 100.0% aider --model gpt-4o-mini whole
Qwen2 72B Instruct 55.6% 100.0% aider --model together_ai/qwen/Qwen2-72B-Instruct whole
Llama-3.1-Nemotron-70B-Instruct-HF 54.9% 99.2% (via glhf.chat) whole
Grok-2-mini 54.9% 100.0% aider --model openrouter/x-ai/grok-2-mini whole
claude-3-sonnet-20240229 54.9% 100.0% aider --sonnet whole
ollama/qwen2.5:32b 54.1% 100.0% aider --model ollama/qwen2.5:32b whole
Yi Coder 9B Chat 54.1% 100.0% aider --model openai/hf:01-ai/Yi-Coder-9B-Chat --openai-api-base https://glhf.chat/api/openai/v1 whole
gemini-1.5-flash-exp-0827 52.6% 100.0% aider --model gemini/gemini-1.5-flash-exp-0827 whole
qwen2.5-coder:7b-instruct-q8_0 51.9% 100.0% aider --model ollama/qwen2.5-coder:7b-instruct-q8_0 whole
gemini-1.5-flash-002 51.1% 100.0% aider --model gemini/gemini-1.5-flash-002 whole
codestral-2405 51.1% 100.0% aider --model mistral/codestral-2405 whole
gpt-3.5-turbo-0613 50.4% 100.0% aider --model gpt-3.5-turbo-0613 whole
gpt-3.5-turbo-0125 50.4% 100.0% aider -3 whole
qwen2:72b-instruct-q8_0 49.6% 100.0% aider --model ollama/qwen2:72b-instruct-q8_0 whole
llama3-70b-8192 49.2% 73.5% aider --model groq/llama3-70b-8192 diff
Codestral-22B-v0.1-Q4_K_M 48.1% 100.0% aider --model Codestral-22B-v0.1-Q4_K_M whole
codestral:22b-v0.1-q8_0 48.1% 100.0% aider --model ollama/codestral:22b-v0.1-q8_0 whole
claude-3-haiku-20240307 47.4% 100.0% aider --model claude-3-haiku-20240307 whole
ollama/codestral 45.9% 98.5% aider --model ollama/codestral whole
yi-coder:9b-chat-q4_0 45.1% 100.0% aider --model ollama/yi-coder:9b-chat-q4_0 whole
gemini-1.5-flash-latest 44.4% 100.0% aider --model gemini/gemini-1.5-flash-latest whole
WizardLM-2 8x22B 44.4% 100.0% aider --model openrouter/microsoft/wizardlm-2-8x22b whole
ollama/yi-coder:9b-chat-fp16 43.6% 99.2% aider --model ollama/yi-coder:9b-chat-fp16 whole
Reflection-70B 42.1% 100.0% (not currently supported) whole
Qwen2.5-Coder-3B-Instruct 39.1% 100.0% aider --model openai/Qwen2.5-Coder-3B-Instruct whole
ollama/mistral-small 38.3% 99.2% aider --model ollama/mistral-small whole
gemini-1.5-flash-8b-exp-0924 38.3% 100.0% aider --model gemini/gemini-1.5-flash-8b-exp-0924 whole
Command R (08-24) 38.3% 100.0% aider --model command-r-08-2024 whole
Command R+ (08-24) 38.3% 100.0% aider --model command-r-plus-08-2024 whole
gemini-1.5-flash-8b-exp-0827 38.3% 100.0% aider --model gemini/gemini-1.5-flash-8b-exp-0827 whole
llama-3.1-8b-instruct 37.6% 100.0% aider --model fireworks_ai/accounts/fireworks/models/llama-v3p1-8b-instruct whole
qwen1.5-110b-chat 37.6% 100.0% aider --model together_ai/qwen/qwen1.5-110b-chat whole
gemma2:27b-instruct-q8_0 36.1% 100.0% aider --model ollama/gemma2:27b-instruct-q8_0 whole
codeqwen:7b-chat-v1.5-q8_0 34.6% 100.0% aider --model ollama/codeqwen:7b-chat-v1.5-q8_0 whole
ollama/mistral-nemo:12b-instruct-2407-q4_K_M 33.1% 100.0% aider --model ollama/mistral-nemo:12b-instruct-2407-q4_K_M whole
ollama/codegeex4 32.3% 97.0% aider --model ollama/codegeex4 whole
Qwen2.5-Coder-1.5B-Instruct 31.6% 100.0% aider --model openai/Qwen2.5-Coder-1.5B-Instruct whole
command-r-plus 31.6% 100.0% aider --model command-r-plus whole
ollama/hermes3:8b-llama3.1-fp16 30.1% 98.5% aider --model ollama/hermes3:8b-llama3.1-fp16 whole
ollama/wojtek/opencodeinterpreter:6.7b 30.1% 91.0% aider --model ollama/wojtek/opencodeinterpreter:6.7b whole
ollama/llama3.2:3b-instruct-fp16 26.3% 97.0% aider --model ollama/llama3.2:3b-instruct-fp16 whole
ollama/hermes3 22.6% 98.5% aider --model ollama/hermes3 whole
Qwen2.5-Coder-0.5B-Instruct 14.3% 100.0% aider --model openai/Qwen2.5-Coder-0.5B-Instruct whole

Code refactoring leaderboard

Aider’s refactoring benchmark asks the LLM to refactor 89 large methods from large python classes. This is a more challenging benchmark, which tests the model’s ability to output long chunks of code without skipping sections or making mistakes. It was developed to provoke and measure GPT-4 Turbo’s “lazy coding” habit.

The refactoring benchmark requires a large context window to work with large source files. Therefore, results are available for fewer models.

Model Percent completed correctly Percent using correct edit format Command Edit format
claude-3-5-sonnet-20241022 92.1% 91.0% aider --sonnet diff
o1-preview 75.3% 57.3% aider --model o1-preview diff
claude-3-opus-20240229 72.3% 79.5% aider --opus diff
claude-3.5-sonnet-20240620 64.0% 76.4% aider --sonnet diff
gpt-4o 62.9% 53.9% aider diff
gpt-4-1106-preview 50.6% 39.3% aider --model gpt-4-1106-preview udiff
gpt-4o-2024-08-06 49.4% 89.9% aider --model openai/gpt-4o-2024-08-06 diff
gemini/gemini-1.5-pro-latest 49.4% 7.9% aider --model gemini/gemini-1.5-pro-latest diff-fenced
o1-mini 44.9% 29.2% aider --model o1-mini diff
gpt-4-turbo-2024-04-09 (udiff) 34.1% 30.7% aider --gpt-4-turbo udiff
gpt-4-0125-preview 33.7% 47.2% aider --model gpt-4-0125-preview udiff
DeepSeek Coder V2 0724 (deprecated) 32.6% 59.6% aider --model deepseek/deepseek-coder diff
DeepSeek Chat V2.5 31.5% 67.4% aider --deepseek diff
gpt-4-turbo-2024-04-09 (diff) 21.4% 6.8% aider --model gpt-4-turbo-2024-04-09 diff

LLM code editing skill by model release date

connecting to many LLMs

Notes on benchmarking results

The key benchmarking results are:

  • Percent completed correctly - Measures what percentage of the coding tasks that the LLM completed successfully. To complete a task, the LLM must solve the programming assignment and edit the code to implement that solution.
  • Percent using correct edit format - Measures the percent of coding tasks where the LLM complied with the edit format specified in the system prompt. If the LLM makes edit mistakes, aider will give it feedback and ask for a fixed copy of the edit. The best models can reliably conform to the edit format, without making errors.

Notes on the edit format

Aider uses different “edit formats” to collect code edits from different LLMs. The “whole” format is the easiest for an LLM to use, but it uses a lot of tokens and may limit how large a file can be edited. Models which can use one of the diff formats are much more efficient, using far fewer tokens. Models that use a diff-like format are able to edit larger files with less cost and without hitting token limits.

Aider is configured to use the best edit format for the popular OpenAI and Anthropic models and the other models recommended on the LLM page. For lesser known models aider will default to using the “whole” editing format since it is the easiest format for an LLM to use.

Contributing benchmark results

Contributions of benchmark results are welcome! See the benchmark README for information on running aider’s code editing benchmarks. Submit results by opening a PR with edits to the benchmark results data files.