Refactoring leaderboard

Aider’s refactoring benchmark asks the LLM to refactor 89 large methods from large Python classes. This is a more challenging benchmark that tests the model’s ability to output long chunks of code without skipping sections or making mistakes. It was developed to provoke and measure GPT-4 Turbo’s “lazy coding” habit.

The refactoring benchmark requires a large context window to hold the big source files, so results are available for fewer models.

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
|-------|----------------------------:|----------------------------------:|---------|-------------|
| claude-3-5-sonnet-20241022 | 92.1% | 91.0% | aider --sonnet | diff |
| o1-preview | 75.3% | 57.3% | aider --model o1-preview | diff |
| claude-3-opus-20240229 | 72.3% | 79.5% | aider --opus | diff |
| claude-3.5-sonnet-20240620 | 64.0% | 76.4% | aider --sonnet | diff |
| gpt-4o | 62.9% | 53.9% | aider | diff |
| gpt-4-1106-preview | 50.6% | 39.3% | aider --model gpt-4-1106-preview | udiff |
| gpt-4o-2024-08-06 | 49.4% | 89.9% | aider --model openai/gpt-4o-2024-08-06 | diff |
| gemini/gemini-1.5-pro-latest | 49.4% | 7.9% | aider --model gemini/gemini-1.5-pro-latest | diff-fenced |
| o1-mini | 44.9% | 29.2% | aider --model o1-mini | diff |
| gpt-4-turbo-2024-04-09 (udiff) | 34.1% | 30.7% | aider --gpt-4-turbo | udiff |
| gpt-4-0125-preview | 33.7% | 47.2% | aider --model gpt-4-0125-preview | udiff |
| DeepSeek Coder V2 0724 (deprecated) | 32.6% | 59.6% | aider --model deepseek/deepseek-coder | diff |
| DeepSeek Chat V2.5 | 31.5% | 67.4% | aider --deepseek | diff |
| gpt-4-turbo-2024-04-09 (diff) | 21.4% | 6.8% | aider --model gpt-4-turbo-2024-04-09 | diff |
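
As a rough sketch of how a row in the table maps onto an aider invocation, the commands below combine the launch command from the Command column with aider's `--edit-format` option to pin the edit format from the Edit format column. This assumes the relevant API keys (e.g. `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`) are already configured in your environment.

```bash
# Claude 3.5 Sonnet with the diff edit format, matching the top row above
aider --sonnet --edit-format diff

# gpt-4-1106-preview with the udiff edit format, matching its row above
aider --model gpt-4-1106-preview --edit-format udiff
```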