Sonnet seems as good as ever

Recently there has been a lot of speculation that Sonnet has been dumbed-down, nerfed or is otherwise performing worse. But Sonnet's performance on the aider code editing benchmark, run via the API, seems as good as ever.

Below is a graph showing the performance of Claude 3.5 Sonnet over time. It shows every clean, comparable benchmark run performed since Sonnet launched. Benchmarks were performed for various reasons, usually to evaluate the effects of small changes to aider’s system prompts.

The graph shows variance, but no indication of a noteworthy degradation. There is always some variance in benchmark results, typically +/- 2% between runs with identical prompts.

It’s worth noting that these benchmarks were run via the API, so they would not capture any changes made to how Sonnet is served through Anthropic’s web chat interface.

This graph shows the performance of Claude 3.5 Sonnet on aider’s code editing benchmark over time. ‘Pass Rate 1’ represents the initial success rate, while ‘Pass Rate 2’ shows the success rate after a second attempt with a chance to fix testing errors. The aider LLM code editing leaderboard ranks models based on Pass Rate 2.
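To make the relationship between the two metrics concrete, here is a minimal sketch of how the two pass rates could be computed from per-exercise results. The `ExerciseResult` record and `pass_rates` function are hypothetical illustrations, not aider's actual benchmark harness:

```python
from dataclasses import dataclass

@dataclass
class ExerciseResult:
    # Hypothetical per-exercise record: did the solution pass the
    # tests on the first try, and after one chance to fix errors?
    passed_first_try: bool
    passed_after_retry: bool

def pass_rates(results):
    """Return (pass_rate_1, pass_rate_2) as percentages."""
    n = len(results)
    rate1 = 100 * sum(r.passed_first_try for r in results) / n
    # An exercise that passed on the first try also counts toward
    # pass rate 2; otherwise the retry outcome decides.
    rate2 = 100 * sum(
        r.passed_first_try or r.passed_after_retry for r in results
    ) / n
    return rate1, rate2

results = [
    ExerciseResult(True, True),
    ExerciseResult(False, True),
    ExerciseResult(False, False),
    ExerciseResult(True, True),
]
print(pass_rates(results))  # (50.0, 75.0)
```

Pass Rate 2 is always at least as high as Pass Rate 1, since the second attempt can only recover failures, which is why the leaderboard ranks on it.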