Qwen3 results on the aider polyglot benchmark

As previously discussed when Qwen2.5 was released, details matter when working with open source models for AI coding. Proprietary models are served by their creators or trusted providers with stable inference settings. Open source models are wonderful because anyone can serve them, but API providers can use very different inference settings, quantizations, etc.

Below is a collection of aider polyglot benchmark results for the new Qwen3 models. Results are presented using both “diff” and “whole” edit formats, with various model settings, against various API providers.

See details on the model settings used after the results table.

This article is being updated as new results become available. Also, some results were submitted by aider users and have not been verified.

| Model | Percent correct | Cost | Command | Correct edit format | Edit format |
|---|---|---|---|---|---|
| Qwen3-235B-A22B, whole, VLLM, bfloat16, recommended /no_think settings | 65.3% | | aider --model openai/Qwen3-235B-A22B | 100.0% | whole |
| Qwen3-235B-A22B, diff, VLLM, bfloat16, recommended /no_think settings | 61.3% | | aider --model openai/Qwen3-235B-A22B | 94.7% | diff |
| Qwen3-235B-A22B, whole, llama.cpp, Q5_K_M (unsloth), recommended /no_think settings | 59.1% | | aider --model openai/Qwen3-235B-A22B-Q5_K_M | 100.0% | whole |
| Qwen3-235B-A22B, diff, OpenRouter (TogetherAI only), recommended /no_think settings | 54.7% | $0.64 | aider --model openrouter/qwen/qwen3-235b-a22b | 90.7% | diff |
| Qwen3-235B-A22B, diff, OpenRouter (all providers), default settings (thinking) | 49.8% | $1.80 | aider --model openrouter/qwen/qwen3-235b-a22b | 91.6% | diff |
| Qwen3-32B, whole, VLLM, bfloat16, recommended /no_think settings | 45.8% | | aider --model openai/Qwen3-32B | 100.0% | whole |
| Qwen3-32B, diff, VLLM, bfloat16, recommended /no_think settings | 41.3% | | aider --model openai/Qwen3-32B | 94.2% | diff |
| Qwen3-32B, diff, OpenRouter (all providers), default settings (thinking) | 40.0% | $0.76 | aider --model openrouter/qwen/qwen3-32b | 83.6% | diff |

OpenRouter, only TogetherAI, recommended /no_think settings

These results were obtained with the recommended non-thinking model settings in .aider.model.settings.yml:

- name: openrouter/qwen/qwen3-235b-a22b
  system_prompt_prefix: "/no_think"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7
    extra_body:
      provider:
        order: ["Together"]

And then running aider:

aider --model openrouter/qwen/qwen3-235b-a22b
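To make those settings concrete, here is a rough sketch (in Python, with illustrative prompt text) of the OpenAI-compatible chat request that configuration produces: the /no_think soft switch is prefixed onto the system prompt, the sampling parameters ride along in the request body, and the extra_body provider block is OpenRouter's routing hint pinning the request to Together. The prompt strings and the exact joining of the prefix are assumptions for illustration, not aider's real internals.

```python
# Sketch of the OpenAI-compatible request the settings above produce.
# Prompt text is illustrative; aider's real prompts are much longer.

def build_request(system_prompt: str, user_msg: str) -> dict:
    # system_prompt_prefix: "/no_think" -- Qwen3's soft switch to
    # disable thinking (joined here with a space for illustration).
    prefixed = "/no_think " + system_prompt
    return {
        "model": "qwen/qwen3-235b-a22b",
        "messages": [
            {"role": "system", "content": prefixed},
            {"role": "user", "content": user_msg},
        ],
        # extra_params from .aider.model.settings.yml
        "max_tokens": 24000,
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "min_p": 0.0,
        # extra_body -> OpenRouter provider routing: Together only
        "extra_body": {"provider": {"order": ["Together"]}},
    }
```

Note that top_k and min_p are not part of the official OpenAI API; OpenRouter and most open-model servers accept them anyway, which is why they are passed via extra_params.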

OpenRouter, all providers, default settings (thinking)

These results were obtained by simply running aider as shown below, without any model-specific settings. This should have enabled thinking, assuming the upstream API providers honor that convention for Qwen3.

aider --model openrouter/qwen/qwen3-xxx
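In thinking mode, Qwen3 emits its reasoning inside &lt;think&gt;…&lt;/think&gt; tags before the final answer. Those reasoning tokens are billed as output, which is consistent with the thinking runs costing noticeably more in the table above. A client that only wants the final answer can strip the tags; here is a minimal sketch, assuming well-formed, non-nested tags arrive intact in the completion text:

```python
import re

# Remove Qwen3 <think>...</think> reasoning blocks from a completion,
# keeping only the final answer. Assumes well-formed, non-nested tags.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(completion: str) -> str:
    return THINK_RE.sub("", completion).strip()
```

For example, strip_thinking("&lt;think&gt;Check the diff…&lt;/think&gt;Here is the fix.") returns "Here is the fix."; text without tags passes through unchanged.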

VLLM and llama.cpp, recommended /no_think settings

These benchmark results were obtained by GitHub user AlongWY with the recommended non-thinking model settings in .aider.model.settings.yml:

- name: openai/<model-name>
  system_prompt_prefix: "/no_think"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7        

And then running aider:

aider --model openai/<model-name> --openai-api-base <url>
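The --openai-api-base URL just redirects the same OpenAI-style requests at a local server such as vLLM or llama.cpp's llama-server. As a sketch, this builds (but does not send) the POST aider would issue against a local endpoint; the base URL, model name, and prompt text are placeholders, and as above, top_k and min_p are non-standard parameters that these local servers accept in the request body:

```python
import json
import urllib.request

# Placeholder local endpoint; substitute the URL passed to
# --openai-api-base (vLLM defaults to port 8000).
BASE = "http://localhost:8000/v1"

payload = {
    "model": "Qwen3-32B",  # placeholder model name
    "messages": [
        {"role": "system", "content": "/no_think You are a coding assistant."},
        {"role": "user", "content": "Fix the failing test."},  # illustrative
    ],
    "max_tokens": 24000,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
}

# Construct the request object without sending it.
req = urllib.request.Request(
    BASE + "/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
```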