Qwen3 results on the aider polyglot benchmark
As previously discussed when Qwen2.5 was released ,
details matter when working with open source models for AI coding.
Proprietary models are served by their creators or trusted providers with stable inference settings.
Open source models are wonderful because anyone can serve them,
but API providers can use very different inference settings, quantizations, etc.
Below are collection of aider polyglot benchmark results for the new Qwen3 models.
Results are presented using both “diff” and “whole”
edit formats ,
with various models settings, against various API providers.
See details on the
model settings
used after the results table.
This article is being updated as new results become available.
Also, some results were submitted by aider users and have not been verified.
Qwen3 results on the aider polyglot benchmark
Model
Percent correct
Cost
Command
Correct edit format
Edit Format
▶
Qwen3-235B-A22B whole with VLLM, bfloat16, recommended /no_think settings
65.3%
aider --model openai/Qwen3-235B-A22B
100.0%
whole
Dirname
:
2025-04-30-04-49-37--Qwen3-235B-A22B-whole-nothink
Test cases
:
225
Model
:
Qwen3-235B-A22B whole with VLLM, bfloat16, recommended /no_think settings
Edit format
:
whole
Commit hash
:
0c383df-dirty
Pass rate 1
:
28.0
Pass rate 2
:
65.3
Pass num 1
:
63
Pass num 2
:
147
Percent cases well formed
:
100.0
Error outputs
:
3
Num malformed responses
:
0
Num with malformed responses
:
0
User asks
:
166
Lazy comments
:
0
Syntax errors
:
0
Indentation errors
:
0
Exhausted context windows
:
3
Test timeouts
:
0
Total tests
:
225
Command
:
aider --model openai/Qwen3-235B-A22B
Date
:
2025-04-30
Versions
:
0.81.4.dev
Seconds per case
:
166.0
Total cost
:
0.0
▶
Qwen3-235B-A22B diff with VLLM, bfloat16, recommended /no_think settings
61.3%
aider --model openai/Qwen3-235B-A22B
94.7%
diff
Dirname
:
2025-04-30-04-49-50--Qwen3-235B-A22B-diff-nothink
Test cases
:
225
Model
:
Qwen3-235B-A22B diff with VLLM, bfloat16, recommended /no_think settings
Edit format
:
diff
Commit hash
:
0c383df-dirty
Pass rate 1
:
29.8
Pass rate 2
:
61.3
Pass num 1
:
67
Pass num 2
:
138
Percent cases well formed
:
94.7
Error outputs
:
25
Num malformed responses
:
25
Num with malformed responses
:
12
User asks
:
97
Lazy comments
:
0
Syntax errors
:
0
Indentation errors
:
0
Exhausted context windows
:
0
Test timeouts
:
2
Total tests
:
225
Command
:
aider --model openai/Qwen3-235B-A22B
Date
:
2025-04-30
Versions
:
0.81.4.dev
Seconds per case
:
158.2
Total cost
:
0.0
▶
Qwen3-235B-A22B whole with llama.cpp, Q5_K_M (unsloth), recommended /no_think settings
59.1%
aider --model openai/Qwen3-235B-A22B-Q5_K_M
100.0%
whole
Dirname
:
2025-05-07-03-15-59--Qwen3-235B-A22B-Q5_K_M-whole-nothink
Test cases
:
225
Model
:
Qwen3-235B-A22B whole with llama.cpp, Q5_K_M (unsloth), recommended /no_think settings
Edit format
:
whole
Commit hash
:
8159cbf
Pass rate 1
:
27.1
Pass rate 2
:
59.1
Pass num 1
:
61
Pass num 2
:
133
Percent cases well formed
:
100.0
Error outputs
:
1
Num malformed responses
:
0
Num with malformed responses
:
0
User asks
:
169
Lazy comments
:
0
Syntax errors
:
0
Indentation errors
:
0
Exhausted context windows
:
0
Test timeouts
:
1
Total tests
:
225
Command
:
aider --model openai/Qwen3-235B-A22B-Q5_K_M
Date
:
2025-05-07
Versions
:
0.82.4.dev
Seconds per case
:
635.2
Total cost
:
0.0
▶
Qwen3 235B A22B diff on OpenRouter only TogetherAI, recommended /no_think settings
54.7%
$0.64
aider --model openrouter/qwen/qwen3-235b-a22b
90.7%
diff
Dirname
:
2025-05-08-17-39-14--qwen3-235b-or-together-only
Test cases
:
225
Model
:
Qwen3 235B A22B diff on OpenRouter only TogetherAI, recommended /no_think settings
Edit format
:
diff
Commit hash
:
328584e
Pass rate 1
:
28.0
Pass rate 2
:
54.7
Pass num 1
:
63
Pass num 2
:
123
Percent cases well formed
:
90.7
Error outputs
:
39
Num malformed responses
:
32
Num with malformed responses
:
21
User asks
:
106
Lazy comments
:
0
Syntax errors
:
0
Indentation errors
:
0
Exhausted context windows
:
0
Prompt tokens
:
2816606
Completion tokens
:
362346
Test timeouts
:
2
Total tests
:
225
Command
:
aider --model openrouter/qwen/qwen3-235b-a22b
Date
:
2025-05-08
Versions
:
0.82.4.dev
Seconds per case
:
77.2
Total cost
:
0.6399
▶
Qwen3 235B A22B diff on OpenRouter, all providers, default settings (thinking)
49.8%
$1.8
aider --model openrouter/qwen/qwen3-235b-a22b
91.6%
diff
Dirname
:
2025-05-08-03-22-37--qwen3-235b-defaults
Test cases
:
225
Model
:
Qwen3 235B A22B diff on OpenRouter, all providers, default settings (thinking)
Edit format
:
diff
Commit hash
:
aaacee5-dirty
Pass rate 1
:
17.3
Pass rate 2
:
49.8
Pass num 1
:
39
Pass num 2
:
112
Percent cases well formed
:
91.6
Error outputs
:
58
Num malformed responses
:
29
Num with malformed responses
:
19
User asks
:
102
Lazy comments
:
0
Syntax errors
:
0
Indentation errors
:
0
Exhausted context windows
:
0
Prompt tokens
:
0
Completion tokens
:
0
Test timeouts
:
1
Total tests
:
225
Command
:
aider --model openrouter/qwen/qwen3-235b-a22b
Date
:
2025-05-08
Versions
:
0.82.4.dev
Seconds per case
:
428.1
Total cost
:
1.8037
▶
Qwen3-32B whole with VLLM, bfloat16, recommended /no_think settings
45.8%
aider --model openai/Qwen3-32B
100.0%
whole
Dirname
:
2025-04-30-04-08-41--Qwen3-32B-whole-nothink
Test cases
:
225
Model
:
Qwen3-32B whole with VLLM, bfloat16, recommended /no_think settings
Edit format
:
whole
Commit hash
:
0c383df-dirty
Pass rate 1
:
20.4
Pass rate 2
:
45.8
Pass num 1
:
46
Pass num 2
:
103
Percent cases well formed
:
100.0
Error outputs
:
3
Num malformed responses
:
0
Num with malformed responses
:
0
User asks
:
94
Lazy comments
:
0
Syntax errors
:
0
Indentation errors
:
0
Exhausted context windows
:
3
Test timeouts
:
5
Total tests
:
225
Command
:
aider --model openai/Qwen3-32B
Date
:
2025-04-30
Versions
:
0.81.4.dev
Seconds per case
:
48.1
Total cost
:
0.0
▶
Qwen3-32B diff with VLLM, bfloat16, recommended /no_think settings
41.3%
aider --model openai/Qwen3-32B
94.2%
diff
Dirname
:
2025-04-30-04-08-51--Qwen3-32B-diff-nothink
Test cases
:
225
Model
:
Qwen3-32B diff with VLLM, bfloat16, recommended /no_think settings
Edit format
:
diff
Commit hash
:
0c383df-dirty
Pass rate 1
:
20.4
Pass rate 2
:
41.3
Pass num 1
:
46
Pass num 2
:
93
Percent cases well formed
:
94.2
Error outputs
:
17
Num malformed responses
:
14
Num with malformed responses
:
13
User asks
:
83
Lazy comments
:
0
Syntax errors
:
0
Indentation errors
:
0
Exhausted context windows
:
3
Test timeouts
:
4
Total tests
:
225
Command
:
aider --model openai/Qwen3-32B
Date
:
2025-04-30
Versions
:
0.81.4.dev
Seconds per case
:
59.4
Total cost
:
0.0
▶
Qwen3 32B diff on OpenRouter, all providers, default settings (thinking)
40.0%
$0.76
aider --model openrouter/qwen/qwen3-32b
83.6%
diff
Dirname
:
2025-05-08-03-20-24--qwen3-32b-default
Test cases
:
225
Model
:
Qwen3 32B diff on OpenRouter, all providers, default settings (thinking)
Edit format
:
diff
Commit hash
:
aaacee5-dirty, aeaf259
Pass rate 1
:
14.2
Pass rate 2
:
40.0
Pass num 1
:
32
Pass num 2
:
90
Percent cases well formed
:
83.6
Error outputs
:
119
Num malformed responses
:
50
Num with malformed responses
:
37
User asks
:
97
Lazy comments
:
0
Syntax errors
:
0
Indentation errors
:
0
Exhausted context windows
:
12
Prompt tokens
:
317591
Completion tokens
:
120418
Test timeouts
:
5
Total tests
:
225
Command
:
aider --model openrouter/qwen/qwen3-32b
Date
:
2025-05-08
Versions
:
0.82.4.dev
Seconds per case
:
372.2
Total cost
:
0.7603
OpenRouter only TogetherAI, recommended /no_think settings
These results were obtained with the
recommended
non-thinking model settings in .aider.model.settings.yml
:
- name : openrouter/qwen/qwen3-235b-a22b
system_prompt_prefix : " /no_think"
use_temperature : 0.7
extra_params :
max_tokens : 24000
top_p : 0.8
top_k : 20
min_p : 0.0
temperature : 0.7
extra_body :
provider :
order : [ " Together" ]
And then running aider:
aider --model openrouter/qwen/qwen3-235b-a22b
OpenRouter, all providers, default settings (thinking)
These results were obtained by simply running aider as shown below, without any model specific settings.
This should have enabled thinking, assuming upstream API providers honor that convention for Qwen3.
aider --model openrouter/qwen/qwen3-xxx
VLLM, bfloat16, recommended /no_think
These benchmarks results were obtained by GitHub user AlongWY
with the
recommended
non-thinking model settings in .aider.model.settings.yml
:
- name : openai/<model-name>
system_prompt_prefix : " /no_think"
use_temperature : 0.7
extra_params :
max_tokens : 24000
top_p : 0.8
top_k : 20
min_p : 0.0
temperature : 0.7
And then running aider:
aider --model openai/<model-name> --openai-api-base <url>