Qwen3 results on the aider polyglot benchmark

As previously discussed when Qwen2.5 was released, details matter when working with open source models for AI coding. Proprietary models are served by their creators or trusted providers with stable inference settings. Open source models are wonderful because anyone can serve them, but API providers can use very different inference settings, quantizations, etc.

Below is a collection of aider polyglot benchmark results for the new Qwen3 models. Results are presented using both “diff” and “whole” edit formats, with various model settings, against various API providers.

See details on the model settings used after the results table.

This article is being updated as new results become available. Also, some results were submitted by aider users and have not been verified.

| Model | Percent correct | Cost | Command | Correct edit format | Edit format |
|---|---|---|---|---|---|
| Qwen3-235B-A22B, whole, VLLM, bfloat16, recommended /no_think settings | 65.3% | | aider --model openai/Qwen3-235B-A22B | 100.0% | whole |
| Qwen3-235B-A22B, diff, VLLM, bfloat16, recommended /no_think settings | 61.3% | | aider --model openai/Qwen3-235B-A22B | 94.7% | diff |
| Qwen3-235B-A22B, whole, llama.cpp, Q5_K_M (unsloth), recommended /no_think settings | 59.1% | | aider --model openai/Qwen3-235B-A22B-Q5_K_M | 100.0% | whole |
| Qwen3-235B-A22B, diff, OpenRouter (TogetherAI only), recommended /no_think settings | 54.7% | $0.64 | aider --model openrouter/qwen/qwen3-235b-a22b | 90.7% | diff |
| Qwen3-235B-A22B, diff, OpenRouter (all providers), default settings (thinking) | 49.8% | $1.80 | aider --model openrouter/qwen/qwen3-235b-a22b | 91.6% | diff |
| Qwen3-32B, whole, VLLM, bfloat16, recommended /no_think settings | 45.8% | | aider --model openai/Qwen3-32B | 100.0% | whole |
| Qwen3-32B, diff, VLLM, bfloat16, recommended /no_think settings | 41.3% | | aider --model openai/Qwen3-32B | 94.2% | diff |
| Qwen3-32B, diff, OpenRouter (all providers), default settings (thinking) | 40.0% | $0.76 | aider --model openrouter/qwen/qwen3-32b | 83.6% | diff |

OpenRouter, only TogetherAI, recommended /no_think settings

These results were obtained with the recommended non-thinking model settings in .aider.model.settings.yml:

- name: openrouter/qwen/qwen3-235b-a22b
  system_prompt_prefix: "/no_think"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7
    extra_body:
      provider:
        order: ["Together"]

And then running aider:

aider --model openrouter/qwen/qwen3-235b-a22b
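To make those settings concrete, here is a rough sketch (in Python, with illustrative prompt text) of the OpenAI-compatible chat request that configuration produces: the /no_think soft switch is prefixed onto the system prompt, the sampling parameters ride along in the request body, and the extra_body provider block is OpenRouter's routing hint pinning the request to Together. The prompt strings and the exact joining of the prefix are assumptions for illustration, not aider's real internals.

```python
# Sketch of the OpenAI-compatible request the settings above produce.
# Prompt text is illustrative; aider's real prompts are much longer.

def build_request(system_prompt: str, user_msg: str) -> dict:
    # system_prompt_prefix: "/no_think" -- Qwen3's soft switch to
    # disable thinking (joined here with a space for illustration).
    prefixed = "/no_think " + system_prompt
    return {
        "model": "qwen/qwen3-235b-a22b",
        "messages": [
            {"role": "system", "content": prefixed},
            {"role": "user", "content": user_msg},
        ],
        # extra_params from .aider.model.settings.yml
        "max_tokens": 24000,
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "min_p": 0.0,
        # extra_body -> OpenRouter provider routing: Together only
        "extra_body": {"provider": {"order": ["Together"]}},
    }
```

Note that top_k and min_p are not part of the official OpenAI API; OpenRouter and most open-model servers accept them anyway, which is why they are passed via extra_params.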

OpenRouter, all providers, default settings (thinking)

These results were obtained by simply running aider as shown below, without any model-specific settings. This should have enabled thinking, assuming the upstream API providers honor that convention for Qwen3.

aider --model openrouter/qwen/qwen3-xxx
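In thinking mode, Qwen3 emits its reasoning inside &lt;think&gt;…&lt;/think&gt; tags before the final answer. Those reasoning tokens are billed as output, which is consistent with the thinking runs costing noticeably more in the table above. A client that only wants the final answer can strip the tags; here is a minimal sketch, assuming well-formed, non-nested tags arrive intact in the completion text:

```python
import re

# Remove Qwen3 <think>...</think> reasoning blocks from a completion,
# keeping only the final answer. Assumes well-formed, non-nested tags.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(completion: str) -> str:
    return THINK_RE.sub("", completion).strip()
```

For example, strip_thinking("&lt;think&gt;Check the diff…&lt;/think&gt;Here is the fix.") returns "Here is the fix."; text without tags passes through unchanged.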

VLLM and llama.cpp, recommended /no_think settings

These benchmark results were obtained by GitHub user AlongWY with the recommended non-thinking model settings in .aider.model.settings.yml:

- name: openai/<model-name>
  system_prompt_prefix: "/no_think"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7        

And then running aider:

aider --model openai/<model-name> --openai-api-base <url>
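The --openai-api-base URL just redirects the same OpenAI-style requests at a local server such as vLLM or llama.cpp's llama-server. As a sketch, this builds (but does not send) the POST aider would issue against a local endpoint; the base URL, model name, and prompt text are placeholders, and as above, top_k and min_p are non-standard parameters that these local servers accept in the request body:

```python
import json
import urllib.request

# Placeholder local endpoint; substitute the URL passed to
# --openai-api-base (vLLM defaults to port 8000).
BASE = "http://localhost:8000/v1"

payload = {
    "model": "Qwen3-32B",  # placeholder model name
    "messages": [
        {"role": "system", "content": "/no_think You are a coding assistant."},
        {"role": "user", "content": "Fix the failing test."},  # illustrative
    ],
    "max_tokens": 24000,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
}

# Construct the request object without sending it.
req = urllib.request.Request(
    BASE + "/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
```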