December 03, 2024
QwQ is a code architect, not an editor
QwQ 32B Preview is a “reasoning” model, which spends a lot of tokens thinking before rendering a final response. This is similar to OpenAI’s o1 models, which are most effective with aider when paired as an architect with a traditional LLM as an editor. In this mode, the reasoning model acts as an “architect” to propose a solution to the coding problem without regard for how to actually make edits to the source files. The “editor” model receives that proposal, and focuses solely on how to edit the existing source code to implement it.
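The two-stage flow described above can be sketched as a pair of model calls. This is a minimal illustration, not aider's actual implementation; `ask_architect` and `ask_editor` are hypothetical stand-ins for whatever chat-completion clients back the two models.

```python
def architect_editor(task, source, ask_architect, ask_editor):
    """Sketch of an architect/editor pipeline.

    ask_architect, ask_editor: callables that take a prompt string and
    return the model's text response (hypothetical stand-ins).
    """
    # Stage 1: the reasoning model proposes a solution in prose,
    # with no obligation to follow any editing format.
    proposal = ask_architect(
        f"Propose how to solve this coding task:\n{task}\n\nCode:\n{source}"
    )
    # Stage 2: the editor model turns the proposal into concrete file
    # contents (the "whole" edit format: emit the full updated file).
    return ask_editor(
        f"Apply this proposal and return the full updated file:\n"
        f"{proposal}\n\nCode:\n{source}"
    )
```

The key design point is the separation of concerns: the architect never has to emit well-formed edits, and the editor never has to reason about the problem from scratch.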
Used alone without being paired with an editor, QwQ was unable to comply with even the simplest editing format. It was not able to reliably edit source code files. As a result, QwQ’s solo score on the benchmark was quite underwhelming (and far worse than the o1 models performing solo).
QwQ is based on Qwen 2.5 Coder 32B Instruct, and does better when paired with it as an architect + editor combo. This provided only a modest benchmark improvement over using Qwen alone, though, and comes with a fairly high latency cost: each request must wait for QwQ to return all of its thinking text plus the final solution proposal, and then wait again for Qwen to turn that large response into actual file edits.
Pairing QwQ with other sensible editor models performed the same as or worse than using Qwen 2.5 Coder 32B Instruct alone.
QwQ+Qwen seems to be the best way to use QwQ, achieving a score of 74%. That is well below the SOTA results for this benchmark: Sonnet alone scores 84%, and o1-preview + o1-mini as architect + editor scores 85%.
QwQ specific editing formats
I spent some time experimenting with a variety of custom editing formats for QwQ. In particular, I tried to parse the QwQ response and discard the long sections of “thinking” and retain only the “final” solution. None of this custom work seemed to translate into any significant improvement in the benchmark results.
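One such experiment can be sketched as a simple post-processing step. QwQ 32B Preview does not wrap its reasoning in explicit tags, so any parser has to rely on heuristics; the "Final answer" marker below is an assumed convention for illustration, not something the model reliably emits.

```python
import re

def extract_final_answer(response: str) -> str:
    """Heuristically discard a reasoning model's long 'thinking'
    preamble and keep only the final solution.

    Looks for a hypothetical 'Final answer' / 'Final solution' marker;
    if none is found, the response is returned unchanged.
    """
    match = re.search(
        r"(?:final answer|final solution)\s*:?\s*",
        response,
        flags=re.IGNORECASE,
    )
    if match:
        # Keep only the text after the marker.
        return response[match.end():].strip()
    return response.strip()
```

In practice, as noted above, this kind of custom parsing did not translate into a significant benchmark improvement.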
Results
| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
|---|---|---|---|---|
| o1-preview | 79.7% | 93.2% | `aider --model o1-preview` | diff |
| QwQ + Qwen2.5 Coder 32B-I | 73.6% | 100.0% | `aider --model openrouter/qwen/qwq-32b-preview --editor-model openrouter/qwen/qwen-2.5-coder-32b-instruct --editor-edit-format editor-whole` | architect |
| Qwen2.5 Coder 32B-I | 71.4% | 94.7% | `aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1` (via GLHF) | diff |
| QwQ + Haiku | 71.4% | 100.0% | `aider --model openrouter/qwen/qwq-32b-preview --editor-model claude-3-5-haiku-20241022 --edit-format editor-whole` | architect |
| o1-mini | 70.7% | 90.0% | `aider --model o1-mini` | whole |
| QwQ + DeepSeek V2.5 | 67.7% | 100.0% | `aider --model openrouter/qwen/qwq-32b-preview --editor-model deepseek/deepseek-chat --edit-format editor-whole` | architect |
| QwQ | 42.1% | 91.0% | `aider --model openrouter/qwen/qwq-32b-preview` | whole |
Open source model caveats
As discussed in a recent blog post, details matter with open source models. For clarity, the new benchmark runs for this article were performed against OpenRouter's endpoints for QwQ 32B Preview and Qwen 2.5 Coder 32B Instruct. For the other models, the benchmark ran directly against their providers' APIs.
After recent extensive testing, OpenRouter's Qwen 2.5 Coder 32B Instruct endpoint seems reliable. The provider Mancer was blocked due to the small context window it provides.
For QwQ 32B Preview, Fireworks was blocked because of its small context window.