The January GPT-4 Turbo is lazier than the last version
OpenAI just released a new version of GPT-4 Turbo.
This new model is intended to reduce the “laziness” that has been widely observed with the previous gpt-4-1106-preview model:
Today, we are releasing an updated GPT-4 Turbo preview model, gpt-4-0125-preview. This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task.
With that in mind, I’ve been benchmarking the new model using aider’s existing lazy coding benchmark.
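As a point of reference, here is a minimal sketch of how a coding prompt can be sent to the new model with the official openai Python client. This is not aider’s actual benchmark harness, and the prompt text is illustrative only:

```python
# Minimal sketch: send one coding task to the new preview model.
# Not aider's benchmark harness; the prompt is illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-0125-preview",  # the new January preview model
    temperature=0,  # reduce variance between benchmark runs
    messages=[
        {
            "role": "system",
            "content": "You are an expert programmer. Complete the task "
                       "fully; do not leave TODOs or stub comments.",
        },
        {"role": "user", "content": "Write a function that ..."},
    ],
)
print(response.choices[0].message.content)
```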
Benchmark results
Overall, the new gpt-4-0125-preview model seems lazier than the November gpt-4-1106-preview model:
- It gets worse benchmark scores when using the unified diffs code editing format.
- Using aider’s older SEARCH/REPLACE block editing format, the new January model outperforms the older November model. But it still performs worse than both models using unified diffs. (Both editing formats are illustrated after this list.)
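For readers unfamiliar with the two formats, here is the same illustrative edit expressed both ways (the hello.py file and greet function are made up for this example). The first uses unified diffs, the conventional patch syntax; the second uses aider’s SEARCH/REPLACE blocks, which quote the exact lines to find and the lines to put in their place:

```diff
--- a/hello.py
+++ b/hello.py
@@ ... @@
-def greet():
-    print("hello")
+def greet(name):
+    print(f"hello, {name}")
```

```
hello.py
<<<<<<< SEARCH
def greet():
    print("hello")
=======
def greet(name):
    print(f"hello, {name}")
>>>>>>> REPLACE
```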
Related reports
This is one in a series of reports that use the aider benchmarking suite to assess and compare the code editing capabilities of OpenAI’s GPT models. You can review the other reports for additional information:
- GPT code editing benchmarks evaluates the March and June versions of GPT-3.5 and GPT-4.
- Code editing benchmarks for OpenAI’s “1106” models.
- Aider’s lazy coding benchmark.