The January GPT-4 Turbo is lazier than the last version
OpenAI just released a new version of GPT-4 Turbo.
This new model is intended to reduce the “laziness” that has been widely observed with the previous gpt-4-1106-preview model:
Today, we are releasing an updated GPT-4 Turbo preview model, gpt-4-0125-preview. This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task.
With that in mind, I’ve been benchmarking the new model using aider’s existing lazy coding benchmark.
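As a point of reference, here is a minimal sketch of how a coding prompt can be sent to the new model with the official openai Python client. This is not aider’s actual benchmark harness, and the prompt text is illustrative only:

```python
# Minimal sketch: send one coding task to the new preview model.
# Not aider's benchmark harness; the prompt is illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-0125-preview",  # the new January preview model
    temperature=0,  # reduce variance between benchmark runs
    messages=[
        {
            "role": "system",
            "content": "You are an expert programmer. Complete the task "
                       "fully; do not leave TODOs or stub comments.",
        },
        {"role": "user", "content": "Write a function that ..."},
    ],
)
print(response.choices[0].message.content)
```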
Benchmark results
Overall, the new gpt-4-0125-preview model seems lazier than the November gpt-4-1106-preview model:
- It gets worse benchmark scores when using the unified diffs code editing format.
- Using aider’s older SEARCH/REPLACE block editing format, the new January model outperforms the older November model. But it still performs worse than both models using unified diffs. (Both editing formats are illustrated after this list.)
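For readers unfamiliar with the two formats, here is the same illustrative edit expressed both ways (the hello.py file and greet function are made up for this example). The first uses unified diffs, the conventional patch syntax; the second uses aider’s SEARCH/REPLACE blocks, which quote the exact lines to find and the lines to put in their place:

```diff
--- a/hello.py
+++ b/hello.py
@@ ... @@
-def greet():
-    print("hello")
+def greet(name):
+    print(f"hello, {name}")
```

```
hello.py
<<<<<<< SEARCH
def greet():
    print("hello")
=======
def greet(name):
    print(f"hello, {name}")
>>>>>>> REPLACE
```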
Related reports
This is one in a series of reports that use the aider benchmarking suite to assess and compare the code editing capabilities of OpenAI’s GPT models. You can review the other reports for additional information:
- GPT code editing benchmarks evaluates the March and June versions of GPT-3.5 and GPT-4.
- Code editing benchmarks for OpenAI’s “1106” models.
- Aider’s lazy coding benchmark.