May 13, 2024

LLM coding skill over time

Drawing graphs with aider, GPT-4o and matplotlib

I recently wanted to draw a graph showing how LLM code editing skill has been changing over time as new models have been released by OpenAI, Anthropic and others. I have all the data in a YAML file that is used to render aider’s LLM leaderboards.

Below is the aider chat transcript, which shows:

I launch aider with the YAML file, a file with other plots I’ve done recently (so GPT can crib the style) and an empty file called over_time.py.
Then I ask GPT to draw the scatterplot I want.
I run the resulting script and share the error output with GPT so it can fix a small bug.
I ask it to color the points for GPT-4 and GPT-3.5 family models differently, to better see trends within those model families.
I add a PNG of the graph to the chat, so GPT can see the current graph and understand why I want it to move the point labels so they don’t overlap the points.
I work through a series of other small style changes, like changing fonts and the graph border.

In the end I have the graph, but I also have the Python code in my repo, so I can easily update the graph whenever I add new entries to the YAML data file.

Aider chat transcript

Aider v0.35.1-dev
Models: gpt-4o with diff edit format, weak model gpt-3.5-turbo
Added _data/edit_leaderboard.yml to the chat.
Added benchmark/over_time.py to the chat.
Added benchmark/plots.py to the chat.

in over_time.py, draw a graph for each of the yaml entries with a `released` date. plot date on x, `pass_rate_2` on y and label with the `model` name. do a scatter plot. use colors/styles similar to the other graphs.

To draw a scatter plot for each of the YAML entries with a released date, we need to:

Read the YAML file.
Extract the released date and pass_rate_2 for each entry.
Plot the data using matplotlib.

Here are the changes:

benchmark/over_time.py

<<<<<<< SEARCH
=======
import matplotlib.pyplot as plt
import yaml
from datetime import datetime

def plot_over_time(yaml_file):
    with open(yaml_file, 'r') as file:
        data = yaml.safe_load(file)

    dates = []
    pass_rates = []
    models = []

    for entry in data:
        if 'released' in entry and 'pass_rate_2' in entry:
            dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
            pass_rates.append(entry['pass_rate_2'])
            models.append(entry['model'])

    plt.figure(figsize=(10, 6))
    plt.scatter(dates, pass_rates, c='blue', alpha=0.5)

    for i, model in enumerate(models):
        plt.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)

    plt.xlabel('Release Date')
    plt.ylabel('Pass Rate 2')
    plt.title('Model Performance Over Time')
    plt.grid(True)
    plt.tight_layout()
    plt.show()

# Example usage
plot_over_time('_data/edit_leaderboard.yml')
>>>>>>> REPLACE

Applied edit to benchmark/over_time.py
Commit 49ac9ca aider: Added scatter plot functionality for YAML entries with a released date and pass_rate_2.

/run python benchmark/over_time.py

Traceback (most recent call last):
  File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 33, in <module>
    plot_over_time('_data/edit_leaderboard.yml')
  File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 15, in plot_over_time
    dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: strptime() argument 1 must be str, not datetime.date

Add the output to the chat? y

The error indicates that the released field in the YAML file is already a datetime.date object, so we don’t need to parse it with datetime.strptime. We can directly use it in the plot.

Here are the changes:

benchmark/over_time.py

<<<<<<< SEARCH
            dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
=======
            dates.append(entry['released'])
>>>>>>> REPLACE

Applied edit to benchmark/over_time.py
Commit af2f4b7 aider: Use the released field directly in the plot instead of parsing it with datetime.strptime.
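
The fix works because the YAML loader has already turned unquoted ISO dates into datetime.date objects. A quoted date would come back as a plain string instead. As a minimal sketch of handling both cases, assuming a hypothetical helper named `to_date` that is not part of over_time.py:

```python
from datetime import date, datetime


def to_date(value):
    """Return a datetime.date for a date object or a 'YYYY-MM-DD' string.

    Hypothetical helper for illustration; not part of the original script.
    """
    # datetime is a subclass of date, so check it first and drop the time part.
    if isinstance(value, datetime):
        return value.date()
    if isinstance(value, date):
        return value
    if isinstance(value, str):
        return datetime.strptime(value, "%Y-%m-%d").date()
    raise TypeError(f"unsupported released value: {value!r}")


print(to_date(date(2024, 5, 13)))  # already a date, returned unchanged
print(to_date("2024-05-13"))       # string, parsed into a date
```

With a helper like this, the loop could call `dates.append(to_date(entry['released']))` and would not care how the date was written in the data file.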

color the “gpt-4” models differently

To color the “gpt-4” models differently, we need to modify the plotting code in benchmark/over_time.py to check for “gpt-4” in the model name and assign a different color.

Here are the changes:

benchmark/over_time.py

<<<<<<< SEARCH
    plt.scatter(dates, pass_rates, c='blue', alpha=0.5)
=======
    colors = ['red' if 'gpt-4' in model else 'blue' for model in models]
    plt.scatter(dates, pass_rates, c=colors, alpha=0.5)
>>>>>>> REPLACE

This change will color the points red if the model name contains “gpt-4”, otherwise, it will color them blue.

Applied edit to benchmark/over_time.py
Commit 0d11357 aider: Color the “gpt-4” models differently.

color the gpt-3.5 models a new color

To color the “gpt-3.5” models a new color, we need to modify the plotting code in benchmark/over_time.py to check for “gpt-3.5” in the model name and assign a different color.

Here are the changes:

benchmark/over_time.py

<<<<<<< SEARCH
    colors = ['red' if 'gpt-4' in model else 'blue' for model in models]
=======
    colors = ['red' if 'gpt-4' in model else 'green' if 'gpt-3.5' in model else 'blue' for model in models]
>>>>>>> REPLACE

Applied edit to benchmark/over_time.py
Commit 85d9d84 aider: Color the “gpt-3.5” models green in the plotting code.
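
The nested conditional expression is fine for two families, but it gets harder to read as more are added. One possible alternative, sketched here on the assumption that substring matching on the model name is good enough, is a small lookup table. `FAMILY_COLORS`, `DEFAULT_COLOR` and `family_color` are hypothetical names, and the model names are only examples rather than a claim about what is in the YAML file.

```python
# Hypothetical lookup table: substring of the model name -> point color.
# Order matters: the first matching substring wins.
FAMILY_COLORS = [
    ("gpt-4", "red"),
    ("gpt-3.5", "green"),
]
DEFAULT_COLOR = "blue"


def family_color(model):
    """Return the color for a model name, falling back to DEFAULT_COLOR."""
    for needle, color in FAMILY_COLORS:
        if needle in model:
            return color
    return DEFAULT_COLOR


# Example model names, for illustration only.
models = ["gpt-4o", "gpt-3.5-turbo-0125", "claude-3-opus-20240229"]
colors = [family_color(m) for m in models]
print(colors)  # ['red', 'green', 'blue']
```

The resulting `colors` list can be passed to `scatter` through the `c` argument exactly as the list comprehension above is.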

use the colorscheme and fonts from the other graphs in plots.py

To use the colorscheme and fonts from the graphs in plots.py, we need to:

Import the necessary settings from plots.py.
Apply these settings to the graph in over_time.py.

Here are the changes:

benchmark/over_time.py

<<<<<<< SEARCH
import yaml
from datetime import datetime
=======
import yaml
from datetime import datetime
from matplotlib import rc
>>>>>>> REPLACE

benchmark/over_time.py

<<<<<<< SEARCH
    plt.figure(figsize=(10, 6))
=======
    plt.rcParams["hatch.linewidth"] = 0.5
    plt.rcParams["hatch.color"] = "#444444"

    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.grid(axis="y", zorder=0, lw=0.2)
>>>>>>> REPLACE

benchmark/over_time.py

<<<<<<< SEARCH
    plt.scatter(dates, pass_rates, c=colors, alpha=0.5)
=======
    ax.scatter(dates, pass_rates, c=colors, alpha=0.5)
>>>>>>> REPLACE

benchmark/over_time.py

<<<<<<< SEARCH
        plt.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)

    plt.xlabel('Release Date')
    plt.ylabel('Pass Rate 2')
    plt.title('Model Performance Over Time')
    plt.grid(True)
    plt.tight_layout()
    plt.show()
=======
        ax.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)

    ax.set_xlabel('Release Date')
    ax.set_ylabel('Pass Rate 2')
    ax.set_title('Model Performance Over Time')
    plt.tight_layout()
    plt.savefig("tmp_over_time.svg")
    imgcat(fig)
>>>>>>> REPLACE

Applied edit to benchmark/over_time.py
Commit 8d670f1 aider: Used the colorscheme and fonts from the graphs in plots.py in over_time.py.

/add tmp_over_time.png

Added /Users/gauthier/Projects/aider/tmp_over_time.png to the chat

make the labels not overlap the dots

To make the labels not overlap the dots, we can adjust the position of the labels slightly using the xytext parameter in the annotate function.

Here are the changes:

benchmark/over_time.py

<<<<<<< SEARCH
        ax.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)
=======
        ax.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75,
                    xytext=(5, 5), textcoords='offset points')
>>>>>>> REPLACE

Committing benchmark/over_time.py before applying edits.
Commit 49eb72c Updated image file format from SVG to PNG and added imgcat library for displaying the plot.
Applied edit to benchmark/over_time.py
Commit 75fa061 aider: Adjust the position of labels to prevent overlap with dots.

make the dots bigger

To make the dots bigger, we can adjust the s parameter in the scatter function, which controls the size of the dots.

Here are the changes:

benchmark/over_time.py

<<<<<<< SEARCH
    ax.scatter(dates, pass_rates, c=colors, alpha=0.5)
=======
    ax.scatter(dates, pass_rates, c=colors, alpha=0.5, s=100)
>>>>>>> REPLACE

Applied edit to benchmark/over_time.py
Commit a018fdf aider: Make the dots bigger.

make the border of the graph thinner and lighter

To make the border of the graph thinner and lighter, we can adjust the spines properties of the axes.

Here are the changes:

benchmark/over_time.py

<<<<<<< SEARCH
    colors = ['red' if 'gpt-4' in model else 'green' if 'gpt-3.5' in model else 'blue' for model in models]
=======
    for spine in ax.spines.values():
        spine.set_edgecolor('#DDDDDD')
        spine.set_linewidth(0.5)

    colors = ['red' if 'gpt-4' in model else 'green' if 'gpt-3.5' in model else 'blue' for model in models]
>>>>>>> REPLACE

Applied edit to benchmark/over_time.py
Commit 77ba518 aider: Made the border of the graph thinner and lighter by adjusting the spines properties of the axes.
