Welcome to The Long View—where we peruse the news of the week and strip it to the essentials. Let’s work out what really matters.
This week: Instead of AI aids for programming, what about for debugging? The pseudonymous “BioBootloader” says he’s persuaded GPT-4 to make his code self-heal.
Regenerative Hyperbole?
Analysis: Smoke ’n’ mirrors
It’s impressive stuff—if only for the clever prompt engineering. But there’s a huge difference between fixing a runtime error and making a program do what the user story says. It’s unclear if Wolverine is a step along the journey or a dead end.
What’s the story? Benj Edwards reports—“Developer creates “regenerative” AI program that fixes bugs on the fly”:
“Not yet fully been explored”
Debugging a faulty program can be frustrating, so why not let AI do it for you? That’s what a developer that goes by “BioBootloader” did by creating Wolverine, a program that can give Python programs “regenerative healing abilities.” … (Yep, just like the Marvel superhero.)
…
In the demo … BioBootloader shows a side-by-side window display, with Python code on the left and Wolverine results on the right in a terminal. He loads a custom calculator script in which he adds a few bugs on purpose, then executes it. … GPT-4 returns an explanation for the program’s errors, shows the changes that it tries to make, then re-runs the program. Upon seeing new errors, GPT-4 fixes the code again, and then it runs correctly.
…
While it’s currently a primitive prototype, techniques like Wolverine illustrate a potential future where apps may be able to fix their own bugs—even unexpected ones that may emerge after deployment. Of course, the implications, safety, and wisdom of allowing that to happen have not yet fully been explored.
Are we living in the future? Donald Papp breathlessly broke the story—“Wolverine Gives Your Python Scripts the Ability to Self-Heal”:
“Carefully-written prompt”
The demo Python script is a simple calculator that works from the command line, and BioBootloader introduces a few bugs to it. He misspells a variable used as a return value, and deletes the subtract_numbers(a, b) function entirely. Running this script by itself simply crashes, but using Wolverine on it has a very different outcome.
…
GPT-4 correctly identifies the two bugs (even though only one of them directly led to the crash) but … Wolverine actually applies the proposed changes to the buggy script, and re-runs it. This time around there is still an error—because GPT-4’s previous changes included an out-of-scope return statement. No problem, because Wolverine once again consults with GPT-4, creates and formats a change, applies it, and re-runs the modified script. This time the script runs successfully.
…
A large chunk of what Wolverine does is thanks to a carefully-written prompt.
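How might that loop look in code? Here’s a minimal, hypothetical sketch of the idea, not BioBootloader’s actual implementation: run the script; if it crashes, hand the source and the traceback to a model, overwrite the file with whatever comes back, and go around again. (The real tool asks the model for structured edits rather than a whole-file rewrite, and the `ask_gpt` helper below is a placeholder you would wire to the LLM API of your choice.)

```python
# Minimal sketch of a Wolverine-style loop (hypothetical; not BioBootloader's code).
import subprocess
import sys

MAX_ATTEMPTS = 5

def ask_gpt(source: str, stderr: str) -> str:
    """Placeholder: send the broken source plus its traceback to an LLM; return the proposed fix."""
    raise NotImplementedError("plug in your LLM client here")

def self_heal(path: str) -> bool:
    for attempt in range(1, MAX_ATTEMPTS + 1):
        # Run the target script and capture its output.
        result = subprocess.run([sys.executable, path], capture_output=True, text=True)
        if result.returncode == 0:
            print(result.stdout, end="")
            return True                      # script ran cleanly: stop
        print(f"Attempt {attempt} crashed; asking the model for a fix...")
        with open(path) as f:
            source = f.read()
        with open(path, "w") as f:           # overwrite the script and go around again
            f.write(ask_gpt(source, result.stderr))
    return False

if __name__ == "__main__":
    sys.exit(0 if self_heal(sys.argv[1]) else 1)
```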
Horse’s mouth? The entity known as BioBootloader generates the next token—and the next and the next and the next:
This is just a quick prototype I threw together in a few hours. There are many possible extensions and contributions are welcome:
- add flags to customize usage, such as asking for user confirmation before running changed code
- further iterations on the edit format that GPT responds in. Currently it struggles a bit with indentation, but I’m sure that can be improved
- a suite of example buggy files that we can test prompts on to ensure reliability and measure improvement
- multiple files/codebases: send GPT everything that appears in the stack trace
- graceful handling of large files — should we just send GPT relevant classes/functions?
- extension to languages other than Python
Something-something SKYNET? jszymborski snarks it up:
It’s all fun and games until you get a subprocess.run(["rm", "/", "-rf"]) snuck in there that you fail to notice.
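Which is exactly why the first item on BioBootloader’s wish list is a confirmation flag. Here’s a hedged sketch of what such a gate might look like (a hypothetical helper, not part of Wolverine as shipped): diff the model’s proposed change against the current file and make a human type “y” before anything is written or re-run.

```python
# Hypothetical confirmation gate: show the model's proposed change as a unified
# diff and require an explicit "y" before it touches disk.
import difflib

def confirm_patch(original: str, proposed: str) -> bool:
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        proposed.splitlines(keepends=True),
        fromfile="current", tofile="proposed",
    )
    print("".join(diff))
    return input("Apply and re-run this change? [y/N] ").strip().lower() == "y"
```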
SRSLY though? physicsphairy tried a similar thing:
There are limitations to what complexity you can implement with ChatGPT, the biggest being the number of tokens … it can keep in its running memory. That said, it can definitely do a reasonable amount of architecture, including things like coming up with actual UML. Coding a backend, you can ask it for a complete list of the API endpoints you will need to implement, even the OpenAPI spec for it.
…
It has no problem with a project spanning multiple files, but you either need to specify the file layout or let it come up with it. My experience is that after the conversation proceeds far enough, it starts to “forget,” especially if you have gone down paths you later choose not to follow.
…
It does great at debugging. You can know nothing about a language and just keep giving GPT the error messages/stacktraces and eventually make the right change to get your code working.
That “eventually” is doing a bunch of heavy lifting. unequivocal has been experimenting with ChatGPT as a coding assistant:
It is really great for API integrations and other “garbage” coding where I know what needs to be done, but there are a bunch of finicky elements that have to be integrated. It is especially effective in APIs and interfaces that have poor or no documentation: … It probably cut my development time/effort in half.
…
But running an AI unsupervised on a codebase? Not yet.
Not sure if brilliant efficiency or sheer laziness. Get off Robbie’s lawn:
Very clever … but this would seem to be simply a tool to encourage bad practice: “Oh, an error, I’ll leave it to the AI to fix.” Better to understand and debug your code properly for yourself, and also to use a compiled language where these errors will all be found at compile time rather than randomly appearing at some point during run time.
In any case, it’s a super-limited definition of “self-healing.” Todd Knarr thinks it’s a parlor game:
That’s well and good … but that’s not the big problem. … The big problem isn’t programs that crash. Those are usually caused by bugs that are easy to find and fix.
Talk to me when the AI can take a program that runs perfectly well but produces the wrong output, figure out that the output is wrong, figure out what the right output should be, work back to find where the mistake was, and fix that. And explain its fix.
Meanwhile, andreas-motzek cuts to the chase:
There is a bug if a program does not behave according to the requirements. How does ChatGPT know the requirements?
The Moral of the Story:
Life’s tragedy is that we get old too soon and wise too late
—Benjamin Franklin
You have been reading The Long View by Richi Jennings. You can contact him at @RiCHi or tlv@richi.uk.
Image: Alex Shuper (via Unsplash; leveled and cropped)