A report published this week by Sonar finds that the GPT-5 platform released by OpenAI has the potential to generate better code, but at significantly higher cost.
Based on more than 4,400 Java tasks, the report finds that the overall quality of the code, especially in terms of the vulnerabilities generated, improves significantly depending on which of the four levels of reasoning capability OpenAI now makes available.
However, the overall volume of code being generated per task also substantially increases, which creates additional maintenance challenges for application developers who are not familiar with how the code was constructed in the first place. Overall, the report finds the minimal edition of GPT-5 produces more than twice as many lines of code per task assigned as the previous GPT-4o edition of the platform.
There are also pluses and minuses when it comes to security. For example, the report noted that higher reasoning eliminates common, well-understood flaws such as “path-traversal & injection” vulnerabilities. However, these are replaced by subtle, harder-to-detect flaws. The percentage of vulnerabilities related to “inadequate I/O error-handling” increases to 44% in the high reasoning mode versus 30% in the minimal reasoning mode.
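The report itself does not reproduce code, but the two categories are easy to contrast. A minimal Java sketch (the class, method and path names here are hypothetical) might look like this:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class VulnerabilityExamples {

    // Classic, well-understood flaw: a user-supplied name such as
    // "../../etc/passwd" escapes the intended base directory because
    // the input is never validated (path traversal).
    static String readUserFileUnsafe(String userSuppliedName) throws IOException {
        Path base = Path.of("/var/app/data");
        return Files.readString(base.resolve(userSuppliedName)); // vulnerable
    }

    // The subtler category: the traversal is blocked, but I/O errors are
    // swallowed, so callers silently proceed with an empty result instead
    // of failing fast (inadequate I/O error handling).
    static String readUserFileSilently(String userSuppliedName) {
        Path base = Path.of("/var/app/data");
        Path resolved = base.resolve(userSuppliedName).normalize();
        if (!resolved.startsWith(base)) {
            return ""; // rejects traversal, but hides the rejection from the caller
        }
        try {
            return Files.readString(resolved);
        } catch (IOException e) {
            return ""; // the error vanishes; the bug only surfaces far downstream
        }
    }
}
```

The first flaw is caught by widely deployed static analyzers; the second compiles, runs and passes happy-path tests, which is precisely what makes it harder to detect.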
A similar trade-off occurs with bugs. As reasoning increases, the rate of fundamental “control-flow mistake” bugs decreases significantly. However, the percentage of advanced “concurrency/threading” bugs increases from 20% in minimal mode to approximately 38% in high mode.
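Concurrency defects of the kind the report counts are notoriously hard to spot because the code compiles and usually appears to work. A hedged Java sketch of the canonical case, an unsynchronized read-modify-write on shared state (all names hypothetical), illustrates why:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrencyBugExample {
    private static int counter = 0; // shared mutable state with no synchronization

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                for (int j = 0; j < 100_000; j++) {
                    counter++; // read-modify-write race: concurrent updates are lost
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // Expected 400000; under contention the printed total is typically lower.
        // Nothing crashes, which is what makes the flaw harder to detect than a
        // control-flow mistake. Using AtomicInteger (or synchronization) fixes it.
        System.out.println(counter);
    }
}
```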
Donald Fischer, a Sonar vice president, said the report makes it clear there is a trade-off between using, for example, the minimal level of reasoning capability, priced at $22 per developer per month, and the highest level, priced at $189 per developer per month. While there is little doubt that AI coding tools can increase productivity, DevOps teams need to make sure they verify the quality of the output being provided across multiple dimensions, including the costs being incurred, he added.
The latest GPT-5 analysis builds on a previous Sonar report that assessed the code generated by large language models (LLMs) from OpenAI, Anthropic and Meta, which found each model has its own “coding personality” that affects the quality of the code created. In general, the models possess a strong foundational understanding of common algorithms and data structures and can create viable solutions for well-defined problems.
Additionally, the models are highly effective at translating code concepts and snippets from one programming language to another, which makes them a powerful tool for developers who work with different technology stacks, according to the report. However, critical flaws such as hard-coded credentials and path-traversal injections were common across all models. While the exact prevalence varies between models, all the LLMs previously evaluated produced a high percentage of vulnerabilities with high severity ratings.
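Hard-coded credentials are the most literal of those flaws: the secret ships with the source. A brief Java sketch (the JDBC URL and credentials are hypothetical placeholders) shows the pattern and the usual remediation:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class CredentialsExample {
    // Flawed: the password lives in the binary and in version control.
    static Connection connectUnsafe() throws SQLException {
        return DriverManager.getConnection(
                "jdbc:postgresql://db.example.com/app", "admin", "s3cret");
    }

    // Typical remediation: pull secrets from the environment (or a vault) at runtime.
    static Connection connectSafer() throws SQLException {
        return DriverManager.getConnection(
                System.getenv("DB_URL"),
                System.getenv("DB_USER"),
                System.getenv("DB_PASSWORD"));
    }
}
```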
It’s not clear to what degree DevOps teams have adopted AI coding tools, or how much they trust the output being generated. There is no doubt that developers will be more productive, especially when building prototypes, but how much of that code makes it into production environments is unknown. What is clear is that some of the productivity gains being made will eventually come at a cost that many DevOps teams may not realize they are already starting to incur.