An analysis of code created by large language models (LLMs) finds that, in addition to having a tendency to create messy code, the models introduce security weaknesses and vulnerabilities into the output they generate.
Sonar today published a report based on its proprietary analysis framework for assessing LLM-generated code spanning more than 4,400 Java programming assignments. The LLMs evaluated included Anthropic’s Claude Sonnet 4 and Claude 3.7 Sonnet, OpenAI’s GPT-4o, Meta’s Llama-3.2-vision:90b and OpenCoder-8B.
Each model has its own “coding personality” and possesses a strong ability to generate syntactically correct code and boilerplate for common frameworks and functions. For example, Claude Sonnet 4’s success rate of 95.57% on HumanEval demonstrates a very high capability to produce valid, executable code. The models possess a strong foundational understanding of common algorithms and data structures and can create viable solutions for well-defined problems. Additionally, the models are highly effective at translating code concepts and snippets from one programming language to another, which makes them a powerful tool for developers who work with different technology stacks, according to the report.
However, critical flaws such as hard-coded credentials and path-traversal injections were common across all models. While the exact prevalence varied between models, every evaluated LLM produced a high percentage of vulnerabilities with high severity ratings. For Llama-3.2-vision:90b, more than 70% of the vulnerabilities found were rated ‘blocker’ severity; for GPT-4o, the figure was 62.5%; and for Claude Sonnet 4, it was nearly 60%, according to the report.
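To make those two flaw classes concrete, the sketch below shows what a hard-coded credential and a path-traversal injection typically look like in Java, alongside a safer variant of the file access. It is a hypothetical illustration of the categories the report names, not code drawn from the study, and the class and path names are invented for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of the flaw classes named in the report; not code from the study.
public class ReportHandler {

    // Hard-coded credential: a secret baked into source code instead of being
    // read from a vault or an environment variable.
    private static final String DB_PASSWORD = "s3cr3t-admin-pw"; // vulnerable

    // Path traversal: user input is concatenated into a path without validation,
    // so a value like "../../etc/passwd" escapes the intended base directory.
    public static String readUserReport(String fileName) throws IOException {
        Path path = Paths.get("/var/reports/" + fileName); // vulnerable
        return Files.readString(path);
    }

    // Safer variant: resolve and normalize the path, then confirm the result
    // still sits inside the base directory before reading it.
    public static String readUserReportSafely(String fileName) throws IOException {
        Path base = Paths.get("/var/reports").toAbsolutePath().normalize();
        Path resolved = base.resolve(fileName).normalize();
        if (!resolved.startsWith(base)) {
            throw new IllegalArgumentException("Path escapes report directory: " + fileName);
        }
        return Files.readString(resolved);
    }

    public static void main(String[] args) {
        try {
            // The traversal attempt is rejected by the startsWith check above.
            readUserReportSafely("../../etc/passwd");
        } catch (IllegalArgumentException | IOException e) {
            System.out.println("Blocked or failed: " + e.getMessage());
        }
    }
}
```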
In fact, the study finds that improved functional performance was often accompanied by much higher levels of risk. While Claude Sonnet 4 improved its performance benchmark pass rate by 6.3% over Claude 3.7 Sonnet, meaning it solved more problems correctly, this performance gain came at a price: The percentage of high-severity bugs rose by 93%.
All models tested also showed a bias toward messy code, with more than 90% of the issues found being so-called “code smells,” indicators of poor structure and low maintainability. Those smells, which include dead and redundant code, increase the long-term technical debt an organization accrues.
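For readers unfamiliar with the term, the short sketch below shows two of the smell types the report calls out, dead code and redundant code, in a contrived Java class. It is an invented example of the general categories, not output from any of the evaluated models.

```java
import java.util.List;

// Hypothetical illustration of "code smell" categories highlighted in the report;
// not code taken from the study or from any model's output.
public class InvoiceTotals {

    // Dead code smell: this private helper is never called anywhere in the class.
    private double legacyTax(double amount) {
        return amount * 0.2;
    }

    public double total(List<Double> lineItems) {
        double sum = 0;
        for (double item : lineItems) {
            sum += item;
        }
        // Redundant code smell: both branches return the same value, so the
        // conditional adds noise and maintenance cost without changing behavior.
        if (sum >= 0) {
            return sum;
        } else {
            return sum;
        }
    }

    public static void main(String[] args) {
        // Prints 24.99; the smells do not break the program, they just degrade it.
        System.out.println(new InvoiceTotals().total(List.of(19.99, 5.0)));
    }
}
```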
Prasenjit Sarkar, solutions marketing manager for Sonar, said the report makes it clear that DevOps teams need to not only review the code generated by LLMs but also recognize that each LLM is essentially an opinionated resource that has been trained to behave in a specific manner. As such, each developer should take into account the inherent biases embedded in each model, he added.
The verbosity of the generated code can also make it challenging to debug and fix, Sarkar noted. Many developers find it difficult to understand code they did not write. In fact, they may need a separate LLM and an associated set of AI agents to review the code another LLM created.
It’s not clear to what degree DevOps teams have adopted AI coding tools and just how much they trust the output generated. There is no doubt that developers will be more productive, but some of those gains will come at a cost tomorrow that organizations may not realize they are incurring today.