Is AI-generated code safe to ship? A senior engineer's guide

AI code quality | Zegaware engineering | 2 June 2026 | 9 min read

Last updated: 2 June 2026

AI-generated code is usually safe to run and frequently not safe to ship. Those are different questions, and the gap between them is large and well measured: in a Carnegie Mellon benchmark, agent-written code was functionally correct about 61% of the time but secure only 10.5% of the time [1]. The honest answer is that AI-built software can be made safe to ship, but not by default, and not without a senior review gate deciding what is safe now, what must be fixed first, and what should not have been built by an agent at all.

This guide is for founders and engineering leaders who have shipped, or are about to ship, software that an AI tool helped write, and want a clear-eyed view of whether it is ready.

"It works" and "it is safe to ship" are different questions

The demo answers one question: does it work? Shipping to customers asks a harder one: is it safe, correct and operable when real people, real data and real attackers arrive? AI coding tools are extraordinarily good at the first and largely indifferent to the second, because they optimise for code that plausibly satisfies the prompt, not code that holds up under conditions nobody asked about.

The single most useful number we know comes from Carnegie Mellon's study of AI coding agents across two hundred real programming tasks: about 61% of the solutions were functionally correct, but only 10.5% were secure, and more than 80% of the working solutions still contained critical vulnerabilities [1]. Producing software that runs and producing software that is safe are, on this evidence, nearly independent outcomes. The demo only ever tests the first.
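
The three figures fit together if, as we read the benchmark, a solution only counts as secure when it is also functionally correct. That reading is our assumption, and the inputs below are the rounded percentages quoted above rather than raw benchmark data. A quick check of the arithmetic:

```python
# Rounded figures quoted above from the benchmark [1].
functionally_correct = 0.61  # share of all solutions that worked
secure = 0.105               # share of all solutions that were secure

# Assumption: every secure solution is also a working one,
# so the secure solutions are a subset of the working ones.
secure_share_of_working = secure / functionally_correct
vulnerable_share_of_working = 1 - secure_share_of_working

print(f"Working solutions that are secure:     {secure_share_of_working:.1%}")
print(f"Working solutions that are vulnerable: {vulnerable_share_of_working:.1%}")
```

On that assumption, about 17% of the working solutions were secure and about 83% were not, which agrees with the "more than 80%" the study reports.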

The evidence is consistent across the industry

This is not one outlier study. Veracode's testing across more than 150 models found that 45% of AI-generated code introduces at least one vulnerability from the OWASP Top 10, a figure that did not improve across two further years of frontier model releases. As Veracode put it: "The productivity revolution is here. The security revolution isn't" [2]. The OWASP Top 10 for LLM Applications exists precisely because these failure modes are now common enough to standardise [3].

The pattern holds for the specific problems we find most often:

Secrets in the source. GitGuardian found AI-assisted commits leak secrets at roughly twice the rate of human-only commits, with 28.65 million new secrets pushed to public repositories in a single year [4].
More defects per change. CodeRabbit's analysis of pull requests found AI-authored changes carried about 1.7 times more issues than human-authored ones [5].
Dependencies that do not exist. Academic researchers found that 19.7% of AI-generated code samples referenced a non-existent package, a pattern attackers now exploit by registering the predictable fictional names [6].
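
To make the first of these concrete, here is a minimal sketch of the kind of check that catches a hard-coded credential before it is committed. The three patterns and the `find_secrets` helper are our own illustration, not GitGuardian's detectors; production scanners use far more detectors and typically add entropy checks.

```python
import re

# Illustrative patterns only; real scanners ship many more detectors.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hardcoded_credential": re.compile(
        r"(?:api_key|secret|password|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]",
        re.IGNORECASE,
    ),
}


def find_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every suspicious line."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


# A made-up snippet: one inline key, one value read from the environment.
sample = '''
import os
API_KEY = "sk_live_example_not_a_real_key"
db_password = os.environ["DB_PASSWORD"]
'''

print(find_secrets(sample))  # [(3, 'hardcoded_credential')]
```

The inline key on line 3 is flagged; the password read from the environment on line 4 is not, which is the behaviour a reviewer wants from a first-pass check.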

If you have built with AI, none of this means your software is bad. It means it is a first draft from a fast, capable, and unsupervised junior, and it needs the review any such draft would.

Why AI-built code is unsafe by default

The reasons are structural, not incidental, which is why the numbers do not improve on their own:

It models the prompt, not the threat. Asked to "return the user's orders", a model returns orders. It does not reflexively ask which user is allowed to see which orders, so missing authorisation is the most common serious finding we see.
It learned from insecure examples. Public training data is full of inline keys, string-concatenated SQL and happy-path code, so the model reproduces those patterns confidently.
It writes tests that confirm the code ran, not that it is correct. A passing suite that never exercises an unauthorised request or a malformed input proves very little.
It optimises locally. Each session solves its task in isolation, so a codebase accumulates three ways to do the same thing and no single place to fix anything.
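
The first three points can be sketched in a few lines. The example below is a hedged illustration, not code from any audited product: the `orders` table, the `get_orders` function and the `requester_is_admin` flag are hypothetical names, and an in-memory SQLite database stands in for a real data store. It shows an explicit authorisation check, a parameterised query in place of string-concatenated SQL, and tests that exercise the failure paths as well as the happy path.

```python
import sqlite3


def get_orders(conn, requester_id, requester_is_admin, customer_id):
    """Return a customer's orders, but only to someone allowed to see them."""
    # Authorisation first: non-admins may read only their own orders.
    if requester_id != customer_id and not requester_is_admin:
        raise PermissionError("requester may not view these orders")
    # Parameterised query: customer_id is bound as data, never
    # concatenated into the SQL text.
    return conn.execute(
        "SELECT id, item FROM orders WHERE customer_id = ? ORDER BY id",
        (customer_id,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, item) VALUES (?, ?)",
    [(1, "keyboard"), (2, "monitor")],
)

# Happy path: customer 1 reads their own orders.
assert get_orders(conn, 1, False, 1) == [(1, "keyboard")]

# Failure path: customer 1 asks for customer 2's orders and is refused.
try:
    get_orders(conn, 1, False, 2)
except PermissionError:
    print("unauthorised request blocked")
else:
    raise AssertionError("expected PermissionError")

# Malformed input: an injection-style string is treated as a value,
# matches no customer, and returns nothing.
assert get_orders(conn, 1, True, "2 OR 1=1") == []
```

A generated test suite usually contains only the first assertion. The second and third are the ones that tell you whether the code is safe.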

The UK's National Cyber Security Centre frames the response well: the goal is deterministic controls, implemented in rules and code, that constrain what AI-written code can do even when it is flawed, rather than trusting the model to police itself [7].
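
One form such a control could take is a build step that refuses any dependency a human has not approved, which also blunts the hallucinated-package problem described earlier. This is our own sketch, not something the NCSC guidance prescribes; the `unapproved` helper, the allowlist and the package name `fast-json-utils-pro` are invented for illustration, and the parser handles only simple requirements lines.

```python
import re


def normalise(name: str) -> str:
    """Normalise a package name the way Python package indexes compare them."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_requirements(text: str) -> list[str]:
    """Extract package names from simple requirements.txt-style lines."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):   # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(normalise(match.group(0)))
    return names


def unapproved(requirements_text: str, approved: set[str]) -> list[str]:
    """Return declared packages that are missing from the approved set."""
    allowed = {normalise(name) for name in approved}
    return sorted(set(parse_requirements(requirements_text)) - allowed)


# Hypothetical allowlist maintained by a human reviewer.
APPROVED = {"requests", "SQLAlchemy"}

requirements = """
requests>=2
sqlalchemy>=2.0
fast-json-utils-pro  # suggested by the assistant, never vetted
"""

print(unapproved(requirements, APPROVED))  # ['fast-json-utils-pro']
```

Run in CI with a non-zero exit when the list is non-empty, a check like this holds whether or not the model that wrote the code behaved well, which is the point of a deterministic control.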

What "safe to ship" actually means

"Safe to ship" is not a feeling. It is a decision a named senior engineer is willing to sign, and it rests on a simple idea: some code an AI tool may draft, and some code a human must own. We work in three lanes.

Green: an AI tool may draft it, a human reviews it. Presentational components, internal utilities, well-bounded transformations. Low blast radius if wrong.
Amber: an AI tool may draft it, a human rewrites the risky parts. Anything touching external input, data access patterns, or third-party integrations.
Red: a human writes it, full stop. Authentication, authorisation, cryptography, payments, data migrations and secrets handling. An AI model may sketch these, but no one ships them without a human author and a security review.
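
The lanes can be encoded as a rule a pipeline applies to every change, so the decision does not depend on anyone remembering it. The sketch below is one possible encoding under stated assumptions: the path patterns, `lane_for` and `strictest_lane` are hypothetical, and a real project would tune the patterns to its own layout. Matching is deliberately broad, so a file such as `author_bio.py` would land in red; erring towards the stricter lane is the safer failure. Unmatched paths fall to green here, and a more cautious team might default them to amber instead.

```python
from enum import Enum
from fnmatch import fnmatchcase


class Lane(Enum):
    GREEN = 1  # AI may draft, a human reviews
    AMBER = 2  # AI may draft, a human rewrites the risky parts
    RED = 3    # a human writes it


# Hypothetical path patterns, checked strictest lane first.
RULES = [
    (Lane.RED, ["*auth*", "*crypto*", "*payment*", "*migration*", "*secret*"]),
    (Lane.AMBER, ["*api/*", "*integrations/*", "*repositories/*"]),
]


def lane_for(path: str) -> Lane:
    """Classify one changed file; anything unmatched is green."""
    lowered = path.lower()
    for lane, patterns in RULES:
        if any(fnmatchcase(lowered, pattern) for pattern in patterns):
            return lane
    return Lane.GREEN


def strictest_lane(paths: list[str]) -> Lane:
    """A change set is governed by the strictest lane it touches."""
    return max(
        (lane_for(path) for path in paths),
        key=lambda lane: lane.value,
        default=Lane.GREEN,
    )


changed = ["src/components/Button.tsx", "src/api/orders.py"]
print(strictest_lane(changed).name)  # AMBER
```

A pull request that touches a single red path is then routed to a human author and a security review, however small the rest of the change.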

The deliverable of a review is not a list of complaints. It is that decision, made explicitly and signed: what is safe to ship now, what must be fixed first ranked by severity, and what should be rebuilt rather than patched.

How to make AI-built code safe to ship

The path is not "stop using AI". It is to put a deliberate review gate between generation and production. In practice that means a senior engineer reading the code against recognised standards (the OWASP Top 10 for LLM Applications for any AI feature [3], and a conventional application-security baseline for the rest), checking the specific failure patterns that AI reliably produces, and making the lane decision above for the risky surfaces.

The recurring findings are remarkably consistent, which is what makes a structured review fast and worthwhile: exposed secrets, injection, missing authorisation, shallow tests, silent failures, architectural drift and hallucinated dependencies. We have written up exactly what that review surfaces, in the order we tend to find it, in "What a vibe code audit actually finds". That piece is the detailed companion to this one.

What it costs to get this wrong

For a funded founder, unsafe AI-built code is due diligence risk. Technical due diligence is now a standard, and often decisive, part of UK funding rounds, and a codebase with committed secrets, no failure-path tests and several implementations of the same logic is exactly what a technical reviewer finds in an afternoon. The cost of fixing it after a term sheet is far higher than a review beforehand.

For an enterprise leader, it is liability and accumulated drag. Forrester predicted in late 2024 that 75% of technology decision-makers would see their technical debt rise to a moderate or high level of severity by 2026, with the rapid development of AI solutions named as an accelerant [8]. A review before you scale is cheaper than the cleanup, and far cheaper than the breach.

Frequently asked questions

Is AI-generated code ever safe to ship?

Yes, when it has passed a senior review gate. AI-built software that has been read against recognised standards, with its risky surfaces (authentication, authorisation, data access, payments) rewritten or verified by a human, ships safely every day. The risk comes from shipping AI output unreviewed, on the assumption that "it works in the demo" means "it is safe in production". Those are different claims.

Does this mean we should stop using AI to write code?

No. AI coding tools genuinely accelerate delivery, and that is worth keeping. The point is that their output needs the same senior review any first draft requires, applied before customers rely on it. Used with a clear review gate, AI-built software is safe to ship; used without one, it ships the recurring problems the evidence above describes.

How do we check whether our AI-built code is safe to ship?

Have a senior engineer review it against the OWASP Top 10 for LLM Applications and a conventional security baseline, with particular attention to secrets, authorisation, input handling, test quality and dependencies, then make an explicit decision about each risky surface. If you do not have that capability in house, an external review answers the question quickly and puts a name to the verdict.

How long does a review take?

For most products, from a few days to a couple of weeks, depending on the size of the codebase and how much AI-generated code it contains. The output is a written verdict you can act on, not a raw tool dump.

The honest answer

AI lets a small team build something real in a fraction of the time it used to take, and that is a genuine advance. It does not change the one question the demo cannot answer: is it safe to ship? If you have built something with AI and you want a senior engineer to tell you honestly where it stands, book a Vibe Code Audit. We will tell you what we find, and we will put our name to it.

Sources

[1] Songwen Zhao et al., "Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks" (SUSVIBES benchmark), Carnegie Mellon University, arXiv:2512.03262, 2026. https://arxiv.org/abs/2512.03262
[2] Veracode, Spring 2026 GenAI Code Security Update, 24 March 2026. https://www.veracode.com/blog/spring-2026-genai-code-security/
[3] OWASP, Top 10 for LLM Applications 2025. https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/
[4] GitGuardian, The State of Secrets Sprawl 2026, 17 March 2026. https://blog.gitguardian.com/the-state-of-secrets-sprawl-2026/
[5] CodeRabbit, State of AI vs Human Code Generation Report, 17 December 2025. https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report
[6] Joseph Spracklen et al., "We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs", USENIX Security 2025, arXiv:2406.10279. https://arxiv.org/abs/2406.10279
[7] National Cyber Security Centre, "Vibe check: AI may replace SaaS (but not for a while)", 24 March 2026. https://www.ncsc.gov.uk/blogs/vibe-check-ai-may-replace-saas-but-not-for-a-while
[8] Forrester, "Predictions 2025: Technology And Security" (press release), 22 October 2024. https://www.forrester.com/press-newsroom/forrester-predictions-2025-tech-security/

Not sure what you are shipping? Our Vibe Code Audit puts senior engineers across your AI-built software and signs off what is safe to ship. Fixed fee, scored review, a clear go or no-go.
