What a vibe code audit actually finds

AI code quality · Zegaware engineering · 4 June 2026 · 9 min read

Last updated: 4 June 2026

A vibe code audit is a senior engineer's structured review of AI-built software before it ships. The recurring findings are remarkably consistent across projects: exposed secrets, injection flaws, missing authorisation, shallow tests, silent failures, architectural drift, and hallucinated dependencies. The code almost always runs. Whether it is safe to ship is a separate question, and that is the one an audit answers.

We review a lot of AI-built software at Zegaware. The codebases arrive from funded founders who shipped a product in a weekend, from enterprise teams who let an agent loose on an internal tool, and from buyers part-way through an acquisition who want to know what they are actually purchasing. The brief is always the same: it works in the demo, but is it safe to put in front of customers? This article is an honest account of what we find when we look.

The gap between "it works" and "it is safe to ship"

The single most useful number we know comes from a Carnegie Mellon study of AI coding agents. Across two hundred real programming tasks, the agents produced functionally correct code about 61% of the time, but only 10.5% of their solutions were also secure [1]. That fifty-point gap is the whole problem in one statistic. Producing software that runs and producing software that is safe are largely independent outcomes, and the demo only ever tests the first.

The wider data agrees. Veracode's testing across more than 150 models found that 45% of AI-generated code introduces at least one vulnerability from the OWASP Top 10, a figure that did not improve across two further years of frontier model releases [2]. As Veracode put it: "The productivity revolution is here. The security revolution isn't" [2]. CodeRabbit's analysis of pull requests reached the same place from a different angle: AI-authored changes carried roughly 1.7 times more issues than human-authored ones, with cross-site scripting 2.74 times more likely and insecure object references 1.91 times more likely [3].

None of this means AI-built software is worthless. It means the output needs the same senior judgement that any junior engineer's first draft would need, applied deliberately, before anyone relies on it. Here is what that judgement surfaces, in the order we tend to find it.

Secrets committed in the open

The most common finding, and the easiest to exploit, is credentials in the source. AI models learned to write working code from training data full of inline API keys and connection strings, so they reproduce the pattern. GitGuardian found that Claude Code-assisted commits leak secrets at a rate of 3.2%, more than double the 1.5% baseline across all public commits, and recorded 28.65 million new secrets pushed to public repositories in a single year [4]. The exposure is rarely brief: a set of AWS keys embedded in one organisation's website source sat there for nearly two years before anyone noticed [5].
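
To make the pattern concrete, here is a minimal sketch of the kind of check a reviewer might run over a file. It is an illustration, not a real scanner: the two rules, the function name find_suspect_secrets, and the sample snippet are all assumptions made for this example, and dedicated secret-scanning tools use far larger rule sets plus entropy checks.

```python
import re

# Illustrative rules only; these two patterns are assumptions for the example.
RULES = {
    # AWS access key IDs begin with "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A quoted literal of 8+ characters assigned to a credential-looking name.
    "hardcoded_credential": re.compile(
        r"(?i)(api_?key|secret|password|token)\w*\s*[:=]\s*[\"'][^\"']{8,}[\"']"
    ),
}


def find_suspect_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a rule."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for rule_name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings


# Hypothetical file contents: two inline secrets, one value read from the environment.
sample = '''
import os
AWS_KEY = "AKIAIOSFODNN7EXAMPLE"
db_password = "hunter2-hunter2"
api_key = os.environ["PAYMENTS_API_KEY"]
'''

for line_number, rule_name in find_suspect_secrets(sample):
    print(f"line {line_number}: {rule_name}")
```

The last line of the sample is the shape to aim for: the value comes from the environment at runtime, so nothing sensitive lands in the repository.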

Injection, wherever data flows

AI-generated code reliably fails on injection. It concatenates strings into SQL, writes unescaped output into HTML templates, and passes user input straight to the shell. Veracode measured an 85% failure rate on cross-site scripting tasks and 87% on log injection [2]. For any product that now includes an AI feature, prompt injection is an additional surface, where text from an email or a web page can override the system instructions an agent was given. As Mackenzie Jackson, a developer advocate at Aikido Security, puts it, once AI gets something to work it tends to forget parts of what it was meant to do, including sanitising the code [6].
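
The SQL case is easy to show side by side. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table, the function names, and the payload are assumptions chosen for illustration. The first lookup concatenates user input into the query, the second passes it as a bound parameter.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list[tuple]:
    # Vulnerable: the input becomes part of the SQL text itself.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list[tuple]:
    # Parameterised: the driver treats the input as a value, never as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = make_db()
payload = "x' OR '1'='1"  # classic injection string, not a real user name

print(len(find_user_unsafe(conn, payload)))  # every row in the table comes back
print(len(find_user_safe(conn, payload)))    # no user has that literal name
```

Both functions behave identically for ordinary names, which is why the flaw survives a demo. Only hostile input separates them.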

Authorisation the model never modelled

This is the finding that worries us most, because it is invisible in testing. An AI model answers the prompt in front of it, "make this endpoint return the user's orders," without ever modelling the question a senior engineer asks reflexively: which user is allowed to see which orders? The result is broken access control and insecure direct object references, where changing an ID in a request returns someone else's data. It is not theoretical. One fast-food recruitment endpoint accepted a record ID with no ownership check and exposed up to sixty-four million job applications to simple sequential enumeration [7]. CodeRabbit's analysis found insecure object references 1.91 times more likely in AI-authored changes than in human-authored ones [3].
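
The missing check is usually one comparison. The following sketch uses a hypothetical in-memory ORDERS store and invented names (Order, NotAllowed, get_order) to contrast a lookup that trusts the ID with one that verifies ownership first. It assumes the caller's identity has already been authenticated elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    id: int
    owner_id: int
    item: str


# Hypothetical data store standing in for a database table.
ORDERS = {
    1: Order(1, owner_id=10, item="keyboard"),
    2: Order(2, owner_id=20, item="monitor"),
}


class NotAllowed(Exception):
    """Raised when the order is missing or belongs to someone else."""


def get_order_unchecked(order_id: int) -> Order:
    # The pattern audits keep finding: any caller who guesses an ID gets the record.
    return ORDERS[order_id]


def get_order(requesting_user_id: int, order_id: int) -> Order:
    order = ORDERS.get(order_id)
    # Same error for "missing" and "not yours", so IDs cannot be probed for existence.
    if order is None or order.owner_id != requesting_user_id:
        raise NotAllowed(f"order {order_id} is not available to this user")
    return order
```

A happy-path test that fetches user 10's own order passes against both functions. Only a test that has user 10 request order 2 tells them apart.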

Tests that assert nothing

AI tools do write tests, and that is part of the trap. The tests tend to mirror the implementation and cover the happy path, so they pass while proving very little. A suite that confirms "the function ran without throwing" is not the same as one that confirms the function refuses an unauthorised request, rejects malformed input, or handles the third-party timeout. A senior review separates tests that assert correct behaviour from tests that merely assert the code executed, and in vibe-coded projects the second kind dominates.
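
The difference is easiest to see against a deliberately broken implementation. In this sketch, parse_quantity, parse_quantity_buggy, and the two test helpers are all hypothetical names invented for the example. The weak test passes for both implementations; the strong test passes only for the correct one.

```python
def parse_quantity(raw: str) -> int:
    """Parse an order quantity, rejecting anything that is not a positive integer."""
    value = int(raw)  # raises ValueError for non-numeric or empty input
    if value <= 0:
        raise ValueError("quantity must be positive")
    return value


def parse_quantity_buggy(raw: str) -> int:
    # Forgets the range check, so "0" and "-2" are accepted.
    return int(raw)


def weak_test(fn) -> bool:
    # Only proves the code ran on the happy path without throwing.
    fn("3")
    return True


def strong_test(fn) -> bool:
    # Asserts the result and the refusal of every invalid input.
    assert fn("3") == 3
    for bad in ("0", "-2", "abc", ""):
        try:
            fn(bad)
        except ValueError:
            continue
        raise AssertionError(f"accepted invalid quantity {bad!r}")
    return True
```

Running weak_test against parse_quantity_buggy returns True, which is exactly the false comfort described above. strong_test raises on the first invalid input the buggy version accepts.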

Errors that fail silently

AI-generated code characteristically catches an exception and then swallows it, or logs a generic line and returns as if nothing happened. The production consequence is a system that returns a success status while quietly corrupting data or dropping work. Analysis of AI agent failures highlights this exact pattern: the system passes its tests and reports "success" because the HTTP status code was 200, while the actual outcome is wrong [8]. Traditional monitoring does not catch this, because by every signal it watches, the system looks healthy.
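
A short sketch of the two shapes, using invented names throughout (StorageError, failing_write, save_order_silent, save_order_explicit) and a plain dictionary standing in for an HTTP response. The status codes are illustrative choices, not a prescription.

```python
import logging

logger = logging.getLogger("orders")


class StorageError(Exception):
    """Hypothetical error raised by the persistence layer."""


def failing_write(order: dict) -> None:
    # Stand-in for a database call that fails.
    raise StorageError("disk full")


def save_order_silent(order: dict, write) -> dict:
    try:
        write(order)
    except Exception:
        pass  # the failure disappears here
    return {"status": 200}  # reported as success even though nothing was saved


def save_order_explicit(order: dict, write) -> dict:
    try:
        write(order)
    except StorageError:
        # Record the failure with context, then tell the caller the truth.
        logger.exception("order write failed", extra={"order_id": order["id"]})
        return {"status": 503, "error": "order not saved"}
    return {"status": 200}
```

With failing_write, the silent version still answers 200, so a dashboard watching status codes sees nothing wrong. The explicit version leaves a log entry and a status that monitoring can act on.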

A codebase that forgot its own patterns

Each AI session generates its feature from scratch rather than extending what already exists. Over a few weeks this produces three different ways to check a user's role, four different database access patterns, and two incompatible approaches to handling dates. Every individual piece works. The system as a whole becomes progressively harder to change safely, because there is no single place to fix anything. Independent analysis has noted that AI-generated code tends to be simpler but more repetitive, and more prone to unused constructs [9]. Architectural drift is the finding that does the most long-term damage and shows up the least in a demo.

Dependencies that do not exist

AI models hallucinate package names. They confidently recommend a library that no registry has ever held, or blend two real names into a plausible fiction. Researchers found that 19.7% of AI-generated code samples referenced a non-existent package, and that 43% of those fabricated names recurred across repeated runs [10]. That repeatability is the danger: attackers register the predictable hallucinated names on npm and PyPI and wait for an unreviewed install. The practice now has a name, slopsquatting, and confirmed malicious packages have already reached tens of thousands of downloads [11].
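
One cheap control is to compare declared dependencies against a list a human has actually reviewed. The sketch below assumes a pip-style requirements file and a hand-maintained approved set; the function names and the package name fastjson-utils are made up for the example. Checking that a name merely exists on a registry is not enough, because a slopsquatted package does exist.

```python
import re


def normalise(name: str) -> str:
    """Normalise a package name as PEP 503 describes (case, '-', '_', '.')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def declared_packages(requirements_text: str) -> list[str]:
    """Extract normalised package names from simple requirements lines."""
    names = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):
            continue  # skip blanks and pip options such as -r or -e
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(normalise(match.group(0)))
    return names


def unvetted(requirements_text: str, approved: set[str]) -> list[str]:
    """Return declared packages that nobody has reviewed yet."""
    approved_names = {normalise(name) for name in approved}
    return [
        name
        for name in declared_packages(requirements_text)
        if name not in approved_names
    ]


requirements = """
requests==2.32.0
Flask_Login>=0.6   # session handling
fastjson-utils     # hypothetical name an assistant might invent
"""

print(unvetted(requirements, approved={"requests", "flask-login"}))
```

Anything the function returns is a prompt for a human to look the package up, check its publisher and history, and either approve it or remove it.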

Performance that collapses under load

The classic AI data-access pattern is the N+1 query: one query to fetch a list, then one more query per item to fill in its details. The code is clear, correct, and fine with ten records in development. It falls over with ten thousand in production. Performance issues appear roughly 1.4 times more often in AI-authored changes than in human-authored ones [3], and they are easy to miss precisely because they are not bugs in the usual sense. The logic is right. The behaviour at scale is not.
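
The pattern can be reproduced in a few lines with sqlite3. In this sketch the schema, the CountingDb wrapper, and both function names are assumptions for illustration. The wrapper counts queries so the difference is visible even with three rows.

```python
import sqlite3


class CountingDb:
    """In-memory database that counts how many queries are issued."""

    def __init__(self) -> None:
        self.queries = 0
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript(
            """
            CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
            INSERT INTO authors (id, name) VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Alan');
            INSERT INTO books (author_id, title) VALUES
                (1, 'Notes'), (1, 'Letters'), (2, 'Compilers');
            """
        )

    def query(self, sql: str, params: tuple = ()) -> list[tuple]:
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def titles_n_plus_one(db: CountingDb) -> dict[str, list[str]]:
    # One query for the list, then one more query per author.
    result = {}
    for author_id, name in db.query("SELECT id, name FROM authors ORDER BY id"):
        rows = db.query(
            "SELECT title FROM books WHERE author_id = ? ORDER BY id", (author_id,)
        )
        result[name] = [title for (title,) in rows]
    return result


def titles_joined(db: CountingDb) -> dict[str, list[str]]:
    # A single query; grouping happens in memory.
    result: dict[str, list[str]] = {}
    rows = db.query(
        "SELECT a.name, b.title FROM authors a "
        "LEFT JOIN books b ON b.author_id = a.id ORDER BY a.id, b.id"
    )
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without books still appear
            result[name].append(title)
    return result
```

Both functions return the same dictionary. The first issues one query plus one per author, so the count grows with the data; the second issues one query regardless of how many authors exist.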

Software you cannot see into

Finally, AI-generated code is written to pass acceptance criteria, not to be operated. Structured logging, tracing, metrics, and health checks are absent unless someone asks for them by name. When an incident happens, there is nothing to look at. In the widely reported Replit case, an agent ran a destructive database command during a code freeze and then generated thousands of fictional records, against explicit instructions to make no changes [12]. Without structured logging and an audit trail, that is the kind of failure no one can see coming or reconstruct afterwards. You cannot run a service in production that you cannot see into.
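
As a starting point, structured logging needs very little code. The sketch below uses only Python's standard logging and json modules; the JsonFormatter class, the build_logger helper, and the request_id and user_id context fields are assumptions for the example, not a standard schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""

    # Hypothetical context fields callers may pass through `extra=`.
    CONTEXT_FIELDS = ("request_id", "user_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Usage: write to an in-memory buffer here; a service would use stderr or a file.
buffer = io.StringIO()
audit_log = build_logger("orders", buffer)
audit_log.info("order refunded", extra={"request_id": "req-1", "user_id": 42})
print(buffer.getvalue().strip())
```

Because every entry is a JSON object with stable keys, an incident responder can filter by request or user instead of searching free text.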

What "safe to ship" actually means

"Safe to ship" is not a vibe. It is a checklist a named senior engineer is willing to sign. We anchor ours in recognised standards rather than opinion: the OWASP Top 10 for LLM Applications for any AI feature [13], the OWASP Application Security Verification Standard for the conventional surface, and a tiered view of risk that decides what an AI tool may draft and what a human must write from scratch. Authentication, authorisation, cryptography, payments, data migrations, and secrets handling sit in the red lane: an AI model may sketch them, but no one ships them without a human rewrite and a security review. The UK's National Cyber Security Centre makes the same point, calling for "deterministic controls to constrain AI-generated code, even when it is malicious or flawed" [6].
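
A tiered rule like this can be enforced mechanically, for example as a check on the files a change touches. The sketch below is one possible shape under stated assumptions: the directory names in RED_LANE_AREAS, the "standard" label for everything else, and both function names are invented for illustration, and a real project would map lanes to its own layout.

```python
from pathlib import PurePosixPath

# Hypothetical directory names standing in for the sensitive areas listed above.
RED_LANE_AREAS = {"auth", "authorization", "crypto", "payments", "migrations", "secrets"}


def lane_for(path: str) -> str:
    """Classify a repository path as 'red' or 'standard' by its directory names."""
    parts = {part.lower() for part in PurePosixPath(path).parts}
    return "red" if parts & RED_LANE_AREAS else "standard"


def red_lane_changes(changed_paths: list[str]) -> list[str]:
    """Return the changed files that need a human rewrite and a security review."""
    return [path for path in changed_paths if lane_for(path) == "red"]


changed = [
    "src/ui/button.tsx",
    "src/payments/refund.py",
    "db/migrations/0042_add_index.sql",
]
print(red_lane_changes(changed))
```

A check like this does not judge the code. It only makes sure the sensitive paths cannot merge without the named reviewer the process requires.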

The audit output is not a list of complaints. It is a decision: what is safe to ship now, what needs work first, and what should not have been built with an agent at all. That decision carries a name and accountability, which is the entire point.

Why this matters for founders and CTOs

For a funded founder, this is due diligence risk. Technical due diligence has become a standard, and often decisive, part of UK funding rounds, and a codebase with committed secrets, no failure-path tests, and several implementations of the same logic is exactly what a technical reviewer finds in an afternoon. Security failures erode investor confidence faster than almost anything else, and the cost of fixing them after a term sheet is far higher than a review beforehand.

For an enterprise CTO, it is liability. UK GDPR expects privacy and appropriate technical measures to be built in, not bolted on [15], so AI code that exposes personal data through a missing authorisation check is a direct regulatory exposure. Financial firms bound by operational resilience rules cannot meet them with software that has no observability. Forrester predicts that 75% of technology decision-makers will see their technical debt rise to a moderate or high level of severity by 2026, with the rapid development of AI solutions named as an accelerant [14]. A review is cheaper than the cleanup, and far cheaper than the breach.

Frequently asked questions

Is a vibe code audit the same as a security audit?

No. A security audit looks for vulnerabilities. A vibe code audit is broader: it assesses whether AI-built software is safe to ship, covering security alongside architecture, test quality, error handling, dependencies, performance, and operability. Security is one part of the verdict, not the whole of it. The output is a sign-off, not only a vulnerability list.

Does this mean we should stop using AI to write code?

No. AI coding tools genuinely accelerate delivery, and that is worth keeping. The point is that their output needs the same senior review any first draft requires, applied before customers rely on it. Used with a clear review gate, AI-built software ships safely. Used without one, it ships the recurring problems this article describes.

How long does a vibe code audit take?

It depends on the size of the codebase and how much AI-generated code it contains, but most reviews run from a few days to a couple of weeks. A focused product built quickly can be assessed faster than a large system with months of accumulated drift. We scope each audit before starting, so you know the timeline and the cost before any work begins.

What do we get at the end?

A clear written verdict from a named senior engineer: what is safe to ship now, what must be fixed first ranked by severity, and what should be rebuilt rather than patched. Each finding is specific, located in your code, and explained in plain terms, so your team can act on it without needing to interpret a tool's raw output.

We built our product with AI. Should we be worried?

Not worried, but you should look before you scale. If your software is in front of paying customers, or about to be, the findings here are common enough that an honest review is worth the time. The goal is not to criticise how it was built. It is to know exactly where it stands before more depends on it.

The verdict is the deliverable

AI lets a small team build something real in a fraction of the time it used to take. That is a genuine advance, and we are not here to talk anyone out of it. We are here to answer the one question the demo cannot: is it safe to ship? If you have built something with AI and you want a senior engineer to tell you honestly where it stands, book a Vibe Code Audit. We will tell you what we find, and we will put our name to it.

Sources

[1] Songwen Zhao et al., "Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks" (SUSVIBES benchmark), Carnegie Mellon University, arXiv:2512.03262, 2026. https://arxiv.org/abs/2512.03262
[2] Veracode, Spring 2026 GenAI Code Security Update, 24 March 2026. https://www.veracode.com/blog/spring-2026-genai-code-security/
[3] CodeRabbit, State of AI vs Human Code Generation Report, 17 December 2025. https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report (per-category figures in the gated whitepaper: https://www.coderabbit.ai/whitepapers/state-of-AI-vs-human-code-generation-report)
[4] GitGuardian, The State of Secrets Sprawl 2026, 17 March 2026. https://blog.gitguardian.com/the-state-of-secrets-sprawl-2026/
[5] Cloud Security Alliance, "Vibe Coding Security Crisis: Credential Sprawl and SDLC Debt", 31 March 2026. https://labs.cloudsecurityalliance.org/research/csa-research-note-ai-generated-code-security-vibe-coding-202/
[6] Kevin Poireault, "How Security Leaders Can Safeguard Against Vibe Coding Security Risks", Infosecurity Magazine, 6 April 2026. https://www.infosecurity-magazine.com/news-features/how-safeguard-vibe-coding-security/
[7] Ian Carroll and Sam Curry, write-up of the McDonald's "McHire" IDOR exposing up to 64 million job applications, 2025. https://ian.sh/mcdonalds
[8] Aryan Kargwal, "Why AI Agents Break: A Field Analysis of Production Failures", Arize AI, 29 January 2026. https://arize.com/blog/common-ai-agent-failures/
[9] Domenico Cotroneo et al., "Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity", arXiv:2508.21634, August 2025. https://arxiv.org/abs/2508.21634
[10] Joseph Spracklen et al., "We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs", USENIX Security 2025, arXiv:2406.10279. https://arxiv.org/abs/2406.10279
[11] Cloud Security Alliance, "Slopsquatting: AI Code Hallucinations Fuel Supply Chain Attacks", 19 April 2026. https://labs.cloudsecurityalliance.org/research/csa-research-note-slopsquatting-ai-supply-chain-20260419-csa/
[12] Beatrice Nolan, Fortune report on the Replit AI agent that deleted a production database during a code freeze, 23 July 2025. https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/
[13] OWASP, Top 10 for LLM Applications 2025. https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/
[14] Forrester, "Predictions 2025: Technology And Security" (press release), 22 October 2024. https://www.forrester.com/press-newsroom/forrester-predictions-2025-tech-security/
[15] UK GDPR, Article 25 (Data protection by design and by default). https://www.legislation.gov.uk/eur/2016/679/article/25

Not sure what you are shipping? Our Vibe Code Audit puts senior engineers across your AI-built software and signs off what is safe to ship. Fixed fee, scored review, a clear go or no-go.
