How to review AI-generated code: a practical checklist

AI code quality · Zegaware engineering · 10 June 2026 · 10 min read

Last updated: 10 June 2026

Review AI-generated code the way you would review a junior who is fast, confident and occasionally wrong about things they cannot see. Read every diff. Check authorisation first, then secrets, input handling, test quality, error handling, dependencies and architectural fit. Do not trust a green test suite or a 200 response as proof the code is correct.

Why AI-generated code needs a structured review

Generative tools produce code that runs. Running is not the same as correct, and correct is not the same as safe. Carnegie Mellon's SUSVIBES benchmark found that agent-written code was functionally correct around 61% of the time but secure only 10.5% of the time, and that over 80% of the solutions that actually worked still contained a critical vulnerability [1]. Veracode's Spring 2026 update found that 45% of AI-generated code introduced at least one OWASP Top 10 vulnerability, leaving a 55% pass rate that has not improved across more than 150 models in two years [2]. Their summary is blunt: "The productivity revolution is here. The security revolution isn't" [2]. CodeRabbit's analysis of real pull requests found AI-authored changes carried roughly 1.7 times more issues than human-authored ones [3].

The pattern matters for how you review. The defects are not random. They cluster in the things a model cannot reason about from a prompt: who is allowed to call this, what the caller might send, and what happens when the call fails. A structured review targets those clusters in order of damage. The checklist below is the review a senior engineer runs, and it is close to the sequence we walk through in "What a vibe code audit actually finds".

The review checklist, in priority order

Work top to bottom. The order reflects damage and likelihood: the checks near the top fail loudly and publicly when missed, the checks near the bottom protect maintainability. For each area, the reviewer has one question they are trying to answer about the diff in front of them.

1. Authorisation and access control (check this first)

This is the highest-value check, because the failures are invisible in a demo and catastrophic in production. AI tools are good at the happy path: a logged-in user fetching their own record. They are poor at the boundary: the same endpoint asked for someone else's record. CodeRabbit found insecure direct object references, where a user reads or edits a record simply by changing an identifier, 1.91 times more likely in AI-authored code [3]. This is not theoretical. The McDonald's "McHire" recruitment platform exposed up to 64 million job applications through a single endpoint with an unprotected, sequential record ID [4]. Anyone who could increment a number could read another applicant's data.

What the reviewer looks for: every endpoint, query and mutation that takes an identifier from the caller. For each one, find the line that checks the caller is allowed to act on that specific record, not merely that they are logged in. The question is simple: "If I change this ID to one I do not own, what stops me?" If the answer is "nothing in this diff", the code fails the check. Authentication (who you are) and authorisation (what you may touch) are different concerns, and AI-generated code routinely ships the first while omitting the second.
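
The gap between the two can be sketched in a few lines of Python. Everything named here (the `Application` record, the in-memory `APPLICATIONS` store, both lookup functions) is hypothetical and stands in for whatever your framework and database provide. It illustrates the check the reviewer is hunting for, not a definitive implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Application:
    id: int
    owner_id: int
    body: str


# Hypothetical in-memory store standing in for a database table.
APPLICATIONS = {
    1: Application(id=1, owner_id=100, body="alice's application"),
    2: Application(id=2, owner_id=200, body="bob's application"),
}


class Forbidden(Exception):
    """Raised when the caller may not act on the requested record."""


def get_application_unsafe(caller_id: int, application_id: int) -> Application:
    # Authentication only: the caller is known, but ownership is never checked.
    # Changing application_id returns someone else's record (an IDOR).
    return APPLICATIONS[application_id]


def get_application(caller_id: int, application_id: int) -> Application:
    record = APPLICATIONS.get(application_id)
    # Authorisation: the caller must own this specific record.
    # A missing record and a foreign record fail the same way, so the
    # error does not reveal which IDs exist.
    if record is None or record.owner_id != caller_id:
        raise Forbidden(f"caller {caller_id} may not read application {application_id}")
    return record
```

In a diff, the first function is the one that demos perfectly. The ownership comparison in the second is the line to look for.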

2. Secrets in source

Read the diff for anything that looks like a key, token, password or connection string. Assistants pull plausible-looking configuration into the file they are editing, and the convenient place to put a value is inline. GitGuardian's State of Secrets Sprawl 2026 found that commits made with Claude Code assistance leaked secrets at a rate of 3.2%, more than double the 1.5% baseline for commits generally, and that 28.65 million new secrets were added to public repositories in 2025 [5].

What the reviewer looks for: hard-coded credentials, private keys, internal hostnames and any value that should come from an environment variable or a secret manager. The question is: "If this file were public tomorrow, what would leak?" A secret committed once is compromised even after you delete it, because it remains in the history. Treat any inline credential as a stop-the-line finding: rotate the value, then move it out of source.
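
One shape the fix can take, as a minimal sketch: read the value from the environment and refuse to start without it. The helper name `require_secret` and the `MissingSecret` exception are assumptions made for illustration. A team using a secret manager would call that instead of `os.environ`.

```python
import os


class MissingSecret(RuntimeError):
    """Raised at startup when a required secret is not configured."""


def require_secret(name: str) -> str:
    # Read the value from the environment rather than from source.
    value = os.environ.get(name, "")
    if not value:
        # Fail loudly at startup instead of falling back to an inline default.
        raise MissingSecret(f"environment variable {name} is not set")
    return value
```

The design choice worth noticing is the absence of a default. A fallback value in source is still a secret in source.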

3. Input handling and injection

Assume every input is hostile and trace where it goes. Generated code tends to concatenate rather than parameterise, and to render rather than escape. Veracode measured an 85% failure rate on cross-site scripting and 87% on log injection across the models it tested [2]. CodeRabbit found cross-site scripting 2.74 times more likely in AI-authored changes [3]. Injection remains the dominant class because the model writes the shape of the query or template that reads naturally, not the parameterised form that is safe.

What the reviewer looks for: user input reaching a database query, a shell command, an HTML response, a log line or a file path without validation, parameterisation or encoding at the boundary. The question is: "Where does this value cross from data into code, and what makes it safe at that line?" For systems that pass user text into a model, the same discipline applies to prompt injection and insecure output handling, both named in the OWASP Top 10 for LLM Applications [6].
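
To make the boundary concrete, here is a small sketch using Python's standard `sqlite3` and `html` modules. The `users` table and the function names are invented for the example. The point is the contrast between concatenating input into a query and passing it as a parameter, and between rendering input and escaping it.

```python
import html
import sqlite3


def make_db() -> sqlite3.Connection:
    # Throwaway in-memory database with two example rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "alice@example.com"), ("bob", "bob@example.com")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Concatenation: the input becomes part of the SQL text.
    query = "SELECT email FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def find_user(conn: sqlite3.Connection, name: str) -> list:
    # Parameterisation: the driver passes the input as data, never as SQL.
    rows = conn.execute("SELECT email FROM users WHERE name = ?", (name,))
    return [row[0] for row in rows]


def render_greeting(name: str) -> str:
    # Encode at the HTML boundary so markup in the input is displayed, not run.
    return "<p>Hello, " + html.escape(name) + "</p>"
```

Passing the input `' OR '1'='1` to `find_user_unsafe` returns every email in the table. The same input to `find_user` matches no rows, because it is compared as a literal name.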

4. Test quality: does the test assert behaviour, or only that the code ran?

A passing test suite is reassuring and frequently misleading. AI tools write tests that exercise the code and then assert weak conditions: that a function returned without throwing, that a response carried a 200 status, that an object was not null. Those tests pass whether or not the logic is correct. Arize's field analysis of production failures describes exactly this failure mode: systems that report "success" because the HTTP status code was 200 while the actual outcome is wrong [7].

What the reviewer looks for: in each test, the assertion. Does it pin the specific value or state the feature is supposed to produce, or does it only confirm the code executed? The question is: "If the implementation were quietly wrong, would this test still pass?" A test that checks for a 200 response, or that a list was returned, will not catch a function returning the wrong list. Coverage measures lines executed, not behaviour verified. Read the assertions, not the percentage.
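
A toy illustration of the difference, with every name invented for the example: one implementation is quietly wrong, and only the check that pins the expected value notices.

```python
USERS = [
    {"name": "alice", "active": True},
    {"name": "bob", "active": False},
]


def active_users_buggy(users):
    # Quietly wrong: the filter on "active" is missing.
    return [u["name"] for u in users]


def active_users(users):
    return [u["name"] for u in users if u["active"]]


def weak_check(fn) -> bool:
    # Only confirms the code ran and returned a list.
    result = fn(USERS)
    return result is not None and isinstance(result, list)


def strong_check(fn) -> bool:
    # Pins the specific value the feature is supposed to produce.
    return fn(USERS) == ["alice"]
```

`weak_check` passes for both implementations. `strong_check` passes only for the correct one, which is the property the reviewer's question is testing for.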

5. Error handling and silent failures

Look for the catch that swallows. Generated code often wraps risky operations in handlers that log nothing, return a default, or report success regardless of what happened. The result is a system that fails quietly: the database write was rejected, the external call timed out, the validation was skipped, and the caller is told everything is fine. This is the same 200-while-wrong pattern, and it is where incidents hide until they become expensive [7].

What the reviewer looks for: empty catch blocks, errors logged and then ignored, default return values that mask a failed operation, and success responses sent before the work is confirmed. The question is: "When this fails, who finds out, and how?" If the answer is "nobody, until a customer complains", the error handling fails the check. Failures should be loud, attributable and acted upon, not absorbed.
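
The swallowing pattern and one possible alternative, sketched in Python. The `failing_write` function is a hypothetical stand-in for a rejected database write, and the response dictionaries are illustrative rather than any framework's real response type.

```python
import logging

logger = logging.getLogger("orders")


class SaveFailed(Exception):
    """Raised when an order could not be persisted."""


def failing_write(order):
    # Hypothetical stand-in for a database write that is rejected.
    raise ConnectionError("database unavailable")


def save_order_silent(order, write=failing_write):
    try:
        write(order)
    except Exception:
        pass  # Swallowed: nothing is logged and nothing is raised.
    # Success is reported whether or not the write happened.
    return {"status": 200, "ok": True}


def save_order(order, write=failing_write):
    try:
        write(order)
    except ConnectionError as exc:
        # Loud and attributable: log with context, then let the caller know.
        logger.error("order %s not saved: %s", order["id"], exc)
        raise SaveFailed(f"order {order['id']} was not saved") from exc
    # Success is reported only after the write has returned.
    return {"status": 200, "ok": True}
```

With the failing write, `save_order_silent` still returns a 200, while `save_order` raises `SaveFailed` and leaves a log line naming the order.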

6. Dependencies and hallucinated packages

Check every import you do not recognise against the real registry. Models invent packages that do not exist. Spracklen and colleagues, in work presented at USENIX Security 2025, found that 19.7% of AI-generated code samples referenced a non-existent package, and that 43% of those hallucinated names recurred across runs [8]. The recurrence is the danger: a predictable fake name is one an attacker can register and fill with malicious code, so the next developer who accepts the suggestion installs it. This is the supply-chain risk now called slopsquatting.

What the reviewer looks for: imported packages that are unfamiliar, misspelled, or close to a well-known name. For each, confirm it exists, is the package you intended, is actively maintained, and carries a licence you can use. The question is: "Did a human choose this dependency, or did the model?" New third-party code is the largest attack surface most changes introduce, and it deserves a deliberate decision rather than an autocomplete.

7. Architectural consistency

Read the change against the codebase around it. AI tools optimise for a working answer to the prompt in front of them, with no memory of your conventions. The Naples study by Cotroneo and colleagues found AI-generated code tends to be simpler but more repetitive, and more prone to unused constructs such as dead variables and imports [9]. In a review you see the symptoms: a second way of doing database access, a re-implemented helper that already exists, a pattern that contradicts the rest of the module.

What the reviewer looks for: duplication of existing utilities, inconsistent error or logging conventions, a new data-access or validation approach that competes with the established one, and leftover scaffolding. The question is: "Does this look like it was written by someone who had read the rest of the codebase?" Consistency is not cosmetic. Every divergent pattern is one more thing the next engineer must learn, and one more place where the two approaches will eventually disagree.
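
Leftover scaffolding is the one symptom here that is easy to detect mechanically. As a rough sketch, the function below uses Python's `ast` module to report imported names that are never referenced in the same file. It is deliberately simplistic: it ignores re-exports, `__all__` and names used only inside strings, so an established linter is the better tool in practice. The function name is invented for the example.

```python
import ast


def unused_imports(source: str) -> set:
    """Return imported names that never appear as an identifier in the source."""
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the name "a" unless an alias is given.
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
    # Every identifier read or written anywhere in the file.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return imported - used
```

Duplication and competing patterns have no equivalent shortcut. They still need a reviewer who knows the rest of the codebase.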

Running the checklist

The order is deliberate. Authorisation and secrets are the findings that make headlines, so they come first and are non-negotiable. Input handling, tests and error handling are where most real defects live. Dependencies and architecture protect the next six months. None of this requires you to distrust the tools or stop using them. It requires you to read what they produce with the assumption that it is plausible, fast and unverified. For the wider question of when that verified code is ready for production, see "Is AI-generated code safe to ship?".

Frequently asked questions

Can I just use an AI code-review tool?

Use one, but do not rely on it alone. Automated reviewers catch known patterns at scale: hard-coded secrets, some injection sinks, common smells. They cannot make the judgement a senior review makes, because they do not know your authorisation model, your data sensitivity or your conventions. Nor can you wait for the generating models to close the gap: Veracode found no security improvement across 150-plus models in two years [2]. Tools narrow the search; a person still decides what matters.

What is the most important thing to check?

Authorisation. Confirm that every operation taking a record identifier checks the caller is allowed to act on that specific record, not merely that they are logged in. This single class of defect, the insecure direct object reference, exposed up to 64 million applications in the McDonald's McHire breach through one unprotected sequential ID [4]. It is invisible in a demo and total in its impact.

How is this different from a normal code review?

The mechanics are the same; the assumptions invert. In a human review you assume competence and look for mistakes. In an AI review you assume fluent, confident output that may have no understanding behind it, so you verify intent, not just correctness. You read every line as written by someone who never saw your codebase or your threat model. The defects cluster differently, so you look in different places first.

Does this mean we should stop letting AI write code?

No. The productivity is real and the tools are not going away. The point is that speed of generation has outrun verification, so the human role shifts from typing to reviewing. Used with a disciplined review in place, AI assistance is a net gain. Used as a substitute for review, it ships the more than 80% of working solutions that still carry a critical vulnerability [1].

Get a structured review of your AI-generated code

If your team is shipping AI-assisted code faster than it can review it, a structured outside review finds the authorisation gaps, exposed secrets and silent failures before your users or an attacker do. The Vibe Code Audit applies this checklist to your codebase and returns prioritised, evidence-backed findings. To arrange one, see the Vibe Code Audit.

Sources

[1] Songwen Zhao et al., "Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks" (SUSVIBES benchmark), Carnegie Mellon University, arXiv:2512.03262, 2026. https://arxiv.org/abs/2512.03262
[2] Veracode, Spring 2026 GenAI Code Security Update, 24 March 2026. https://www.veracode.com/blog/spring-2026-genai-code-security/
[3] CodeRabbit, State of AI vs Human Code Generation Report, 17 December 2025. https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report
[4] Ian Carroll and Sam Curry, write-up of the McDonald's "McHire" IDOR exposing up to 64 million job applications, 2025. https://ian.sh/mcdonalds
[5] GitGuardian, The State of Secrets Sprawl 2026, 17 March 2026. https://blog.gitguardian.com/the-state-of-secrets-sprawl-2026/
[6] OWASP, Top 10 for LLM Applications 2025. https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/
[7] Aryan Kargwal, "Why AI Agents Break: A Field Analysis of Production Failures", Arize AI, 29 January 2026. https://arize.com/blog/common-ai-agent-failures/
[8] Joseph Spracklen et al., "We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs", USENIX Security 2025, arXiv:2406.10279. https://arxiv.org/abs/2406.10279
[9] Domenico Cotroneo et al., "Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity", arXiv:2508.21634, August 2025. https://arxiv.org/abs/2508.21634
