AI Code Review Grounded in Executed Tests

What Is AI Code Review?
AI code review is the use of a large language model to analyze a pull request and flag bugs, security issues, and design problems before a human reviewer reads the code. A good AI reviewer reads the diff in the context of the codebase, comments inline on the lines that matter, and leaves the judgment calls to humans.
That is the definition. The interesting question in 2026 is not whether AI can comment on code; every tool can. It is whether the review is grounded in anything beyond the model's prediction. Most AI reviewers read your diff and guess. This guide covers how AI code review works, why the static-diff-only approach hits a ceiling, and how Qodex grounds review in your actual codebase and in verification against your running app.
Why Most AI Reviewers Guess
The standard AI code review pipeline looks like this: the tool receives a webhook when a pull request opens, fetches the diff, sends it to an LLM with some surrounding context, and posts whatever the model says as review comments.
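The pipeline above can be sketched in a few lines. This is an illustration only, not any vendor's implementation: `fetch_diff`, `ask_model`, and `post_comment` are hypothetical callables standing in for the webhook handler's real integrations.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReviewComment:
    path: str
    line: int
    body: str


def review_pull_request(
    fetch_diff: Callable[[int], str],                      # hypothetical: returns the PR's patch text
    ask_model: Callable[[str], List[ReviewComment]],       # hypothetical: LLM call
    post_comment: Callable[[int, ReviewComment], None],    # hypothetical: posts to the PR
    pr_number: int,
) -> int:
    """Diff-only review: fetch the patch, ask the model, post what it says."""
    diff = fetch_diff(pr_number)
    comments = ask_model(diff)  # a prediction from text; nothing is executed
    for comment in comments:
        post_comment(pr_number, comment)
    return len(comments)
```

Note that no step in this loop runs the code or checks a comment against observed behavior; every comment is posted exactly as the model produced it.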
This works surprisingly well for a class of problems: typos, obvious null-handling mistakes, style drift, missing error handling. But it has a structural limit: the model never executes anything. It predicts, from the text of the patch, whether the code is correct. That leads to three familiar failure modes:
Confident false positives. The model flags a "bug" that the surrounding code already handles, because the relevant context was not in the prompt window.
Unverifiable claims. "This may cause a race condition" is a hypothesis, not a finding. Nobody on the team can act on it without doing the investigation themselves, which defeats the point.
Noise fatigue. Once a reviewer posts enough low-confidence comments, developers stop reading them. The review becomes a ritual instead of a gate.
The fix is not a better prompt. It is grounding: give the reviewer real knowledge of the codebase, and give it a way to check its claims against the running application.
How Qodex Grounds Code Review
Qodex is an autonomous AI QA platform, so its PR review sits on top of a system that already knows how to explore an application, generate test scenarios, and execute them. The review is one GitHub App install away, and every piece below is shipped today, apart from the roadmap item in the final subsection.
The reviewer knows your codebase, not just your diff
When you connect the Qodex GitHub App, you can link your repositories (up to 10 per project, so frontend, backend, and services can all contribute context). Qodex clones each repo once, runs a deterministic analyzer over it, and then deletes the clone. What it keeps is structured knowledge: the actual route table, the auth wiring, form validation schemas, ORM models, and the test framework in use. During a review, the agent can also read specific files and search the codebase live through the GitHub API, with secret redaction applied to everything it reads.
This matters because most false positives in diff-only review come from missing context. A reviewer that can look up the real route handler or the real validation schema does not have to guess what the rest of the codebase does.
Findings are filtered before they reach you
Every PR review runs the diff through the LLM, then filters the findings: low-confidence findings are dropped, findings below your configured severity threshold are dropped, and excluded paths (generated code, vendored files) are skipped. Findings you have previously dismissed in that project are filtered too. Inline comments anchor to the exact diff lines, and a guard strips code suggestions that would anchor to comment or import lines where an auto-fix would be wrong, marking them as anchor-uncertain instead.
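A minimal sketch of that filtering stage, under stated assumptions: the four-level severity scale, the numeric `confidence` field, the `min_confidence` cutoff, and keying dismissals on `(path, title)` are all illustrative choices, not Qodex's documented internals.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Iterable, List, Tuple

SEVERITY_ORDER = ["low", "medium", "high", "critical"]  # assumed scale


@dataclass(frozen=True)
class Finding:
    path: str
    title: str
    severity: str      # one of SEVERITY_ORDER
    confidence: float  # 0.0 to 1.0, assumed to be reported by the model


def filter_findings(
    findings: Iterable[Finding],
    severity_threshold: str = "medium",
    min_confidence: float = 0.6,                  # hypothetical cutoff
    exclude_globs: Iterable[str] = (),
    dismissed: Iterable[Tuple[str, str]] = (),    # (path, title) pairs dismissed earlier
) -> List[Finding]:
    """Drop low-confidence, low-severity, excluded-path, and dismissed findings."""
    floor = SEVERITY_ORDER.index(severity_threshold)
    globs = list(exclude_globs)
    dismissed_keys = set(dismissed)
    kept = []
    for f in findings:
        if f.confidence < min_confidence:
            continue
        if SEVERITY_ORDER.index(f.severity) < floor:
            continue
        if any(fnmatchcase(f.path, g) for g in globs):
            continue
        if (f.path, f.title) in dismissed_keys:
            continue
        kept.append(f)
    return kept
```

The sketch uses Python's `fnmatchcase`, whose `*` also matches `/`, so a pattern like `dist/**` covers nested files. A real glob matcher may treat `**` differently, so the exact matching rules here should be read as an approximation.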
Verification probes check claims against your preview deployment
This is where Qodex departs from the static-diff pack. When a PR has a preview deployment (discovered through the GitHub Deployments API), Qodex runs verification probes: safe, GET-only HTTP requests against the preview environment to test whether a flagged issue is actually observable. The probe layer is SSRF-guarded and restricted to the deployment's host. A finding that a probe confirms carries evidence, not speculation.
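To make the probe constraints concrete, here is one way such a guard could look. It is a sketch, not Qodex's code: the function name `is_probe_allowed` and its exact checks are assumptions. It only decides whether a request may be sent; it sends nothing itself.

```python
import ipaddress
from urllib.parse import urlsplit


def is_probe_allowed(method: str, url: str, deployment_host: str) -> bool:
    """Return True only for GET requests aimed at the preview deployment's own host."""
    if method.upper() != "GET":
        return False
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # hostname ignores any "user@" prefix, so "https://good@evil/" resolves to "evil".
    host = (parts.hostname or "").lower()
    if host != deployment_host.lower():
        return False
    # Reject IP-literal hosts in private, loopback, or link-local ranges.
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a hostname, not an IP literal
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)
```

A production guard would typically go further, for example by checking the addresses a hostname resolves to and refusing to follow redirects off the deployment host. Those steps are omitted here to keep the sketch free of network calls.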
A Check Run that can gate the merge
Every review posts a GitHub Check Run: in-progress when the review starts, then success, neutral, or failure when it completes. By default the check never blocks anything. If you enable block_pr_merge in your config, the check fails only when a verified finding meets your configured blocking severity (default: critical). That is a deliberate design: an unverified LLM hypothesis should never hold up your merge.
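The gating rule can be expressed as a small decision function. This is a sketch under assumptions: the severity scale is illustrative, and the split between `success` (no findings) and `neutral` (findings that do not block) is a guess at how the two non-failing conclusions might be used, since the text above does not define it.

```python
from dataclasses import dataclass
from typing import Iterable

SEVERITY_ORDER = ["low", "medium", "high", "critical"]  # assumed scale


@dataclass(frozen=True)
class Finding:
    severity: str   # one of SEVERITY_ORDER
    verified: bool  # True when a probe confirmed the finding


def check_conclusion(
    findings: Iterable[Finding],
    block_pr_merge: bool = False,
    block_on_severity: str = "critical",
) -> str:
    """Fail only on verified findings at or above the blocking severity."""
    findings = list(findings)
    bar = SEVERITY_ORDER.index(block_on_severity)
    blocking = [
        f for f in findings
        if f.verified and SEVERITY_ORDER.index(f.severity) >= bar
    ]
    if block_pr_merge and blocking:
        return "failure"
    # Assumption: "neutral" signals non-blocking findings, "success" a clean review.
    return "neutral" if findings else "success"
```

With the defaults, an unverified critical finding yields `neutral`, which mirrors the design point: a hypothesis alone never fails the check.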
Configured per repo with .qodex.yaml
Review behavior lives in a .qodex.yaml file at the repo root, read from the PR's head commit so config changes apply in the same PR that makes them:
pr_review:
  enabled: true
  severity_threshold: medium
  block_pr_merge: false
  block_on_severity: critical
  paths:
    exclude:
      - "dist/**"
      - "**/*.generated.ts"
Driven from the PR with @qodex commands
Maintainers can talk to the reviewer in the PR itself: @qodex review re-runs the review on demand, and @qodex help lists available commands. Commands are gated by author association, so only owners, members, and collaborators can trigger them.
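Author-association gating could look roughly like the following. GitHub includes an `author_association` value such as `OWNER`, `MEMBER`, `COLLABORATOR`, or `CONTRIBUTOR` on issue and PR comments; the parsing rules and the `parse_command` helper are hypothetical, not Qodex's implementation.

```python
import re
from typing import Optional

# Associations allowed to trigger commands, per the description above.
ALLOWED_ASSOCIATIONS = {"OWNER", "MEMBER", "COLLABORATOR"}

# Assumed syntax: the command must start a line of the comment.
COMMAND_RE = re.compile(r"^\s*@qodex\s+(review|help)\b", re.IGNORECASE | re.MULTILINE)


def parse_command(comment_body: str, author_association: str) -> Optional[str]:
    """Return "review" or "help" if an authorized author issued a command, else None."""
    if author_association not in ALLOWED_ASSOCIATIONS:
        return None
    match = COMMAND_RE.search(comment_body)
    return match.group(1).lower() if match else None
```

Checking the association before parsing means a comment from an outside contributor is ignored regardless of its content.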
Where this is headed: running your tests against the diff
Qodex's core platform generates runnable test scenarios and replays them deterministically at zero LLM cost. The PR review is built to converge with that: map the changed routes and handlers in a diff to the test scenarios that cover them, run those scenarios against the PR's preview deployment, and post pass/fail results as review evidence. That is the end state the architecture points at: a review where "does this change break anything" is answered by executed tests, not by a model's guess.
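The mapping step described above might be sketched as follows. Because this part is roadmap rather than shipped behavior, everything here is hypothetical: the `Scenario` type, the `file_to_routes` index, and the route string format are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Scenario:
    name: str
    routes: FrozenSet[str]  # routes the scenario exercises, e.g. "GET /users"


def select_scenarios(
    changed_files: Iterable[str],
    file_to_routes: Dict[str, Set[str]],  # hypothetical index from repo analysis
    scenarios: Iterable[Scenario],
) -> List[Scenario]:
    """Pick the scenarios that cover at least one route touched by the diff."""
    touched: Set[str] = set()
    for path in changed_files:
        touched.update(file_to_routes.get(path, ()))
    return [s for s in scenarios if s.routes & touched]
```

The selected scenarios would then be run against the preview deployment, and their pass/fail results attached to the review as evidence.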
AI Code Review Tools Compared
An honest comparison. All three tools below are good at what they were built for.
| | GitHub Copilot Code Review | CodeRabbit | Qodex |
|---|---|---|---|
| How it reviews | LLM review of the diff, native in GitHub | LLM review with PR summaries, walkthroughs, and chat | LLM review grounded in analyzed repo knowledge |
| Codebase knowledge | Repository context within GitHub | Indexes the repo for context | Deterministic analysis: routes, auth, schemas, ORM models |
| Execution evidence | None, static review | None, static review | GET-only verification probes against the PR's preview deployment |
| Merge gating | Via required reviews | Pre-merge checks | GitHub Check Run; blocks only on verified findings at your severity bar |
| Config | GitHub settings | YAML config | .qodex.yaml per repo |
| Pricing model | Included with paid Copilot plans | Free tier; Pro from $24/user/month billed annually | Free tier; paid plans via sales |
GitHub Copilot code review is the lowest-friction option if your team already pays for Copilot. It lives natively in github.com, requires no third-party install, and is a solid first pass for style and obvious bugs. Its reviews are static and its depth depends on what fits in context.
CodeRabbit is the most polished of the dedicated review startups. Its PR summaries and walkthroughs are genuinely useful for orienting human reviewers, it supports conversational follow-ups in the PR, and it learns from your feedback. It is free for public repositories. Like Copilot, its findings are predictions from the diff and surrounding context, not verified behavior.
Qodex is the right choice when you care about evidence. It is the only one of the three that probes the running preview deployment to verify findings, and the only one attached to a full API testing platform, so the path from "this diff looks risky" to "here is the failing test scenario" is one product, not two.
How to Set Up AI Code Review with Qodex
Connect GitHub. Sign in to Qodex and install the GitHub App on the repositories you want reviewed.
Link your repos. Grant the install to your Qodex project and link the repositories. Qodex analyzes each one and builds its knowledge of your routes, auth, and schemas.
Add .qodex.yaml (optional). Reviews work with defaults out of the box. Add the config file when you want to tune severity thresholds, exclude paths, or enable merge blocking.
Open a pull request. Qodex posts a walkthrough, inline findings on the diff, and a Check Run with the outcome. If a preview deployment exists, verified findings carry probe evidence.
Tune from the PR. Use @qodex review to re-run after pushes, and dismiss findings you disagree with; dismissed findings are filtered from future reviews in that project.
Best Practices: AI Plus Human Review
Let the AI go first, not last. Run the AI review on PR open so human reviewers read code that has already been swept for mechanical issues, and spend their attention on design and intent.
Demand evidence for severity. Treat unverified "critical" claims from any AI tool as hypotheses. Block merges only on findings that are verified or human-confirmed.
Tune the noise down, not the trust up. If the reviewer is noisy, raise the severity threshold and exclude generated paths. A quiet reviewer that is right keeps its audience.
Keep humans on the why. AI is strong on "this line is wrong." Humans are strong on "this approach is wrong." A review process that uses both beats either alone.
Ready to see review comments backed by evidence instead of guesses? Connect your repo to Qodex and open a pull request.
Frequently Asked Questions
What is AI code review?
AI code review uses a large language model to analyze pull requests and flag bugs, security issues, and design problems before or alongside human review. The tool reads the diff, posts inline comments on specific lines, and summarizes the change. The best implementations ground their analysis in knowledge of the full codebase and verify findings against a running environment rather than relying on the model's prediction alone.
Can AI replace human code review?
No, and it should not. AI reviewers are excellent at mechanical issues: null handling, missing validation, security anti-patterns, inconsistencies with the rest of the codebase. They are weak at judging intent, architecture, and product trade-offs. The effective setup is layered: the AI sweeps every PR first, and humans review with that noise already cleared.
How is Qodex different from CodeRabbit or GitHub Copilot code review?
Copilot and CodeRabbit review the diff statically: the model predicts problems from the patch and repository context. Qodex adds two layers of grounding. First, it analyzes your linked repositories into structured knowledge (routes, auth wiring, validation schemas, ORM models) that informs every review. Second, it runs safe, GET-only verification probes against the PR's preview deployment, so findings can carry observed evidence instead of speculation. Its Check Run only blocks merges on verified findings.
Does AI code review work with private repositories?
Yes. Qodex reviews private repositories through a GitHub App install scoped to the repositories you choose. Repo analysis uses a one-time shallow clone that is deleted after processing, file reads go through the GitHub API, and everything the agent reads passes through secret redaction that strips tokens, keys, and credentials before any content reaches the model.
How do I control what the AI comments on?
With Qodex, a .qodex.yaml file in your repo root sets the severity threshold, excludes paths like generated or vendored code, and controls whether the Check Run can block merges. Config is read from the PR's head commit, so you can change review behavior in the same PR. Findings you dismiss are remembered and filtered from future reviews in that project.
Can AI code review block a merge?
It can, but it should do so carefully. Qodex posts a GitHub Check Run on every review, and by default it never blocks. If you enable block_pr_merge, the check fails only when a finding is verified by a probe and meets your configured blocking severity, which defaults to critical. Unverified model output never gates your merge.