Findings

Findings are confirmed bugs, failures, or vulnerabilities that Qodex records with evidence. Not every failed run becomes a bug report: Qodex first decides whether the failure is a real product issue, a stale test, or an environment problem.

What happens when a test fails

When a scenario fails or a security probe lands, Qodex does more than show a red X. It analyzes the failure, classifies it, and writes a finding only when the failure looks like a real issue. A finding includes severity, reproduction steps, evidence, and the affected endpoint or page. Qodex deduplicates findings against the existing set so the same bug does not pile up across nightly runs.

Severity model

Five levels, with explicit definitions enforced by the security skill:

Severity	What it means
critical	RCE, SQLi with data access, auth bypass to admin, SSRF to cloud metadata, exposed secrets
high	Stored XSS, IDOR with data exposure, CSRF on account actions, privilege escalation, broken access control
medium	Reflected XSS, CSRF on low-impact actions, info disclosure, missing rate limiting
low	Missing security headers, verbose error messages, cookie flags, clickjacking without sensitive actions
info	Technology disclosure, attack surface notes, deprecated TLS, version numbers
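The five levels form an ordered scale, which makes comparison and sorting straightforward. The following is a minimal sketch, assuming a hypothetical `Severity` type and helpers (`atLeast`, `bySeverityDesc`); only the level names come from the table above, and none of these identifiers are confirmed to exist in Qodex.

```typescript
// Hypothetical names for illustration; only the five level names come from the table.
type Severity = "critical" | "high" | "medium" | "low" | "info";

// Higher rank means more severe.
const SEVERITY_RANK: Record<Severity, number> = {
  info: 0,
  low: 1,
  medium: 2,
  high: 3,
  critical: 4,
};

// True when `a` is at least as severe as `b`.
function atLeast(a: Severity, b: Severity): boolean {
  return SEVERITY_RANK[a] >= SEVERITY_RANK[b];
}

// Returns a sorted copy with the most severe items first.
function bySeverityDesc<T extends { severity: Severity }>(items: T[]): T[] {
  return items
    .slice()
    .sort((x, y) => SEVERITY_RANK[y.severity] - SEVERITY_RANK[x.severity]);
}
```

A check such as `atLeast(finding.severity, "high")` is one way to express "high or critical" without listing both levels.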

Failure classification

Every failed run goes through src/scanner/failure-analyzer.ts. The classifier reads the failed script, the error and stack, the page screenshot, the DOM snapshot, the original scenario, and the HTTP response. It emits one of three classifications:

Class	Meaning	Action
REAL_BUG	The app broke	Open a finding with severity, evidence, repro
STALE_TEST	Selectors or expectations no longer match	Mark scenario stale, suggest a fix
ENVIRONMENT_ISSUE	Target down, 503, DNS failure	Report as env, not bug

This classifier keeps regression suites usable. Without it, flaky selectors and temporary outages would look like product bugs.
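The mapping from classification to action in the table above can be sketched as a small exhaustive switch. This is an illustration only: the `FailureClass` values come from the table, while the `Action` type and `actionFor` function are hypothetical and do not describe the actual contents of failure-analyzer.ts.

```typescript
// The three class names come from the table; everything else is hypothetical.
type FailureClass = "REAL_BUG" | "STALE_TEST" | "ENVIRONMENT_ISSUE";

type Action =
  | { kind: "open_finding" } // file a finding with severity, evidence, repro
  | { kind: "mark_scenario_stale" } // flag the scenario and suggest a fix
  | { kind: "report_environment" }; // surface as an environment problem

// Exhaustive over FailureClass: adding a new class without a case
// becomes a compile-time error under strict settings.
function actionFor(cls: FailureClass): Action {
  switch (cls) {
    case "REAL_BUG":
      return { kind: "open_finding" };
    case "STALE_TEST":
      return { kind: "mark_scenario_stale" };
    case "ENVIRONMENT_ISSUE":
      return { kind: "report_environment" };
  }
}
```

Only the `REAL_BUG` branch leads to a finding; the other two branches keep the findings list free of test-maintenance and infrastructure noise.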

Deduplication

Dedup happens inline inside finding_report. Before persisting a new finding, the tool computes a fingerprint from the affected endpoint or page, an error signature, severity, and category. If a matching open finding exists, the new occurrence is recorded as a re-observation on the existing row rather than a duplicate. The matching logic lives in findOpenByFingerprint and recordFindingReobservation.
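The fingerprint-then-match flow described above can be sketched as follows. This is a simplified illustration under stated assumptions: the in-memory array stands in for the real store, `fingerprintOf` and `reportFinding` are hypothetical names, and the fingerprint format (a JSON-encoded tuple) is an assumption. The real findOpenByFingerprint and recordFindingReobservation signatures are not shown in this page.

```typescript
// Hypothetical shapes; the four fingerprint inputs come from the text above.
interface FindingInput {
  target: string; // affected endpoint or page
  errorSignature: string;
  severity: string;
  category: string;
}

interface StoredFinding extends FindingInput {
  fingerprint: string;
  status: "open" | "fixed" | "false_positive" | "wontfix";
  occurrences: number;
}

// Assumption: a JSON-encoded tuple is used as the fingerprint so that
// field boundaries stay unambiguous. The real format may differ.
function fingerprintOf(f: FindingInput): string {
  return JSON.stringify([f.target, f.errorSignature, f.severity, f.category]);
}

// Records a re-observation on a matching open finding, or stores a new one.
function reportFinding(store: StoredFinding[], input: FindingInput): StoredFinding {
  const fingerprint = fingerprintOf(input);
  const existing = store.filter(
    (f) => f.status === "open" && f.fingerprint === fingerprint,
  )[0];
  if (existing) {
    existing.occurrences += 1; // same bug seen again, no duplicate row
    return existing;
  }
  const created: StoredFinding = {
    target: input.target,
    errorSignature: input.errorSignature,
    severity: input.severity,
    category: input.category,
    fingerprint,
    status: "open",
    occurrences: 1,
  };
  store.push(created);
  return created;
}
```

Because only open findings are matched, a bug that reappears after its earlier finding was closed produces a new row in this sketch.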

Evidence guard

The finding_report tool refuses to file a high or critical security finding unless evidence is present. Specifically, a recent browser_snapshot must follow a failed wait_visible or verify_* call. This guard prevents the agent from filing findings with inflated severity and no proof. The guard runs at report time; it does not write a persisted verified flag onto the finding row.
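One way to picture the guard is as a check over the ordered history of tool calls. The sketch below is an assumption-laden illustration: the `ToolCall` shape and the functions `hasEvidence` and `canFile` are hypothetical, and "recent" is interpreted here as "after the most recent failed check", which may not match the real rule.

```typescript
// Hypothetical record of one tool invocation, oldest first in the history.
interface ToolCall {
  tool: string; // e.g. "wait_visible", "verify_text", "browser_snapshot"
  ok: boolean; // whether the call succeeded
}

// wait_visible and any verify_* tool count as checks.
function isCheck(tool: string): boolean {
  return tool === "wait_visible" || tool.indexOf("verify_") === 0;
}

// Assumption: evidence exists when a successful browser_snapshot comes
// after the most recent failed check.
function hasEvidence(history: ToolCall[]): boolean {
  let lastFailedCheck = -1;
  let lastSnapshot = -1;
  history.forEach((call, i) => {
    if (isCheck(call.tool) && !call.ok) lastFailedCheck = i;
    if (call.tool === "browser_snapshot" && call.ok) lastSnapshot = i;
  });
  return lastFailedCheck >= 0 && lastSnapshot > lastFailedCheck;
}

// Only high and critical security findings are guarded.
function canFile(severity: string, isSecurity: boolean, history: ToolCall[]): boolean {
  const guarded = isSecurity && (severity === "high" || severity === "critical");
  return !guarded || hasEvidence(history);
}
```

In this sketch, a snapshot taken before the failed check does not count, since it cannot show the state that caused the failure.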

Status lifecycle

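Findings carry one of four statuses: open, fixed, false_positive, or wontfix. A finding starts as open and can move to any of the other three: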

open  ->  fixed
open  ->  false_positive
open  ->  wontfix

Status moves are recorded with the user who made the change. Triage happens in the Findings page in the web app, in chat, or via the API.
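The transitions above can be enforced with a small lookup table. This is a minimal sketch with hypothetical names (`ALLOWED_MOVES`, `StatusChange`, `moveStatus`); it assumes the three closed statuses are terminal because the lifecycle shown lists no moves out of them, which may not hold in the real product.

```typescript
type Status = "open" | "fixed" | "false_positive" | "wontfix";

// Assumption: closed statuses are terminal, since no moves out of them are listed.
const ALLOWED_MOVES: Record<Status, Status[]> = {
  open: ["fixed", "false_positive", "wontfix"],
  fixed: [],
  false_positive: [],
  wontfix: [],
};

// Hypothetical audit record: each move keeps the user who made it.
interface StatusChange {
  from: Status;
  to: Status;
  changedBy: string;
}

// Returns the audit record for a valid move, or throws for an invalid one.
function moveStatus(current: Status, next: Status, changedBy: string): StatusChange {
  if (ALLOWED_MOVES[current].indexOf(next) === -1) {
    throw new Error("Invalid status move: " + current + " -> " + next);
  }
  return { from: current, to: next, changedBy };
}
```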

What evidence includes

Every finding ships with:

The exact HTTP request that triggered the failure, redacted
The response that proves the vulnerability or bug
A screenshot (UI) or response snippet (API)
Reproduction steps a human can follow without the agent
For security findings, the OWASP category (for example, A01:2021 Broken Access Control)
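The list above can be read as a completeness checklist. The sketch below models it with a hypothetical `Evidence` interface and a `missingEvidence` helper; the field names are illustrative assumptions and do not describe the actual Qodex data model.

```typescript
// Hypothetical shape; field names are assumptions for illustration.
interface Evidence {
  surface: "ui" | "api";
  request: string; // the triggering HTTP request, already redacted
  response: string; // the response that proves the issue
  screenshotPath?: string; // expected for UI findings
  responseSnippet?: string; // expected for API findings
  reproSteps: string[]; // steps a human can follow without the agent
  isSecurity: boolean;
  owaspCategory?: string; // e.g. "A01:2021 Broken Access Control"
}

// Returns the names of the required items that are absent or empty.
function missingEvidence(e: Evidence): string[] {
  const missing: string[] = [];
  if (!e.request) missing.push("request");
  if (!e.response) missing.push("response");
  if (e.surface === "ui" && !e.screenshotPath) missing.push("screenshotPath");
  if (e.surface === "api" && !e.responseSnippet) missing.push("responseSnippet");
  if (e.reproSteps.length === 0) missing.push("reproSteps");
  if (e.isSecurity && !e.owaspCategory) missing.push("owaspCategory");
  return missing;
}
```

An empty result means every item in the checklist is present for that finding's surface and type.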

When to use it

Promote any agent-classified REAL_BUG to a tracked finding for the team
File a finding when a security scenario fails. Security scenarios use inverted semantics: a pass means the attack was blocked, and a fail means the target is vulnerable
Triage findings in batch through the Findings page or via the API

When not to use it

STALE_TEST classifications. Those are scenario maintenance, not bugs. Use the scenario triage path instead.
ENVIRONMENT_ISSUE classifications. Surface those to the team that owns the environment.

On the roadmap

Planned: a persisted verified flag that records whether the agent re-ran the repro before reporting, in addition to today’s evidence guard. The verify tool will run a fresh execution of the failing scenario and write the result onto the finding before status moves to open.

Planned: flaky detection per scenario with a rolling 20-run window. When flakiness is above threshold, the classifier biases toward STALE_TEST or ENVIRONMENT_ISSUE instead of REAL_BUG. Held until the first customer crosses about three weeks of regular run volume. See backlog.md.
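Since this feature is planned rather than shipped, the following is only a sketch of what a rolling 20-run window could look like. The `RunWindow` class is hypothetical, and the flakiness measure used here (the fraction of adjacent runs whose outcome flips) is an assumption; the actual metric and threshold are not specified in this page.

```typescript
// Window size taken from the text; everything else is hypothetical.
const WINDOW_SIZE = 20;

class RunWindow {
  private results: boolean[] = [];

  // Records one run outcome and drops the oldest once the window is full.
  record(passed: boolean): void {
    this.results.push(passed);
    if (this.results.length > WINDOW_SIZE) this.results.shift();
  }

  size(): number {
    return this.results.length;
  }

  // Assumed metric: share of adjacent run pairs with different outcomes.
  // 0 means fully stable; 1 means the outcome alternates on every run.
  flipRate(): number {
    if (this.results.length < 2) return 0;
    let flips = 0;
    for (let i = 1; i < this.results.length; i++) {
      if (this.results[i] !== this.results[i - 1]) flips++;
    }
    return flips / (this.results.length - 1);
  }
}
```

A scenario that fails consistently has a flip rate of 0 in this sketch, so it would not be treated as flaky; only scenarios that alternate between pass and fail score high.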

Planned: pattern analysis across findings. Deterministic clustering by endpoint, page, error fingerprint, and severity. The LLM names the cluster; the clustering itself is code.
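To illustrate what "the clustering itself is code" could mean, here is a sketch of deterministic grouping by exact key. It is not the planned implementation: `ClusterInput`, `Cluster`, and `clusterFindings` are hypothetical names, and grouping by exact match on all four fields is an assumption.

```typescript
// Hypothetical input shape; the four grouping fields come from the text above.
interface ClusterInput {
  id: string;
  endpoint: string;
  page: string;
  errorFingerprint: string;
  severity: string;
}

interface Cluster {
  key: string; // JSON-encoded grouping tuple
  findingIds: string[];
}

// Groups findings that share all four fields. Keys and ids are sorted,
// so the same input set always yields the same output regardless of order.
function clusterFindings(findings: ClusterInput[]): Cluster[] {
  const groups: Record<string, string[]> = {};
  findings.forEach((f) => {
    const key = JSON.stringify([f.endpoint, f.page, f.errorFingerprint, f.severity]);
    const ids = groups[key] || [];
    ids.push(f.id);
    groups[key] = ids;
  });
  return Object.keys(groups)
    .sort()
    .map((key) => ({ key, findingIds: (groups[key] || []).slice().sort() }));
}
```

With the grouping fixed in code, a language model would only need to supply a human-readable name for each resulting cluster.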

Planned: Jira and Linear ticket creation from findings, and SARIF export for GitHub Code Scanning.

Related

Findings reference

The deeper reference and data model.

Failure classification

REAL_BUG vs STALE_TEST vs ENVIRONMENT_ISSUE in depth.

Triage workflow

The status lifecycle and evidence model.

Security testing

Where the inverted-semantics finding rule lives.
