Findings

Findings are confirmed bugs, failures, or vulnerabilities that Qodex records with evidence. Not every failed run should become a bug report. Qodex first decides whether the failure is a real product issue, a stale test, or an environment problem.

What happens when a test fails

When a scenario fails or a security probe lands, Qodex does more than show a red X. It analyzes the failure, classifies it, and writes a finding only when the failure looks like a real issue. A finding includes severity, reproduction steps, evidence, and the affected endpoint or page. Qodex deduplicates findings against the existing set so the same bug does not pile up across nightly runs.

Severity model

Five levels, with explicit definitions enforced by the security skill:
Severity  What it means
critical  RCE, SQLi with data access, auth bypass to admin, SSRF to cloud metadata, exposed secrets
high      Stored XSS, IDOR with data exposure, CSRF on account actions, privilege escalation, broken access control
medium    Reflected XSS, CSRF on low-impact actions, info disclosure, missing rate limiting
low       Missing security headers, verbose error messages, cookie flags, clickjacking without sensitive actions
info      Technology disclosure, attack surface notes, deprecated TLS, version numbers
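
The five levels form an ordered scale, which can be sketched as a TypeScript union with a rank. This is a minimal illustration, not Qodex's own code: the helper names `severityRank`, `atLeast`, and `sortBySeverity` are hypothetical.

```typescript
// The five severity levels from the table, ordered least to most severe.
const SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"] as const;
type Severity = (typeof SEVERITY_ORDER)[number];

// Hypothetical helper: numeric rank so severities can be compared.
function severityRank(severity: Severity): number {
  return SEVERITY_ORDER.indexOf(severity);
}

// Hypothetical helper: true when `severity` is at or above `threshold`.
function atLeast(severity: Severity, threshold: Severity): boolean {
  return severityRank(severity) >= severityRank(threshold);
}

// Returns a copy of the list with the most severe items first.
function sortBySeverity<T extends { severity: Severity }>(items: T[]): T[] {
  return [...items].sort(
    (a, b) => severityRank(b.severity) - severityRank(a.severity),
  );
}
```

A check such as `atLeast(finding.severity, "high")` is one way to express "high or critical" without listing both values.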

Failure classification

Every failed run goes through src/scanner/failure-analyzer.ts. The classifier reads the failed script, the error and stack, the page screenshot, the DOM snapshot, the original scenario, and the HTTP response. It emits one of three classifications:
Class              Meaning                                    Action
REAL_BUG           The app broke                              Open a finding with severity, evidence, repro
STALE_TEST         Selectors or expectations no longer match  Mark scenario stale, suggest a fix
ENVIRONMENT_ISSUE  Target down, 503, DNS failure              Report as env, not bug
This classifier keeps regression suites usable. Without it, flaky selectors and temporary outages would look like product bugs.
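
The mapping from classification to follow-up action can be sketched as below. This is an illustration under assumptions, not the contents of src/scanner/failure-analyzer.ts: the `FailureSignals` shape, the `NextAction` names, and the `looksLikeEnvironmentIssue` heuristic are hypothetical, and the real classifier reads far more context than two fields.

```typescript
type FailureClass = "REAL_BUG" | "STALE_TEST" | "ENVIRONMENT_ISSUE";

// Hypothetical action names, one per row of the classification table.
type NextAction = "open_finding" | "mark_scenario_stale" | "report_environment";

// Hypothetical, heavily reduced view of what the classifier sees.
interface FailureSignals {
  httpStatus?: number;
  errorMessage: string;
}

// Illustrative heuristic only: a 503 or a Node.js network error code
// (DNS failure, refused connection, timeout) suggests the target is down.
function looksLikeEnvironmentIssue(signals: FailureSignals): boolean {
  const networkCodes = ["ENOTFOUND", "ECONNREFUSED", "ETIMEDOUT"];
  return (
    signals.httpStatus === 503 ||
    networkCodes.some((code) => signals.errorMessage.includes(code))
  );
}

// Maps each classification to the action described in the table.
function actionFor(cls: FailureClass): NextAction {
  switch (cls) {
    case "REAL_BUG":
      return "open_finding";
    case "STALE_TEST":
      return "mark_scenario_stale";
    case "ENVIRONMENT_ISSUE":
      return "report_environment";
    default: {
      // Compile-time exhaustiveness check: fails if a class is added
      // to FailureClass without a matching case.
      const unreachable: never = cls;
      throw new Error(`Unknown classification: ${String(unreachable)}`);
    }
  }
}
```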

Deduplication

Dedup happens inline inside finding_report. Before persisting a new finding, the tool computes a fingerprint from the affected endpoint or page, an error signature, severity, and category. If a matching open finding exists, the new occurrence is recorded as a re-observation on the existing row rather than a duplicate. The matching logic lives in findOpenByFingerprint and recordFindingReobservation.
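
The fingerprint-then-lookup flow can be sketched with an in-memory store. The function names findOpenByFingerprint and recordFindingReobservation come from the text above, but their signatures here are assumptions, as are `fingerprintOf`, `InMemoryFindingStore`, `reportFinding`, and the field names. The real fingerprint format is not documented on this page.

```typescript
type Severity = "info" | "low" | "medium" | "high" | "critical";
type FindingStatus = "open" | "fixed" | "false_positive" | "wontfix";

interface FindingInput {
  target: string; // affected endpoint or page
  errorSignature: string;
  severity: Severity;
  category: string;
}

interface StoredFinding extends FindingInput {
  fingerprint: string;
  status: FindingStatus;
  observations: number;
}

// Assumed fingerprint: a canonical JSON array of the four fields named above.
function fingerprintOf(input: FindingInput): string {
  return JSON.stringify([
    input.target.trim().toLowerCase(),
    input.errorSignature.trim(),
    input.severity,
    input.category.trim().toLowerCase(),
  ]);
}

// Hypothetical stand-in for the persistence layer.
class InMemoryFindingStore {
  readonly rows: StoredFinding[] = [];

  findOpenByFingerprint(fingerprint: string): StoredFinding | undefined {
    return this.rows.find(
      (row) => row.status === "open" && row.fingerprint === fingerprint,
    );
  }

  recordFindingReobservation(row: StoredFinding): void {
    row.observations += 1;
  }

  insert(input: FindingInput, fingerprint: string): StoredFinding {
    const row: StoredFinding = { ...input, fingerprint, status: "open", observations: 1 };
    this.rows.push(row);
    return row;
  }
}

// Re-observe a matching open finding, otherwise persist a new row.
function reportFinding(
  store: InMemoryFindingStore,
  input: FindingInput,
): { row: StoredFinding; deduplicated: boolean } {
  const fingerprint = fingerprintOf(input);
  const existing = store.findOpenByFingerprint(fingerprint);
  if (existing) {
    store.recordFindingReobservation(existing);
    return { row: existing, deduplicated: true };
  }
  return { row: store.insert(input, fingerprint), deduplicated: false };
}
```

Because only open findings match, a bug that reappears after its finding was closed produces a new row in this sketch rather than silently attaching to the closed one.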

Evidence guard

The finding_report tool refuses to file a high or critical security finding unless evidence is present. Specifically, a recent browser_snapshot must follow a failed wait_visible or verify_* call. This is the guard that prevents the agent from inventing severity-inflated findings without proof. The guard runs at report time. It does not write a persisted verified flag onto the finding row.
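
A report-time check of this kind might look like the sketch below. It is one strict reading of the rule, under stated assumptions: the `ToolCall` log shape is hypothetical, "recent" is taken to mean within the last `RECENT_CALLS` entries of the log, and the snapshot must come after the most recent failed check.

```typescript
type Severity = "info" | "low" | "medium" | "high" | "critical";

// Hypothetical record of one tool call made by the agent during a run.
interface ToolCall {
  tool: string; // e.g. "wait_visible", "verify_text", "browser_snapshot"
  ok: boolean;
}

// Assumption: "recent" means within the last 10 logged calls.
const RECENT_CALLS = 10;

function isCheck(tool: string): boolean {
  return tool === "wait_visible" || tool.startsWith("verify_");
}

// True when a successful browser_snapshot follows the latest failed check
// and is still within the recency window at report time.
function hasSnapshotEvidence(calls: ToolCall[]): boolean {
  let lastFailedCheck = -1;
  let lastSnapshot = -1;
  calls.forEach((call, index) => {
    if (isCheck(call.tool) && !call.ok) lastFailedCheck = index;
    if (call.tool === "browser_snapshot" && call.ok) lastSnapshot = index;
  });
  const followsFailure = lastFailedCheck >= 0 && lastSnapshot > lastFailedCheck;
  const isRecent = calls.length - lastSnapshot <= RECENT_CALLS;
  return followsFailure && isRecent;
}

// The guard only applies to high and critical security findings.
function canFile(severity: Severity, isSecurity: boolean, calls: ToolCall[]): boolean {
  const guarded = isSecurity && (severity === "high" || severity === "critical");
  return !guarded || hasSnapshotEvidence(calls);
}
```

The function returns a boolean and stores nothing, which matches the note above that the guard does not persist a verified flag.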

Status lifecycle

Findings carry one of four statuses: open, fixed, false_positive, or wontfix. The documented transitions all start from open:
open  ->  fixed
open  ->  false_positive
open  ->  wontfix
Status moves are recorded with the user who made the change. Triage happens in the Findings page in the web app, in chat, or via the API.
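
The lifecycle can be sketched as a small transition table with an audit entry per move. The `TrackedFinding` and `StatusChange` shapes and the `moveStatus` helper are hypothetical, and treating fixed, false_positive, and wontfix as terminal is an assumption based only on the three transitions listed above.

```typescript
type FindingStatus = "open" | "fixed" | "false_positive" | "wontfix";

// Hypothetical audit record: who moved the finding, and when.
interface StatusChange {
  from: FindingStatus;
  to: FindingStatus;
  changedBy: string;
  changedAt: string; // ISO 8601 timestamp
}

interface TrackedFinding {
  id: string;
  status: FindingStatus;
  changes: StatusChange[];
}

// Only the three documented moves are allowed; reopening is not modeled.
const ALLOWED_MOVES: Record<FindingStatus, readonly FindingStatus[]> = {
  open: ["fixed", "false_positive", "wontfix"],
  fixed: [],
  false_positive: [],
  wontfix: [],
};

// Returns an updated copy of the finding, or throws on a disallowed move.
function moveStatus(
  finding: TrackedFinding,
  to: FindingStatus,
  changedBy: string,
  now: Date = new Date(),
): TrackedFinding {
  if (!ALLOWED_MOVES[finding.status].includes(to)) {
    throw new Error(`Cannot move finding ${finding.id} from ${finding.status} to ${to}`);
  }
  const change: StatusChange = {
    from: finding.status,
    to,
    changedBy,
    changedAt: now.toISOString(),
  };
  return { ...finding, status: to, changes: [...finding.changes, change] };
}
```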

What evidence includes

Every finding ships with:
  • The exact HTTP request that triggered the failure, redacted
  • The response that proves the vulnerability or bug
  • A screenshot (UI) or response snippet (API)
  • Reproduction steps a human can follow without the agent
  • For security findings, the OWASP category (for example, A01:2021 Broken Access Control)
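
The evidence bundle listed above could be modeled roughly as follows. All type and field names are hypothetical, and the list of headers treated as sensitive is an assumption: this page does not say which parts of the request Qodex redacts.

```typescript
interface RecordedRequest {
  method: string;
  url: string;
  headers: Record<string, string>;
  body?: string;
}

interface RecordedResponse {
  status: number;
  headers: Record<string, string>;
  bodySnippet: string;
}

// One field per bullet in the evidence list.
interface FindingEvidence {
  request: RecordedRequest; // stored redacted
  response: RecordedResponse;
  screenshotPath?: string; // UI findings
  reproSteps: string[]; // followable without the agent
  owaspCategory?: string; // security findings, e.g. "A01:2021 Broken Access Control"
}

// Assumed set of credential-bearing headers (compared in lowercase).
const SENSITIVE_HEADERS = ["authorization", "cookie", "set-cookie", "x-api-key"];

// Returns a copy of the headers with sensitive values masked.
function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const redacted: Record<string, string> = {};
  for (const [key, value] of Object.entries(headers)) {
    redacted[key] = SENSITIVE_HEADERS.includes(key.toLowerCase()) ? "[REDACTED]" : value;
  }
  return redacted;
}

// Redacts the request headers before the request is attached as evidence.
function redactRequest(request: RecordedRequest): RecordedRequest {
  return { ...request, headers: redactHeaders(request.headers) };
}
```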

When to use it

  • Promote any agent-classified REAL_BUG to a tracked finding for the team
  • File a finding when a security scenario fails. Security scenarios use inverted semantics: a pass means the attack was blocked, and a fail means the target is vulnerable
  • Triage findings in batch through the Findings page or via the API

When not to use it

  • STALE_TEST classifications. Those are scenario maintenance, not bugs. Use the scenario triage path instead.
  • ENVIRONMENT_ISSUE classifications. Surface those to the team that owns the environment.

On the roadmap

Planned: a persisted verified flag that records whether the agent re-ran the repro before reporting, in addition to today’s evidence guard. The verify tool will run a fresh execution of the failing scenario and write the result onto the finding before status moves to open.
Planned: flaky detection per scenario with a rolling 20-run window. When flakiness is above threshold, the classifier biases toward STALE_TEST or ENVIRONMENT_ISSUE instead of REAL_BUG. Held until the first customer crosses about three weeks of regular run volume. See backlog.md.
Planned: pattern analysis across findings. Deterministic clustering by endpoint, page, error fingerprint, and severity. The LLM names the cluster; the clustering itself is code.
Planned: Jira and Linear ticket creation from findings, and SARIF export for GitHub Code Scanning.
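
The planned flaky detection over a rolling 20-run window could work along these lines. Everything here is speculative, since the feature is not built: flakiness is assumed to mean the fraction of adjacent runs whose pass/fail outcome flipped, and the 0.3 threshold and function names are hypothetical.

```typescript
// Window size from the roadmap item; the rest is an assumption.
const WINDOW_SIZE = 20;

// Fraction of adjacent run pairs in the last WINDOW_SIZE runs whose
// outcome changed (true = pass, false = fail), oldest run first.
function flipRate(outcomes: boolean[]): number {
  const recent = outcomes.slice(-WINDOW_SIZE);
  if (recent.length < 2) return 0;
  let flips = 0;
  for (let i = 1; i < recent.length; i++) {
    if (recent[i] !== recent[i - 1]) flips += 1;
  }
  return flips / (recent.length - 1);
}

// Hypothetical threshold check the classifier could consult before
// settling on REAL_BUG for a scenario with an unstable history.
function shouldBiasAwayFromRealBug(outcomes: boolean[], threshold = 0.3): boolean {
  return flipRate(outcomes) > threshold;
}
```

A scenario that fails once after a long run of passes has a low flip rate, so a genuine new regression would still be classified normally under this definition.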

Related pages

  • Findings reference: the deeper reference and data model.
  • Failure classification: REAL_BUG vs STALE_TEST vs ENVIRONMENT_ISSUE in depth.
  • Triage workflow: the status lifecycle and evidence model.
  • Security testing: where the inverted-semantics finding rule lives.