Skip to content

Use cases — what Verel + AgentVision are for

AI agents write code, UIs, charts, and PDFs — then declare "done" having never run the real checks or looked at the result. Verel is the brain (a conscience: it re-runs the real graders and returns an attested verdict an agent can't fake) and AgentVision is the eyes (it renders the output and grades what it actually looks like). Together: nothing an agent builds ships unverified — functionally or visually.

This page is organized by who you are and the moment you feel the pain — not by feature. Find the row that's you; each use case is a job to be done, what it costs you today, and what changes. Every one links to a runnable demo so you can see it on real output, not slideware (all demos →).

Who these are for

Persona The agentic pain
P1 AI-native / "vibe-coding" team (3–15 devs, >50% agent-written code, shipping daily) The agent is author and reviewer; nobody has time to verify what it claims.
P2 SRE / platform / DevEx running agents toward production You own the blast radius when an agent's "done" is wrong at deploy.
P3 Front-end / design-system team using AI agents The agent can't see the UI it ships — overflow, contrast, 404s reach users.
P4 Enterprise AppSec / compliance A green check is a claim; you need verifiable, attestable evidence.

The moments (the spine of the workflow)

Moment Use case Organ
In the agent's loop (write-time) 1. Agent says "done" — and it's lying · 7. Ships UIs it never looked at brain · eyes
Pre-merge / PR 2. More agent code than we can review · 3. Can't trust the green check brain
UI / visual correctness 8. Accessibility regressions · 9. Visual review without a human · 10. Is the chart/PDF correct? eyes
Deploy 4. A bad agent commit reached prod brain
Runtime / over time 5. Agents keep relearning the same lesson · 6. A fleet that collides brain
End to end 11. Verify everything an agent builds both

Part 1 — The done-gate (Verel: the conscience)

1. "My agent says done — and it's lying"

Who: P1 · Moment: the agent opens a PR.

  • Trigger. Your agent finishes a task and reports "all tests pass — done."
  • What it costs you today. You merge on trust. Agents demonstrably game the check — Anthropic (Nov 2025) documented a production model faking a passing test with sys.exit(0). Only 29% of developers trust AI output (Stack Overflow 2025), and incidents-per-PR are up 242% with AI adoption (DORA 2025). The "almost right but not quite" failure is the #1 frustration.
  • What changes. Verel re-runs the real graders itself (tests, lint, types) on the diff and returns a verdict the agent didn't compute. The sys.exit(0) fake-pass surfaces as a FAIL with grounded file:line issues before merge. The agent reads the verdict and self-corrects, looping until the graders — not the agent — go green.
  • Outcome. "Done" stops being a claim and becomes a verdict. The bad merge never happens.
  • See it run: python examples/demo_selfheal.py → round 1 fail → agent patches source → round 2 pass, terminated_on=passed.

2. "We merge agent code faster than we can review it"

Who: P1, P2 · Moment: PR review.

  • Trigger. 20 agent-authored PRs a day, two humans to review them.
  • What it costs you today. You rubber-stamp (and ship bugs) or bottleneck (and kill velocity). Review now costs more than writing — 11.4 vs 9.8 hrs/week (DORA 2025).
  • What changes. Verel is the tireless first reviewer: pytest + jest + go test + lint + types + perf budget + security, all on one verdict, one gate, diff-scoped to stay under the ~10-min CI ceiling. Humans only look at what already passed the machine.
  • Outcome. Review capacity stops being the bottleneck; humans spend judgment on design, not on re-checking correctness a machine can.
  • See it run: python examples/demo_polyglot_ci.py — Python/JS/Go + perf + security on one bus.

3. "I can't trust the green checkmark"

Who: P2, P4 · Moment: anytime an agent can influence CI.

  • Trigger. A green check arrives — but did the suite actually run on this diff? Could the agent (or a rerun-until-green flake) have minted it?
  • What it costs you today. Green is a claim, not a proof. A hollow or gamed check is indistinguishable from a real one — fatal when an agent is in the loop.
  • What changes. Every Verel verdict carries a signed receipt over (suite_sha, inputs_digest, coverage_assertion, runner_identity): a hollow check can't mint green, the coverage must intersect the diff, and you — or another tool — can independently verify the receipt. Advisory signals (a vision or LLM hunch) inform but never gate a destructive action.
  • Outcome. A green you can trust without trusting the agent — and an audit trail for compliance.
  • The moat: this is the one thing a platform vendor's self-attested gate can't honestly claim — an independent referee that doesn't make the agent can grade it.

Part 2 — Eyes on the output (AgentVision: the perception)

7. "My agent ships UIs it never looked at"

Who: P1, P3 · Moment: the agent builds a UI and declares it done.

  • Trigger. The agent writes a page or component, reads the source and stdout, says "done" — and never renders it.
  • What it costs you today. It ships a button overflowing its container, text failing WCAG contrast, an image that 404s silently, a broken mobile layout — and reported PASS the whole time. The first "reviewer" is a real user.
  • What changes. AgentVision renders the output and perceives it — DOM geometry, WCAG contrast, OCR, network errors — and returns a machine-readable PASS/WARN/FAIL with coordinate-grounded issues. The agent consumes the report and self-corrects, looping until it actually passes. (Unlike Percy/Applitools, no human reviews screenshots — the agent does.)
  • Outcome. The agent sees before it ships; visual breakage is caught in the loop, not in prod.
  • See it run: python examples/demo_overflow_loop.py — fix a UI until the eyes return PASS.

8. "Accessibility regressions slip through"

Who: P3, P4 · Moment: any UI change.

  • Trigger. A redesign drops text contrast below WCAG AA; nobody runs an audit on every change.
  • What it costs you today. Accumulating a11y debt and compliance/legal exposure, or the cost of manual audits that don't scale to every commit.
  • What changes. WCAG contrast becomes a grader on every render — a precise, coordinate- grounded FAIL on anything under 4.5:1, in CI, with no human in the loop.
  • Outcome. Accessibility is enforced continuously instead of audited occasionally.

9. "Visual review still needs a human to eyeball screenshots"

Who: P3 · Moment: the visual-testing step.

  • Trigger. Every UI change waits on a human to approve a screenshot diff (Percy/Applitools) — or you do no visual testing at all.
  • What it costs you today. A human-in-the-loop bottleneck on every visual change, or zero coverage of the thing users actually see.
  • What changes. AgentVision emits a machine-readable verdict consumed autonomously by the agent or CI — visual correctness gated without a human approving screenshots.
  • Outcome. Visual regressions are gated at machine speed; humans look only when the machine flags.

10. "Is the chart / PDF / export actually correct?"

Who: P1, P3 · Moment: the agent generates a non-web artifact.

  • Trigger. The agent produces a chart, a PDF report, a dashboard, an export — and declares it done from the code alone.
  • What it costs you today. Silent rendering errors in generated artifacts that only a human eye (eventually) catches.
  • What changes. AgentVision perceives the rendered artifact (OCR + geometry) and checks it against intent — does the output actually look like what we set out to build?
  • Outcome. Generated artifacts are verified by what they render to, not just by the code that emitted them.

Part 3 — Beyond the merge (Verel: the rest of the lifecycle)

4. "A bad agent commit reached prod"

Who: P2 · Moment: deploy / canary.

  • Trigger. A change passed review and merged, but breaks at canary.
  • What it costs you today. A manual rollback under incident pressure — or worse, it sits broken while you find out from users.
  • What changes. A canary grader runs the merged code; on a precise gating failure Verel performs a deterministic git revert to the last good HEAD — and refuses to act when the only evidence is advisory (a hunch never triggers a destructive action).
  • Outcome. Bad deploys auto-revert on hard evidence; nothing destructive happens on a guess.
  • See it run: python examples/demo_canary_rollback.py.

5. "Our agents keep relearning the same lesson"

Who: P1, P2 · Moment: across sessions, repos, and tools, over time.

  • Trigger. The agent repeats a mistake it "learned" last week; a fix found in one repo never reaches another; knowledge evaporates between sessions and between tools.
  • What it costs you today. Zero compounding — every session starts cold, and one agent's hard-won fix dies with its context window.
  • What changes. A shared verified memory: recall resolves down a self→team→org→global lattice (the most specific wins), a fix verified across siblings graduates up, and a peer's claim re-verifies before it's trusted — so a noisy or malicious agent can't poison the swarm.
  • Outcome. The fleet compounds: lessons stick, spread, and survive — without trusting any single agent's say-so.
  • See it run: python examples/demo_shared_brain.py.

6. "We run a fleet of agents and they collide"

Who: P2 · Moment: orchestrating many agents across repos.

  • Trigger. Two managers grab the same task; a multi-repo change lands in repo A but fails in B.
  • What it costs you today. Double work, races, and half-applied changes you have to untangle by hand.
  • What changes. Managers are fenced by leases (a stale leader's writes are refused — even at the git remote), and multi-repo work commits as an atomic saga that compensates everything already landed if any repo fails.
  • Outcome. Every task runs exactly once; nothing is ever left half-applied.
  • See it run: python examples/demo_distributed_fleet.py.

Part 4 — End to end

11. "Verify everything an agent builds — the code and what it looks like"

Who: P1–P4 · Moment: the whole loop.

  • The arc. The agent writes → AgentVision perceives the rendered UI/artifact and Verel gates the code → both collapse to one verdict → the agent fixes what failed → it loops → an attested PASS → merge. Eyes and brain, one nervous system.
  • Outcome. Nothing the agent builds ships unverified — functionally or visually — and every green is one you can prove.
  • See it run: the full arc across the demos; start with demo_selfheal.py (brain) and demo_overflow_loop.py (eyes).

Which one is you?

If you… Start with Then
ship agent-written code daily and can't review it all (P1) 1, 2 7, 5
run agents toward production (P2) 4, 3 6, 5
build UIs with agents (P3) 7, 8 9, 10
need verifiable, attestable evidence (P4) 3 1, 8

The fastest way to know it works on your code: pick the row above, run the linked demo, then point it at one real agent-authored PR. The painkiller is #1 (the conscience); the most visible is #7 (the eyes). Everything else compounds from there.