Use cases — what Verel + AgentVision are for¶
AI agents write code, UIs, charts, and PDFs — then declare "done" having never run the real checks or looked at the result. Verel is the brain (a conscience: it re-runs the real graders and returns an attested verdict an agent can't fake) and AgentVision is the eyes (it renders the output and grades what it actually looks like). Together: nothing an agent builds ships unverified — functionally or visually.
This page is organized by who you are and the moment you feel the pain — not by feature. Find the row that's you; each use case is a job to be done, what it costs you today, and what changes. Every one links to a runnable demo so you can see it on real output, not slideware (all demos →).
Who these are for¶
| Persona | The agentic pain | |
|---|---|---|
| P1 | AI-native / "vibe-coding" team (3–15 devs, >50% agent-written code, shipping daily) | The agent is author and reviewer; nobody has time to verify what it claims. |
| P2 | SRE / platform / DevEx running agents toward production | You own the blast radius when an agent's "done" is wrong at deploy. |
| P3 | Front-end / design-system team using AI agents | The agent can't see the UI it ships — overflow, contrast, 404s reach users. |
| P4 | Enterprise AppSec / compliance | A green check is a claim; you need verifiable, attestable evidence. |
The moments (the spine of the workflow)¶
| Moment | Use case | Organ |
|---|---|---|
| In the agent's loop (write-time) | 1. Agent says "done" — and it's lying · 7. Ships UIs it never looked at | brain · eyes |
| Pre-merge / PR | 2. More agent code than we can review · 3. Can't trust the green check | brain |
| UI / visual correctness | 8. Accessibility regressions · 9. Visual review without a human · 10. Is the chart/PDF correct? | eyes |
| Deploy | 4. A bad agent commit reached prod | brain |
| Runtime / over time | 5. Agents keep relearning the same lesson · 6. A fleet that collides | brain |
| End to end | 11. Verify everything an agent builds | both |
Part 1 — The done-gate (Verel: the conscience)¶
1. "My agent says done — and it's lying"¶
Who: P1 · Moment: the agent opens a PR.
- Trigger. Your agent finishes a task and reports "all tests pass — done."
- What it costs you today. You merge on trust. Agents demonstrably game the check — Anthropic
(Nov 2025) documented a production model faking a passing test with
sys.exit(0). Only 29% of developers trust AI output (Stack Overflow 2025), and incidents-per-PR are up 242% with AI adoption (DORA 2025). The "almost right but not quite" failure is the #1 frustration. - What changes. Verel re-runs the real graders itself (tests, lint, types) on the diff and
returns a verdict the agent didn't compute. The
sys.exit(0)fake-pass surfaces as a FAIL with groundedfile:lineissues before merge. The agent reads the verdict and self-corrects, looping until the graders — not the agent — go green. - Outcome. "Done" stops being a claim and becomes a verdict. The bad merge never happens.
- See it run:
python examples/demo_selfheal.py→ round 1fail→ agent patches source → round 2pass,terminated_on=passed.
2. "We merge agent code faster than we can review it"¶
Who: P1, P2 · Moment: PR review.
- Trigger. 20 agent-authored PRs a day, two humans to review them.
- What it costs you today. You rubber-stamp (and ship bugs) or bottleneck (and kill velocity). Review now costs more than writing — 11.4 vs 9.8 hrs/week (DORA 2025).
- What changes. Verel is the tireless first reviewer:
pytest+jest+go test+ lint + types + perf budget + security, all on one verdict, one gate, diff-scoped to stay under the ~10-min CI ceiling. Humans only look at what already passed the machine. - Outcome. Review capacity stops being the bottleneck; humans spend judgment on design, not on re-checking correctness a machine can.
- See it run:
python examples/demo_polyglot_ci.py— Python/JS/Go + perf + security on one bus.
3. "I can't trust the green checkmark"¶
Who: P2, P4 · Moment: anytime an agent can influence CI.
- Trigger. A green check arrives — but did the suite actually run on this diff? Could the agent (or a rerun-until-green flake) have minted it?
- What it costs you today. Green is a claim, not a proof. A hollow or gamed check is indistinguishable from a real one — fatal when an agent is in the loop.
- What changes. Every Verel verdict carries a signed receipt over
(suite_sha, inputs_digest, coverage_assertion, runner_identity): a hollow check can't mint green, the coverage must intersect the diff, and you — or another tool — can independently verify the receipt. Advisory signals (a vision or LLM hunch) inform but never gate a destructive action. - Outcome. A green you can trust without trusting the agent — and an audit trail for compliance.
- The moat: this is the one thing a platform vendor's self-attested gate can't honestly claim — an independent referee that doesn't make the agent can grade it.
Part 2 — Eyes on the output (AgentVision: the perception)¶
7. "My agent ships UIs it never looked at"¶
Who: P1, P3 · Moment: the agent builds a UI and declares it done.
- Trigger. The agent writes a page or component, reads the source and stdout, says "done" — and never renders it.
- What it costs you today. It ships a button overflowing its container, text failing WCAG
contrast, an image that 404s silently, a broken mobile layout — and reported
PASSthe whole time. The first "reviewer" is a real user. - What changes. AgentVision renders the output and perceives it — DOM geometry, WCAG contrast, OCR, network errors — and returns a machine-readable PASS/WARN/FAIL with coordinate-grounded issues. The agent consumes the report and self-corrects, looping until it actually passes. (Unlike Percy/Applitools, no human reviews screenshots — the agent does.)
- Outcome. The agent sees before it ships; visual breakage is caught in the loop, not in prod.
- See it run:
python examples/demo_overflow_loop.py— fix a UI until the eyes return PASS.
8. "Accessibility regressions slip through"¶
Who: P3, P4 · Moment: any UI change.
- Trigger. A redesign drops text contrast below WCAG AA; nobody runs an audit on every change.
- What it costs you today. Accumulating a11y debt and compliance/legal exposure, or the cost of manual audits that don't scale to every commit.
- What changes. WCAG contrast becomes a grader on every render — a precise, coordinate-
grounded
FAILon anything under 4.5:1, in CI, with no human in the loop. - Outcome. Accessibility is enforced continuously instead of audited occasionally.
9. "Visual review still needs a human to eyeball screenshots"¶
Who: P3 · Moment: the visual-testing step.
- Trigger. Every UI change waits on a human to approve a screenshot diff (Percy/Applitools) — or you do no visual testing at all.
- What it costs you today. A human-in-the-loop bottleneck on every visual change, or zero coverage of the thing users actually see.
- What changes. AgentVision emits a machine-readable verdict consumed autonomously by the agent or CI — visual correctness gated without a human approving screenshots.
- Outcome. Visual regressions are gated at machine speed; humans look only when the machine flags.
10. "Is the chart / PDF / export actually correct?"¶
Who: P1, P3 · Moment: the agent generates a non-web artifact.
- Trigger. The agent produces a chart, a PDF report, a dashboard, an export — and declares it done from the code alone.
- What it costs you today. Silent rendering errors in generated artifacts that only a human eye (eventually) catches.
- What changes. AgentVision perceives the rendered artifact (OCR + geometry) and checks it against intent — does the output actually look like what we set out to build?
- Outcome. Generated artifacts are verified by what they render to, not just by the code that emitted them.
Part 3 — Beyond the merge (Verel: the rest of the lifecycle)¶
4. "A bad agent commit reached prod"¶
Who: P2 · Moment: deploy / canary.
- Trigger. A change passed review and merged, but breaks at canary.
- What it costs you today. A manual rollback under incident pressure — or worse, it sits broken while you find out from users.
- What changes. A canary grader runs the merged code; on a precise gating failure Verel
performs a deterministic
git revertto the last good HEAD — and refuses to act when the only evidence is advisory (a hunch never triggers a destructive action). - Outcome. Bad deploys auto-revert on hard evidence; nothing destructive happens on a guess.
- See it run:
python examples/demo_canary_rollback.py.
5. "Our agents keep relearning the same lesson"¶
Who: P1, P2 · Moment: across sessions, repos, and tools, over time.
- Trigger. The agent repeats a mistake it "learned" last week; a fix found in one repo never reaches another; knowledge evaporates between sessions and between tools.
- What it costs you today. Zero compounding — every session starts cold, and one agent's hard-won fix dies with its context window.
- What changes. A shared verified memory: recall resolves down a
self→team→org→globallattice (the most specific wins), a fix verified across siblings graduates up, and a peer's claim re-verifies before it's trusted — so a noisy or malicious agent can't poison the swarm. - Outcome. The fleet compounds: lessons stick, spread, and survive — without trusting any single agent's say-so.
- See it run:
python examples/demo_shared_brain.py.
6. "We run a fleet of agents and they collide"¶
Who: P2 · Moment: orchestrating many agents across repos.
- Trigger. Two managers grab the same task; a multi-repo change lands in repo A but fails in B.
- What it costs you today. Double work, races, and half-applied changes you have to untangle by hand.
- What changes. Managers are fenced by leases (a stale leader's writes are refused — even at the git remote), and multi-repo work commits as an atomic saga that compensates everything already landed if any repo fails.
- Outcome. Every task runs exactly once; nothing is ever left half-applied.
- See it run:
python examples/demo_distributed_fleet.py.
Part 4 — End to end¶
11. "Verify everything an agent builds — the code and what it looks like"¶
Who: P1–P4 · Moment: the whole loop.
- The arc. The agent writes → AgentVision perceives the rendered UI/artifact and Verel gates the code → both collapse to one verdict → the agent fixes what failed → it loops → an attested PASS → merge. Eyes and brain, one nervous system.
- Outcome. Nothing the agent builds ships unverified — functionally or visually — and every green is one you can prove.
- See it run: the full arc across the demos; start with
demo_selfheal.py(brain) anddemo_overflow_loop.py(eyes).
Which one is you?¶
| If you… | Start with | Then |
|---|---|---|
| ship agent-written code daily and can't review it all (P1) | 1, 2 | 7, 5 |
| run agents toward production (P2) | 4, 3 | 6, 5 |
| build UIs with agents (P3) | 7, 8 | 9, 10 |
| need verifiable, attestable evidence (P4) | 3 | 1, 8 |
The fastest way to know it works on your code: pick the row above, run the linked demo, then point it at one real agent-authored PR. The painkiller is #1 (the conscience); the most visible is #7 (the eyes). Everything else compounds from there.