Eyes → Brain: the handoff path¶
The anatomy¶
Eyes don't decide. In the human visual loop the retina perceives, the optic nerve carries the signal to the brain, the brain decides, a motor signal moves the hand, and then the eyes look again. Perception is the afferent (inbound) half; action is the efferent (outbound) half.
AgentVision is the afferent pathway for an agent. It gives an agent eyes — but it deliberately does not decide what to do, edit your code, or rewrite your prompt. It perceives the rendered artifact and hands a clean, structured signal back to the brain: whatever does the reasoning, planning, or memory in your stack.
AgentVision (the eyes) your agent (the brain)
┌──────────────────────────────┐ ┌────────────────────────────────┐
│ render → perceive → grade │ ───► │ decide → act (edit code / │
│ = Report ⟶ Handoff signal │ afferent│ refine prompt) = efferent │
└──────────────────────────────┘ └────────────────────────────────┘
▲ │
└────────────── look again ──────────────┘
This separation is the point: AgentVision stays provider-agnostic and brain-agnostic. Any reasoning layer, any memory system, any orchestrator can consume the same signal.
The signal: Handoff¶
A Report is the full sensory detail. A Handoff is that report distilled
into what a brain acts on directly — produced by report.to_handoff():
{
"perceived": "fail", // what the eyes concluded: pass | warn | fail
"next_action": "revise", // done | revise | review
"matches_intent": false, // null when no brief was given
"summary": "…",
"todo": [ // prioritized, actionable (defects first, then unmet musts)
"[overflow] hero text overflows its container on the right",
"[intent/must] a \"Checkout\" button is visible"
],
"open_questions": [ // what perception could NOT confirm — the brain should verify
"Verify: uses the brand's dark theme"
],
"artifact": "/…/render.png",
"backend": "anthropic",
"model": "claude-haiku-4-5"
}
Contract, in one line: next_action tells the brain whether to stop (done), act on
todo and look again (revise), or apply judgment (review). todo is the efferent
work-list. open_questions is what the eyes saw but couldn't decide — never silently
dropped.
Getting the signal¶
# CLI — any perception command can emit the handoff instead of the full report
agentvision analyze ./page.html --handoff
agentvision conform ./infographic.png --brief "…" --handoff
agentvision check ./page.html --handoff # offline, no key
# Library
from agentvision import analyze
report = await analyze("./page.html", brief=brief)
signal = report.to_handoff()
- MCP: the
perceive_handofftool returns the signal directly (for hosts that want the decision, not the full report). - REST:
POST /handoffreturns the signal + anartifact_id. - Loop: every
IterationResultcarries ahandoff, and each iteration writes ahandoff.jsonnext to itsreport.json.
Wiring it into a brain (provider-agnostic recipe)¶
The handoff is designed to drive a loop without the brain understanding AgentVision's internals:
from agentvision import analyze, NextAction
async def build_until_it_looks_right(make_or_edit, source, brief=None, max_steps=4):
for _ in range(max_steps):
signal = (await analyze(source, brief=brief)).to_handoff()
if signal.next_action == NextAction.DONE:
return True # eyes say it's right — safe to claim done
if signal.next_action == NextAction.REVIEW:
# uncertain/minor — let the brain decide (ask the user, accept, or push on)
...
# efferent step: the BRAIN acts on the work-list, then we look again
make_or_edit(todo=signal.todo, questions=signal.open_questions)
return False
Persisting to memory (any memory/brain system)¶
The Handoff is also the natural unit of visual episodic memory — one record per
look. Store it (it's plain JSON) keyed by artifact + iteration, and your reasoning layer can
recall "what did the eyes say last time, and did my fix actually resolve it?" AgentVision
emits the signal in a stable, self-describing shape (schema_version) precisely so any
external brain or memory store can own persistence and recall — AgentVision does not assume
or require a particular one. Verel is the reference implementation
of this pattern.
Reference brain: Verel¶
Verel is a full agent brain built on exactly this
split. It consumes AgentVision as its Eyes organ (verel.senses.sight): each Report is
mapped into a unified verdict bus (vision graded alongside tests, lint, and types), and
verified perceptions — including intent conformance (matches_intent and the
satisfied/total counts) — compound into its memory across iterations.
One design note worth copying: a full brain like Verel ingests the rich Report directly,
because it runs its own verdict gate and its own progressed/stuck detection — so it
deliberately ignores AgentVision's next_action (the brain keeps decision authority). The
Handoff is for simpler brains that want the decision pre-distilled. Both consume the same
eyes; pick the level that fits your brain.
Why the split matters (honesty)¶
- AgentVision will not pretend to be the brain. It reports
next_action: reviewfor uncertain/minor cases rather than forcing a decision it can't justify. open_questionssurfaces the limits of perception (low-confidence findings, unverifiable intent) so the brain can resolve them instead of inheriting a false PASS.- The signal is advisory where perception is advisory (vision-model findings) and trustworthy where it is grounded (DOM/OCR/CV) — the brain keeps final authority.
See also: The Loop · Conformance · Adapters · Integrations.