Skip to content

Verel — Architecture & Roadmap

Verel is an agent framework built on one idea: every agent action is a hypothesis, and nothing is "done" until a grader returns a verdict. A single verdict bus unifies every kind of check — vision, tests, lint, types — into one pass / warn / fail, so progress, "done", and what compounds into memory are all decided in one place.

Verel eval-driven loop

This document describes how the pieces fit together and where the project is going. For the exact module layout see the module guide; for the release history see the changelog.


The five organs

Verel system architecture

Organ Module Responsibility
Verdict bus verel.verdict One Report/Percept schema for every sense; gate() reduces them to a verdict.
Eyes / Senses verel.senses AgentVision as a grounded perception adapter, plus the percept log.
Brain verel.memory The trust layer over a memory backend: what is believed, how strongly, and what compounds.
Fleet verel.fleet Agents managing agents — manager fan-out, scheduler, isolated worktrees.
Tool-smith verel.toolsmith Agents building, testing, and registering their own tools.
Agent-run CI/CD verel.ci Graders, staged pipeline, self-healing, and verdict-driven rollback.

The verdict bus (verel.verdict)

Every grader emits a Report of Issues with a verdict. gate() reduces a set of reports to a single verdict under a few load-bearing rules:

  • Advisory ceiling. Per-issue trust keys off the issue source. Precise sources (DOM / CV / OCR / test / lint / typecheck) gate at full severity; advisory sources (vision / LLM-judge) are clamped to at most warn. An advisory opinion never gates a hard failure.
  • Grader attestation. A required grader must present a signed run_receipt proving it ran the frozen suite over the changed files. A hollow PASS, issues=[] with no receipt fails the gate — "present-but-empty" can't mint green.
  • Scrubbed fingerprints. Each issue gets a stable, normalized fingerprint (line numbers, addresses, timestamps, floats scrubbed) so the same logical failure hashes identically across runs — which is what makes stuck-detection reliable.
  • Stuck vs. progress. Progress is defined as strict shrinkage of the gating-failure set. Pure churn or growth is not progress; a new gating issue is a regression.

The Brain — memory that compounds (verel.memory)

Memory is state stored outside the model and selectively re-injected. Verel owns the trust layer over a (swappable) backend — LocalMemory (zero-dependency SQLite) or mem0 — behind a single MemoryView protocol.

Each record carries two orthogonal quantities, never collapsed into one: - epistemic_confidence — how true we believe it is. Moved only by corroboration (+) and contradiction (−). Retrieval never touches it. - retrieval_strength — how reachable it is. Decays with disuse, resets on recall. The decay is adaptive: a memory's effective half-life stretches with demonstrated usefulness (support_count + epistemic_confidence), so a corroborated rule outlives a one-off — tuning of reachability only, never of truth.

On top of that: - Interference rule — a new value for the same (subject, predicate, scope) supersedes rather than silently duplicating. - Consolidation — an offline pass clusters recurring failures (by kind, or by meaning with an embedder) and induces a candidate structured DesignRule (condition → action, applies_to). induce_hierarchy then climbs a multi-hop SCHEMA ladder — rules → order-2 principles → order-3 meta-principles — until the corpus stops supporting a higher level, so the top is the most general principle the evidence backs. consolidate_across_scopes lifts a pattern that recurs across several repos into a global rule (and refuses a single-repo quirk). All start inferred; height and breadth never confer trust. - Contradiction-driven revision (revise_with_counterexample) — consolidation can also be wrong. A new failure in a rule's domain that the rule failed to prevent is a counterexample: it's recorded (via annotate, no corroboration), the rule is contradicted, and once enough counterexamples accumulate the rule is split into a narrowed general rule (which supersedes the original) plus a specific exception rule — or, if belief collapses, rejected. A split then propagates up (propagate_revision): every SCHEMA that subsumed the rule is re-derived from its now-revised members and reset to candidate, climbing the hierarchy, so a corrected leaf never leaves an over-claiming principle above it. Revision only ever lowers trust or narrows scope. - Promotion gate — a candidate reaches verified only by passing a held-out, agent-inaccessible eval (with a leakage canary). Trust is earned, never asserted. - Failure ledger + regression guard — past gating failures are remembered; reintroducing a previously-fixed failure fails the gate from memory alone. - Scope lattice (self → team → org → global) — the spine of a shared brain. lattice_recall resolves down: an agent recalls across its own, its team's, and its org's knowledge at once, with the most-specific scope winning ties. graduate promotes up: a belief independently verified across sibling scopes becomes a parent-level candidate that must re-earn verified — collective knowledge no single agent decreed. Individual and collective memory are the same machinery at different radii of the lattice. - Hosted shared memory (MemoryServer / RemoteMemory) — for a fleet on different machines, a durable MemoryView behind a tiny HTTP service. RemoteMemory implements the same Protocol, so lattice_recall, graduate, consolidation, and the promotion gate all run against the shared brain unchanged. The server is the single writer — every access is lock-serialized, so the interference rule stays correct under concurrent agents. - Replicated, HA memory (ReplicatedMemory) — for no single point of failure, the store runs as a leader-fenced, fault-tolerant cluster. Exactly one node is leader (held by a fencing lease — the same monotonic-token primitive the fleet uses); the leader applies every mutation locally and replicates the resulting record state verbatim (so replication is idempotent) to its peers. Replication tolerates failure: an unreachable follower does not fail the write (it falls behind and catches up via sync_from), and a write_quorum sets how many nodes must hold a write to call it durable. A deposed leader is fenced (NotLeaderError on write, FencingError on a stale in-flight replicate) — no split-brain. Reads are served from any node's replica (eventual consistency). The lease store (the hosted control plane across machines) is the single source of fencing truth; ReplicaClient carries replication to follower MemoryServers over HTTP. A node that fell behind self-heals without an operator: the AntiEntropy reconciler periodically resolves the current leader (via the lease store's holder) and sync_froms it in the background. On-disk stores are crash-safe: LocalMemory opens WAL with synchronous=FULL, so a commit is fsync'd before the call returns — an acked write is durable before its replica is even sent, and survives a leader crash (durable=False relaxes to synchronous=NORMAL for speed where it's ok). Reads are local (eventual) by default — fast, may lag — or, with read_consistency='strong', routed to the current leader (the single writer, so it holds every committed write) for read-your-writes / linearizable-ish reads; it falls back to local if no leader can be resolved. Strong reads, though, fail when the leader is down. read_consistency='quorum' closes that gap: the leader stamps every mutation with a monotonic version (token * STRIDE + seq, so versions increase within a leader and across failovers), and a quorum get polls up to read_quorum replicas and returns the freshest copy — a point read that survives the leader being down, as long as a quorum of replicas hold the record. The same versions make replication reorder-/duplicate-safe: an older-version replicate never clobbers a newer copy (version_of). - Cross-agent trust — sharing a brain safely. import_belief applies the registry's "trust does not travel" rule to beliefs: a peer's claim enters as a candidate and only becomes verified by passing the importer's OWN check (its self-asserted confidence is ignored). AuthorTrust is a per-author reputation, stored in the brain itself: a contributor whose beliefs keep re-verifying earns a higher prior (their claims start more believed, surface sooner); a noisy one's falls. A fresh import's starting confidence is anchored to the author's reputation, not the peer's assertion — so a single bad actor can't move the collective. - The librarian (librarian_pass) — the gated maintenance cycle, the brain's "sleep." It orchestrates primitives that each earn their own trust: consolidate recurring failures into candidate rules, induce the schema hierarchy, graduate cross-scope beliefs up the lattice, and prune/decay what §5 allows. Steps that create knowledge write only candidate/inferred records (they still face the promotion gate); prune never touches a verified or pinned memory. So the brain compounds without rotting, and the librarian proposes and tidies — it never mints trust. Runs against any MemoryView, so it maintains the shared team brain too. - Recall — lexical by default; semantic (cosine) when an embedder is configured. - Lifecycle controlspinned memories ignore decay and are never pruned; volatile memories are kept only if corroborated/verified within a window; a hard ttl_s expires ephemeral environment facts; idle records are flagged stale; and supersedes keep a queryable correction chain instead of overwriting history.


The Fleet — agents managing agents (verel.fleet)

A control plane over agent execution:

  • Manager decomposes a goal into a fan-out of independent subtasks (LLM-driven, with the plane validating and clamping the decision — and falling back safely on bad output).
  • Scheduler — runs a Task DAG with barrier policies (all / k_of_n / optional), a concurrency cap, retry → quarantine, a hard budget lease, and WAL-based crash resume. Every node is gated by the verdict bus, so a worker can't self-declare done.
  • Concurrent managers — more than one scheduler can share a task store safely via fencing leases (lease.py): a lease carries a monotonic token, taking over an expired lease bumps it, and every terminal write is fenced — a stale leader whose token is no longer current is rejected, not allowed to corrupt shared state. Peers adopt each other's recorded outcomes, so each task runs exactly once. Backends: in-memory (one process) or sqlite (BEGIN IMMEDIATE, cross-process).
  • Git fencing sink — fencing isn't only in the task store: a pre-receive hook on the remote (fence_sink.py) refuses a push whose token isn't current, so a paused leader can't push stale code over a successor's. The pusher passes (resource, token) as git push options; the hook checks them against the same sqlite store.
  • Multi-repo coordinationplan_multi_repo namespaces per-repo tasks and adds cross-repo edges into one DAG, validated acyclic (a cross-repo cycle is rejected up front, never deadlocked). One fenced scheduler then enforces "ship the client only after the API builds".
  • Cross-repo atomic sagas — a change spanning repos commits as a saga (saga.py): each step has a forward action and a compensation, and a failure runs the compensations of the already-committed steps in reverse (a safe git revert, never a reset) — all-or-nothing.
  • Hosted control plane — for managers on different machines (no shared filesystem), the lease authority is wrapped in a tiny, dependency-free HTTP service (control_plane.py). The server is the clock authority (so skewed manager clocks can't disagree about expiry); a RemoteLeaseStore client speaks the same LeaseStore Protocol, so Scheduler(leases=RemoteLeaseStore(url)) coordinates cross-machine unchanged. Terminal writes are still fenced (a stale complete is a 409); an optional bearer token gates access.
  • Worktrees — each worker runs in its own isolated git worktree with an exclusive advisory lease, so parallel workers never stomp each other.

Tool-smith — agents build their own tools (verel.toolsmith)

Lifecycle: detect → scaffold → test → register → reuse. A capability request first tries reuse (semantic when an embedder is present); if missing, an LLM scaffolds a function, it is tested against held-out cases, and it is admitted to procedural memory only on a passing, attested eval. Read-only/idempotent tools auto-verify; destructive tools require a human-review verdict. Tool code is content-signed and executed under isolation (isolation="container" uses a bwrap namespace sandbox — no network, read-only fs — plus an optional seccomp-bpf syscall filter via verel[container], in three profiles: a default denylist; a default-deny allowlist jail (no network/subprocess/threads); and a per-tool capability jail that allows only the syscalls a tool exercised while passing its held-out eval, learned via strace and frozen onto the tool — so a verified tool that later attempts a new syscall is refused at the kernel).


Agent-run CI/CD (verel.ci)

Tests, lint, and types are first-class senses on the same bus — across Python, JS/TS, and Go (language= on each stage), plus perf and security senses. A GraderSpec carries its own parser, so pytest, go test -json, and a TAP runner — all GraderKind.TEST — parse by their own format while sharing one schema, one gate, and one stuck/progress signal. The staged pipeline:

Stage What runs
inner-loop lint / typecheck / fast unit on the working tree (per language)
pre-commit unit + affected tests + a failure-memory regression check
pre-merge full suite + lint + types, optionally security (SAST/audit) and a perf budget
post-merge / canary smoke/E2E; on a precise-evidence failure, an automated rollback

Perf and security are precise graders: a perf regression past an explicit budget, or a HIGH/CRITICAL security finding, gates (and can drive rollback) — sub-threshold findings only advise. Language toolchains live in verel.ci.LANGS; adding a runtime is one LangToolchain entry.

  • Self-healing — on failure the ci-medic classifies each issue (retry / regen-lockfile / quarantine-flaky / fix-branch) and, for genuine regressions, invokes the code-fixer agent, re-gating every round until the graders pass or it escalates.
  • Rollback policy engine — the agent proposes, a deterministic engine authorizes (only on precise gating evidence) and performs a safe, non-destructive git revert. A destructive action never depends on advisory evidence.

Surfaces

  • Library (import verel) · CLI (verel doctor|loop|fleet|heal|ci) · CI CLI / git hook (verel-ci, python -m verel.ci) · MCP server (verel-mcp).

Default LLM provider is Ollama Cloud; OpenAI is the bundled fallback, and the provider seam in agents/llm.py makes others (e.g. Claude) a small addition.


Roadmap

Done (all five organs, end-to-end): verdict bus with attestation; AgentVision sight adapter; the memory trust layer with consolidation + promotion gate (LocalMemory and mem0); semantic recall; the fleet (manager + scheduler + worktrees); the tool-smith with subprocess and container isolation; the full CI/CD stage table with self-healing and rollback; a content-addressed skill registry — now hosted over HTTP (RegistryServer/RemoteRegistry), with a cross-tenant transfer experiment that justified building it; CLI + MCP surfaces. The project is lint/type-clean, ships type information, and gates its own development through its own verdict bus in CI.

Next: - Broaden senses further — Rust/Java toolchains; richer perf harnesses; more SAST backends. - Consolidation: re-promote a revised schema automatically once its narrowed members re-verify (today a propagated schema returns to candidate and must earn verified again by hand). - Distributed hardening — replicate the control-plane store (today a single sqlite host is the authority); push-time identity (sign the push token to the fencing sink). - Skill-registry curation — reputation/provenance ranking now that the registry is hosted (RegistryServer); the two-model H2 sweep (88–89% transfer) justified building it. - Seccomp profile portability across architectures (the learned policy is x86-64-derived today).


Honest limits

  • The in-process tool guard is a guardrail, not a sandbox; real isolation is the container runner — namespace isolation (no network, read-only fs) plus a seccomp-bpf syscall filter in three profiles: a default denylist, a default-deny allowlist jail (no network/subprocess/ threads), and a per-tool capability jail allowing only the syscalls a tool earned while verified.
  • Advisory (vision/LLM) findings inform but never gate destructive actions.
  • Vision-model bounding boxes are advisory, not pixel-accurate; LLM outputs are not deterministic. Verel is explicit about which signals are precise and which are advisory.