
Working memory for agents

This example implements Use Case 2.2 from the design overview (§2.2): a long-running research agent stores its intermediate findings as a sheet, and humans graduate candidates to verified.

The shipped sheet is examples/research-memory/.

The shape

contract.yaml columns:
column    type      writer
id        string    agent + human (PK)
query     string    agent + human
url       string    agent + human
title     string    agent + human
snippet   string    agent + human
status    string    human-only ("candidate" / "verified" / "rejected")
notes     string    human-only
domain    string    derived (python: host extraction from url)

The agent writes the first five columns. Humans gate status and notes. The python derivation extracts a normalized hostname from url so the dashboard can group findings by source without an LLM call.
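The shipped contract.yaml in examples/research-memory/ is the source of truth for this shape. The sketch below only illustrates how the write split could be expressed; the key names (columns, type, writers, derived) are assumptions, not folio's actual schema.

contract.yaml (sketch, not the shipped file)
columns:
  id:      { type: string, key: true, writers: [agent, human] }
  query:   { type: string, writers: [agent, human] }
  url:     { type: string, writers: [agent, human] }
  title:   { type: string, writers: [agent, human] }
  snippet: { type: string, writers: [agent, human] }
  status:  { type: string, writers: [human] }    # "candidate" / "verified" / "rejected"
  notes:   { type: string, writers: [human] }
  domain:  { type: string, derived: true }       # filled by the python derivation below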

Why deterministic enrichment matters

A research agent can produce hundreds of findings per session. The review interface needs to:

  • group by source (so you can spot a single domain dominating the list),
  • filter by status,
  • sort by recency.

All three are SQL-friendly given a domain column. So the derivation is a tiny python script:

scripts/url_to_domain.py
import json, sys
from urllib.parse import urlparse

# The derivation's declared inputs arrive as a JSON object in argv[2].
inputs = json.loads(sys.argv[2])

# Normalize to a bare hostname; tolerate a missing or empty url.
host = urlparse(inputs.get("url", "") or "").hostname or ""
if host.startswith("www."):
    host = host[4:]

# Whatever is printed becomes the derived `domain` value.
print(host)
derivations/domain.yaml
targets: [domain]
inputs: [url]
kind: python
script: url_to_domain
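To sanity-check the script outside folio, run it directly. The underscore below is only a placeholder for argv[1], which the script ignores (it reads the JSON inputs from argv[2]); what folio actually passes in that slot is up to folio.

Terminal window
python scripts/url_to_domain.py _ '{"url": "https://www.iea.org/reports"}'
iea.org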

Cache hit on every run unless the URL changes. No tokens, no API calls, no flaky tests.
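That determinism is what makes the caching trivial: the result depends only on the declared inputs, so it can be keyed and reused. A minimal sketch of such keying follows; it is an illustration, not folio's actual cache code.

import hashlib, json

def cache_key(script: str, inputs: dict) -> str:
    # Serialize deterministically so identical inputs always hash identically.
    payload = json.dumps({"script": script, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Same URL, same key: the stored domain is reused instead of recomputed.
assert cache_key("url_to_domain", {"url": "https://iea.org"}) == \
       cache_key("url_to_domain", {"url": "https://iea.org"})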

Walking through the sheet

Terminal window
folio validate examples/research-memory
folio materialize examples/research-memory --actor agent:demo
folio query examples/research-memory \
"SELECT domain, COUNT(*) AS n
FROM records
GROUP BY domain
ORDER BY n DESC"
[{"domain":"world-nuclear.org","n":1},
{"domain":"iea.org","n":1},
{"domain":"nature.com","n":1},
...]

Open the Viewer:

Terminal window
folio serve examples/research-memory --port 3000 --actor agent:human

Use the Records tab to:

  • skim findings by query (the column the agent grouped them under),
  • update status from candidate to verified / rejected,
  • add notes for findings that need follow-up.
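Because status and domain are plain columns, the same triage is scriptable outside the Viewer with folio query against the records table shown earlier, for example:

Terminal window
folio query examples/research-memory \
"SELECT domain, title, url
FROM records
WHERE status = 'candidate'
ORDER BY domain"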

A typical session

  1. Agent runs. The agent writes ~50 findings per query, each landing as status: "candidate" (a sample row is sketched after this list).
  2. folio materialize fills domain. The python derivation runs for new rows; existing rows cache-hit. Free.
  3. Human reviews. Skim the Viewer. Promote good findings, reject the obviously wrong ones, leave the genuinely uncertain at candidate for a deeper review later.
  4. Next agent run. The agent re-runs against new queries. Existing verified and rejected findings are not touched (the agent’s prompt tells it to skip rows it didn’t add).
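For concreteness, a single finding as the agent might record it in step 1, with only the agent-writable columns filled in. The values are invented for illustration, and the write path is whichever interface the agent is wired to use.

{
  "id": "f-0042",
  "query": "global nuclear capacity outlook",
  "url": "https://example.org/reports/outlook-2024",
  "title": "Outlook 2024: nuclear capacity",
  "snippet": "Capacity additions accelerate after 2030 in the stated-policies case."
}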

Extending

Natural additions:

  • An ai derivation that scores plausibility (relevance_score: number) per finding, a cheap pre-filter before human review (sketched after this list).
  • A second sheet research-projects/ keyed by query, with one row per active research thread, and a cross-sheet derivation that pulls the thread's priority into each finding.
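The scoring derivation could mirror derivations/domain.yaml above. The ai-specific keys below (prompt, plus whatever model selection folio supports) are guesses at the shape rather than the exact schema, and relevance_score would also need a derived column in the contract.

derivations/relevance.yaml (sketch)
targets: [relevance_score]
inputs: [query, title, snippet]
kind: ai
prompt: >
  Rate how relevant this finding (title + snippet) is to the research query,
  as a number from 0 to 1. Return only the number.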

See also