Resume from cache
Problem
A long-running experience with 10+ nodes fails on node 7 after 40 minutes of LLM calls. You fix the bug and want to continue from node 7 — not replay nodes 1-6.
Solution
# experience.yaml: nothing special needed; caching is on by default
state:
  schema:
    pr_url: { type: string }
    diff: { type: string }
    dimensions: { type: array, items: { type: object }, merge: set_once }
    findings: { type: array, items: { type: object }, merge: array_append }
    risk_score: { type: number }
    gate_result: { type: object }
graph:
  nodes:
    - id: fetch_diff
      kind: tool
      impl: ./tools/fetch_diff.mjs
      reads: [pr_url]
      writes: [diff]
    - id: list_dimensions
      kind: tool
      impl: ./tools/list_dimensions.mjs
      writes: [dimensions]
    - id: investigate
      kind: agent
      prompt: ./prompts/investigate.md
      reads: [diff]
      for_each:
        source: $.dimensions
        concurrency: 4
      schema:
        type: object
        properties:
          findings:
            type: array
            items: { type: object }
        required: [findings]
      writes: [findings]
    - id: score
      kind: tool
      impl: ./tools/score.mjs
      reads: [findings]
      writes: [risk_score]
    - id: gate
      kind: agent
      prompt: ./prompts/gate.md
      reads: [risk_score, findings]
      schema:
        type: object
        properties:
          gate_result: { type: object }
        required: [gate_result]
      writes: [gate_result]
  edges:
    - { from: fetch_diff, to: list_dimensions }
    - { from: list_dimensions, to: investigate }
    - { from: investigate, to: score }
    - { from: score, to: gate }

# First run: fails at the `gate` node
oe run experience.yaml --input pr_url=https://github.com/org/repo/pull/42
# → run-id: run_abc123
# Check what happened
oe inspect run_abc123
# Fix the prompt, then resume from the last checkpoint
oe resume run_abc123
Walkthrough
Every completed node writes its output to .openexpertise/runs/<run-id>/state.sqlite. The cache key for each node is a deterministic hash of: the node's YAML definition (id, kind, prompt path, schema, args), the content hash of any prompt files or tool .mjs files it references, and the values of its reads fields from state at dispatch time.
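The three inputs to the cache key can be illustrated with a minimal Python sketch. This is not the tool's actual implementation: the function name `cache_key`, the use of SHA-256, and the canonical-JSON encoding are all assumptions made for illustration. It shows only that the same inputs produce the same key, and that editing a referenced prompt file changes it.

```python
import hashlib
import json


def cache_key(node_def: dict, file_contents: dict[str, bytes], reads: dict) -> str:
    """Hypothetical helper: fold the three inputs into one deterministic digest."""
    # Hash each referenced prompt/tool file by content, not by path or mtime.
    file_hashes = {
        path: hashlib.sha256(data).hexdigest()
        for path, data in file_contents.items()
    }
    # sort_keys gives a stable encoding regardless of dict insertion order.
    payload = json.dumps(
        {"node": node_def, "files": file_hashes, "reads": reads},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


gate = {"id": "gate", "kind": "agent", "prompt": "./prompts/gate.md"}
reads = {"risk_score": 0.7, "findings": []}

before = cache_key(gate, {"./prompts/gate.md": b"old prompt"}, reads)
again = cache_key(gate, {"./prompts/gate.md": b"old prompt"}, reads)
after = cache_key(gate, {"./prompts/gate.md": b"fixed prompt"}, reads)
```

Here `before` and `again` are identical, while `after` differs because the prompt content changed.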
When oe resume <run-id> is called, it reads the existing state from the run directory. Nodes whose cache key matches a previously-completed entry are skipped; their state delta is re-applied from the cache. Nodes that failed or were never reached are executed fresh.
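The skip-or-execute decision described above might look roughly like the following Python sketch. Everything here is hypothetical: `key_for` is a stand-in for the real cache key, the `version` field stands in for the content hash of a node's files, and `state.update` ignores the merge strategies declared in the state schema.

```python
import hashlib
import json


def key_for(node_id: str, version: str, reads: dict) -> str:
    # Stand-in cache key; "version" represents the node's file content hashes.
    raw = json.dumps([node_id, version, reads], sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def resume(nodes, cache, state):
    """Walk nodes in order; reuse cached deltas and execute the rest."""
    executed = []
    for node in nodes:
        reads = {name: state.get(name) for name in node["reads"]}
        key = key_for(node["id"], node["version"], reads)
        if key in cache:
            delta = cache[key]          # cache hit: re-apply the stored delta
        else:
            delta = node["run"](reads)  # cache miss: execute the node fresh
            cache[key] = delta
            executed.append(node["id"])
        state.update(delta)             # simplified: no merge strategies
    return executed


nodes = [
    {"id": "score", "version": "a", "reads": ["findings"],
     "run": lambda r: {"risk_score": len(r["findings"])}},
    {"id": "gate", "version": "a", "reads": ["risk_score"],
     "run": lambda r: {"gate_result": {"pass": r["risk_score"] < 3}}},
]
cache = {}
first = resume(nodes, cache, {"findings": ["x", "y"]})
nodes[1]["version"] = "b"  # simulate editing the gate prompt
second = resume(nodes, cache, {"findings": ["x", "y"]})
```

In this sketch the first pass executes both nodes, and the second pass executes only `gate`, because the key for `score` is unchanged.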
This means: if you fix the gate prompt file, only gate re-runs. The investigate iterations (one per entry in dimensions) and the LLM calls they made are reused. If you change the investigate prompt, all investigate iterations re-run, plus score and gate (because they depend on findings, which changed).
Cache invalidation happens automatically: changing any file that feeds into a node's cache key busts that node and all of its transitive successors.
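To make "transitive successors" concrete, here is a small Python sketch that walks the edges of the example graph breadth-first. The function name `invalidated` is hypothetical and the traversal is an assumption about how such a set could be computed, not a description of the tool's internals.

```python
from collections import deque

# Edges copied from the example experience.yaml.
EDGES = [
    ("fetch_diff", "list_dimensions"),
    ("list_dimensions", "investigate"),
    ("investigate", "score"),
    ("score", "gate"),
]


def invalidated(changed: str, edges) -> list[str]:
    """Return the changed node plus every node reachable from it."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    seen = [changed]
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for nxt in successors.get(current, []):
            if nxt not in seen:  # guard against revisiting in diamond-shaped graphs
                seen.append(nxt)
                queue.append(nxt)
    return seen


after_investigate_edit = invalidated("investigate", EDGES)
after_gate_edit = invalidated("gate", EDGES)
```

With these edges, editing `investigate` invalidates `investigate`, `score`, and `gate`, while editing `gate` invalidates only `gate`.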
To force a full re-run ignoring the cache: oe run experience.yaml --no-cache.
To re-run only a specific node: oe resume <run-id> --from score (re-runs score and everything downstream).
Variations
- Pin cache across runs: Add cache_key_extra: v2 to a node definition to namespace the cache. Bumping it from v2 to v3 busts the node for new runs without invalidating old run directories.
- Disable caching for a node: Set cache: false on a node that must never be memoized (e.g., a fetch_current_time tool).
- Inspect cache hits: oe inspect <run-id> shows cached: true in the timeline for each node that was served from cache.