Retry with backoff
Problem
External API calls fail transiently. A tool node that hits GitHub's REST API, a database, or an internal service will occasionally return 5xx errors or time out. You want the run to succeed without manual intervention.
Solution
graph:
nodes:
- id: fetch_repo_metadata
kind: tool
impl: ./tools/github.mjs
args:
repo: 'org/name'
writes: [repo_metadata]
on_error:
policy: retry
max_attempts: 3
backoff_ms: 500 # linear: 500 ms, then 1000 ms, then 1500 ms
- id: fetch_coverage
kind: tool
impl: ./tools/coverage_api.mjs
writes: [coverage]
on_error:
policy: retry
max_attempts: 4
backoff_ms: 1000
backoff: exponential # 1000 ms, 2000 ms, 4000 ms
- id: optional_enrichment
kind: agent
prompt: ./prompts/enrich.md
reads: [repo_metadata]
writes: [enriched]
on_error:
policy: skip # log + continue; downstream nodes tolerate missing field
edges:
- { from: fetch_repo_metadata, to: fetch_coverage }
- { from: fetch_coverage, to: optional_enrichment }Walkthrough
on_error.policy: retry tells the runtime to re-execute the node on any thrown error or non-zero exit. The node's state delta from the failed attempt is discarded; the node runs again from scratch.
backoff_ms sets the base delay in milliseconds. With the default linear strategy, the wait before attempt n is backoff_ms * n — so 500 ms, 1 000 ms, 1 500 ms for three attempts. With backoff: exponential the wait is backoff_ms * 2^(n-1) — doubling each time.
After max_attempts all fail, the node enters the failed state and the error propagates normally (the run fails unless a parent node or the experience-level on_error catches it).
policy: skip is the other common choice: the node is marked skipped, its writes are not committed, and downstream nodes that read its output must tolerate missing values (e.g., by checking when: '$.enriched != null' on their edges).
The built-in Anthropic and OpenAI LLM clients handle HTTP 429 (rate limit) errors with their own exponential backoff independently of on_error. You don't need on_error on kind: agent nodes for rate-limit resilience — only for schema-validation failures or genuinely transient model errors.
Variations
- Fail-fast for deterministic errors: Wrap the tool in a try/catch and re-throw only for retryable errors; throw a different sentinel error class for permanent failures that should not be retried.
- Custom max_attempts per environment: Use
args: { max_attempts: '{{$env.RETRY_ATTEMPTS}}' }and setRETRY_ATTEMPTSin your environment to override at runtime.