Evolution loop
After every run, oe evolve <run-id> asks the same LLM that ran the experience to propose graph upgrades based on what it observed. The proposals are diffs you git apply — never auto-applied. This is what closes the author → run → evolve loop.
What the advisor reads
EvolutionAdvisor.analyze() is given four inputs:
- The full experience.yaml source, so the LLM knows the current graph.
- The parsed ExperienceSpec, typed for analysis.
- The run's event log: every node.*, state.write, plus a sample of activity events.
- The state diff: per-field before/after values across the run.
From this, the advisor produces up to 5 proposals, each with:
interface EvolutionProposal {
  operation: 'add-node' | 'tune-param' | 'add-dataset-case'
  confidence: 'high' | 'medium' | 'low'
  title: string      // "Add `security` dimension"
  rationale: string  // why this would help
  diff: string       // unified diff snippet ready for `git apply`
}
The output
oe evolve r-abc123 writes a markdown file:
# Evolution Proposals for run `r-abc123`
Generated by OpenExpertise EvolutionAdvisor. Review each suggestion
and `git apply` the diff blocks you accept; runtime never auto-applies.
## 1. Add `security` dimension _(add-node, confidence: medium)_
Default reviewers focus on logic + tests. The injection-class bug
on /users/<id> went through all three dimensions without anyone
catching it. A dedicated security reviewer would notice f-strings
in raw SQL.
\`\`\`diff
--- a/examples/review-branch/tools/list_dimensions.mjs
+++ b/examples/review-branch/tools/list_dimensions.mjs
@@ -2,11 +2,12 @@
 export default async function () {
   return {
     state_delta: {
       dimensions: [
         { key: 'bugs', focus: 'logic errors, null derefs' },
         { key: 'perf', focus: 'performance regressions' },
         { key: 'tests', focus: 'missing tests' },
+        { key: 'security', focus: 'injection, authz, secrets' },
       ],
     },
   }
 }
\`\`\`
You read the proposal, decide whether you agree, and apply:
# Extract the embedded diff and apply it
awk '/^```diff$/{f=1;next} /^```$/{f=0} f' \
.openexpertise/evolution/r-abc123.md | git apply
# Re-run to see the effect
oe run examples/review-branch
The three operations
The advisor only proposes three kinds of changes:
add-node
The most powerful. Add a new node (or new dimension item) to address a gap.
Example: "Add a verify_finding node between bug_review and score to filter false positives." The diff inserts a YAML block + a prompt file.
tune-param
Adjust an existing parameter. Most commonly:
- Change on_error: { policy: skip } → retry
- Increase for_each.concurrency
- Tighten an AJV schema (add a required field)
Example diff:
writes: [findings]
- timeout_ms: 60000
+ timeout_ms: 300000
add-dataset-case
Used when the experience seeds a list via a tool (like list_dimensions.mjs). The advisor proposes adding an item to the list — e.g., a new dimension for the review-branch example.
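To make the target of this operation concrete, here is a TypeScript-flavoured sketch of a seeding tool after one item has been appended. The list contents come from the sample diff earlier on this page; the function name listDimensions and the Dimension type are assumptions for illustration, and the real tool is a .mjs file with a default export.

```typescript
// Hypothetical type; the source only shows objects with `key` and `focus`.
interface Dimension {
  key: string
  focus: string
}

// Returns the seed list as a state delta, mirroring the shape in the sample diff.
async function listDimensions(): Promise<{ state_delta: { dimensions: Dimension[] } }> {
  return {
    state_delta: {
      dimensions: [
        { key: 'bugs', focus: 'logic errors, null derefs' },
        { key: 'perf', focus: 'performance regressions' },
        { key: 'tests', focus: 'missing tests' },
        // The kind of item a list-extending proposal would append:
        { key: 'security', focus: 'injection, authz, secrets' },
      ],
    },
  }
}
```

Because the change is a single appended list entry, the resulting diff tends to be small and easy to review by eye.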
Confidence levels
Each proposal carries a confidence:
- high: the advisor has strong evidence. Usually triggered by clear failure patterns (a node always failing, a finding the verifier always rejects, etc.).
- medium: a pattern was recognized, but the advisor isn't sure it generalizes.
- low: speculative. The advisor is reaching.
Treat high as "probably apply", medium as "read carefully", low as "interesting hypothesis, maybe not actionable yet". The CLI doesn't gate on confidence — that's your editorial role.
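Since the CLI leaves triage to you, a small helper can order proposals for review. The sketch below reuses the EvolutionProposal shape shown earlier; reviewOrder, describe, RANK, and ADVICE are hypothetical names and not part of the CLI.

```typescript
type Confidence = 'high' | 'medium' | 'low'

interface EvolutionProposal {
  operation: 'add-node' | 'tune-param' | 'add-dataset-case'
  confidence: Confidence
  title: string
  rationale: string
  diff: string
}

// Lower rank sorts first, so high-confidence proposals are reviewed first.
const RANK: Record<Confidence, number> = { high: 0, medium: 1, low: 2 }

// Review guidance per level, quoted from the paragraph above.
const ADVICE: Record<Confidence, string> = {
  high: 'probably apply',
  medium: 'read carefully',
  low: 'interesting hypothesis, maybe not actionable yet',
}

// Returns a sorted copy; the input array is left untouched.
function reviewOrder(proposals: EvolutionProposal[]): EvolutionProposal[] {
  return [...proposals].sort((a, b) => RANK[a.confidence] - RANK[b.confidence])
}

// One-line summary suitable for a review checklist.
function describe(p: EvolutionProposal): string {
  return `[${p.confidence}] ${p.title}: ${ADVICE[p.confidence]}`
}
```

The sort only decides reading order. Nothing here applies a diff, which keeps the decision with the reviewer.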
When to evolve vs hand-edit
| Situation | Evolve or hand-edit? |
|---|---|
| "My graph missed an obvious thing in a run." | Evolve. The advisor sees the run; you'd be re-deriving its observations. |
| "I have a new requirement no run revealed." | Hand-edit. The advisor has no signal for unobserved needs. |
| "The prompt is producing weak findings." | Hand-edit the prompt. The advisor doesn't propose prompt rewrites in V1. |
| "I want to add OpenAI provider support." | Hand-edit YAML. Not a per-run pattern. |
| "I'm not sure why a node took 2 minutes." | Look at oe inspect <run-id>. Then either hand-edit or evolve. |
What the advisor does NOT do
- Auto-apply. The contract is explicit: every proposal lands in a markdown file, and you decide. There is no oe evolve --apply flag in V1.
- Reach across runs. Each oe evolve <run-id> looks at one run's events + state diff. To find patterns across multiple runs, write a loop yourself.
- Suggest deletions. It proposes additions and tunings, never "delete this node". Deletions are a human concern.
- Edit prompts. The advisor doesn't propose changing the body of a .md prompt file. Prompt rewriting is a different LLM task (could be a V2 capability).
- Validate its own proposals. The diffs it emits may not git apply cleanly. The CLI doesn't validate the diffs before writing them. You'll know within seconds (git apply errors).
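Two of these gaps (no cross-run view, no diff validation) can be covered with a little scripting. The following is a minimal sketch that assumes the proposal files follow the heading and diff-fence format of the sample output above; parseProposals, tallyAcrossRuns, and ParsedProposal are hypothetical names and not part of the CLI.

```typescript
type Confidence = 'high' | 'medium' | 'low'

interface ParsedProposal {
  title: string
  operation: string
  confidence: Confidence
  diffs: string[] // one entry per diff fence found under the heading
}

// Matches headings like: ## 1. Add `security` dimension _(add-node, confidence: medium)_
const HEADING = /^## \d+\. (.+) _\(([\w-]+), confidence: (high|medium|low)\)_\s*$/

// Parses one proposal file into headings plus the diff bodies beneath each.
function parseProposals(markdown: string): ParsedProposal[] {
  const proposals: ParsedProposal[] = []
  let current: ParsedProposal | undefined = undefined
  let diffLines: string[] | undefined = undefined // defined only inside a diff fence

  for (const line of markdown.split('\n')) {
    if (diffLines !== undefined) {
      if (line === '```') {
        // Closing fence: store the collected diff with a trailing newline.
        current?.diffs.push(diffLines.join('\n') + '\n')
        diffLines = undefined
      } else {
        diffLines.push(line)
      }
      continue
    }
    const match = HEADING.exec(line)
    if (match) {
      current = {
        title: match[1],
        operation: match[2],
        confidence: match[3] as Confidence,
        diffs: [],
      }
      proposals.push(current)
    } else if (line === '```diff' && current) {
      diffLines = []
    }
  }
  return proposals
}

// Counts how often the same "operation: title" pair appears across several files.
function tallyAcrossRuns(files: string[]): Map<string, number> {
  const counts = new Map<string, number>()
  for (const markdown of files) {
    for (const p of parseProposals(markdown)) {
      const key = `${p.operation}: ${p.title}`
      counts.set(key, (counts.get(key) ?? 0) + 1)
    }
  }
  return counts
}
```

Feed tallyAcrossRuns the contents of each file under .openexpertise/evolution/. A proposal that recurs across runs is arguably a stronger signal than any single confidence label. Each extracted diff can also be dry-run with git apply --check before you apply it.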
Closing the loop
The full lifecycle:
oe ultra "<task>" [LLM authors the YAML]
↓
experience.yaml + tools/* + prompts/* [git-tracked artifact]
↓
oe run examples/<name> [LLM runs the agent nodes]
↓
.openexpertise/runs/<id>.jsonl + state.sqlite [trace + state]
↓
oe evolve <run-id> [LLM proposes upgrades]
↓
.openexpertise/evolution/<id>.md [markdown with diff]
↓ (you decide)
git apply ... [the graph improves]
↓
oe run examples/<name> [run the upgraded graph]
↓
… and around again
The same LLM provider drives all three stages: authoring, runtime, evolution. They share the same llm-factory in packages/cli/src/llm-factory.ts. One closed loop, three roles for one model.
Worked example
See examples/review-branch for the canonical evolution-loop walkthrough: 3-dimension review → miss SQL injection → advisor proposes security dimension → re-run catches it.
→ Continue with the evolution advisor guide for the operational how-to.