Evolution loop

After a run, `oe evolve <run-id>` asks the same LLM that ran the experience to propose graph upgrades based on what it observed. The proposals are diffs that you apply yourself with `git apply`; they are never auto-applied. This step closes the author → run → evolve loop.

What the advisor reads

EvolutionAdvisor.analyze() is given four inputs:

- The full experience.yaml source, so the LLM knows the current graph.
- The parsed ExperienceSpec, typed for analysis.
- The run's event log: every node.* and state.write event, plus a sample of activity events.
- The state diff: the before and after values of each field across the run.

From this, the advisor produces up to 5 proposals, each with:

interface EvolutionProposal {
  operation: 'add-node' | 'tune-param' | 'add-dataset-case'
  confidence: 'high' | 'medium' | 'low'
  title: string // "Add `security` dimension"
  rationale: string // why this would help
  diff: string // unified diff snippet ready for `git apply`
}
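
Because proposals come back from an LLM, a consumer of this shape would typically check each one before trusting it. The following is a minimal sketch, not the advisor's actual parsing code: `isEvolutionProposal`, `acceptProposals`, and `MAX_PROPOSALS` are hypothetical names introduced here. Only the interface fields and the limit of five come from the text above.

```typescript
type Operation = 'add-node' | 'tune-param' | 'add-dataset-case'
type Confidence = 'high' | 'medium' | 'low'

// Same fields as the EvolutionProposal interface above.
interface EvolutionProposal {
  operation: Operation
  confidence: Confidence
  title: string
  rationale: string
  diff: string
}

const OPERATIONS: string[] = ['add-node', 'tune-param', 'add-dataset-case']
const CONFIDENCES: string[] = ['high', 'medium', 'low']
const MAX_PROPOSALS = 5 // assumption: mirrors "up to 5 proposals"

function isNonEmptyString(value: unknown): value is string {
  return typeof value === 'string' && value.length > 0
}

// Hypothetical type guard: narrows an unknown value to EvolutionProposal.
function isEvolutionProposal(value: unknown): value is EvolutionProposal {
  if (typeof value !== 'object' || value === null) return false
  const v = value as { [key: string]: unknown }
  const operation = v['operation']
  const confidence = v['confidence']
  return (
    typeof operation === 'string' && OPERATIONS.indexOf(operation) !== -1 &&
    typeof confidence === 'string' && CONFIDENCES.indexOf(confidence) !== -1 &&
    isNonEmptyString(v['title']) &&
    typeof v['rationale'] === 'string' &&
    isNonEmptyString(v['diff'])
  )
}

// Hypothetical helper: drops malformed entries and keeps at most five.
function acceptProposals(raw: unknown): EvolutionProposal[] {
  if (!Array.isArray(raw)) return []
  const valid: EvolutionProposal[] = []
  for (const item of raw) {
    if (isEvolutionProposal(item) && valid.length < MAX_PROPOSALS) valid.push(item)
  }
  return valid
}
```

Rejecting an unknown `operation` up front matters here because the three operations are a closed set: a proposal labelled anything else falls outside what the advisor is documented to produce.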

The output

oe evolve r-abc123 writes a markdown file:

# Evolution Proposals for run `r-abc123`

Generated by OpenExpertise EvolutionAdvisor. Review each suggestion
and `git apply` the diff blocks you accept; runtime never auto-applies.

## 1. Add `security` dimension _(add-node, confidence: medium)_

Default reviewers focus on logic + tests. The injection-class bug
on /users/<id> went through all three dimensions without anyone
catching it. A dedicated security reviewer would notice f-strings
in raw SQL.

\`\`\`diff
--- a/examples/review-branch/tools/list_dimensions.mjs
+++ b/examples/review-branch/tools/list_dimensions.mjs
@@ -2,11 +2,12 @@
 export default async function () {
   return {
     state_delta: {
       dimensions: [
         { key: 'bugs', focus: 'logic errors, null derefs' },
         { key: 'perf', focus: 'performance regressions' },
         { key: 'tests', focus: 'missing tests' },
+        { key: 'security', focus: 'injection, authz, secrets' },
       ],
     },
   }
 }
\`\`\`

You read the proposal, decide whether you agree, and apply:

# Extract the embedded diff and apply it
awk '/^```diff$/{f=1;next} /^```$/{f=0} f' \
  .openexpertise/evolution/r-abc123.md | git apply

# Re-run to see the effect
oe run examples/review-branch
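
The `awk` one-liner above concatenates every diff block in the file, so it applies all proposals at once. When you accept only some of them, it helps to split the file by proposal first. The sketch below assumes the layout shown in the sample output (a `## N. title` heading followed by one fenced `diff` block per proposal); `extractDiffs` and `ProposalSection` are hypothetical names, not part of the `oe` CLI.

```typescript
interface ProposalSection {
  index: number   // the N in "## N. title"
  heading: string // heading text after the number
  diff: string    // body of the diff block, newline-terminated
}

const FENCE = '\u0060\u0060\u0060' // three backticks, written as escapes

// Hypothetical helper: splits an evolution markdown file into per-proposal diffs.
function extractDiffs(markdown: string): ProposalSection[] {
  const sections: ProposalSection[] = []
  let current: ProposalSection | null = null
  let inDiff = false
  for (const line of markdown.split('\n')) {
    const heading = /^## (\d+)\. (.*)$/.exec(line)
    if (!inDiff && heading) {
      // A numbered heading starts a new proposal.
      current = { index: Number(heading[1]), heading: heading[2] ?? '', diff: '' }
      sections.push(current)
    } else if (!inDiff && current !== null && line === FENCE + 'diff') {
      inDiff = true
    } else if (inDiff && line === FENCE) {
      inDiff = false
    } else if (inDiff && current !== null) {
      current.diff += line + '\n'
    }
  }
  // Proposals without a diff block are dropped.
  return sections.filter((s) => s.diff.length > 0)
}
```

With the sections separated, you can write the `diff` of each proposal you accept to its own patch file and pass only those files to `git apply`.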

One run, or many: stable patterns vs one-off blips

A single run can mislead. A node that failed once might have hit a transient timeout; a finding the verifier rejected once might be noise. To tell a stable pattern from a one-off blip, the advisor can analyze several runs together:

oe evolve --runs r-abc123,r-def456,r-ghi789

Given multiple runs, the advisor weighs evidence that recurs across two or more runs more heavily than anything that shows up only once. Cross-run proposals land in their own file:

.openexpertise/evolution/cross-run-3runs-r-abc123.md

Instead of reacting to whatever the last run happened to surface, you steer the graph toward the upgrades the evidence keeps pointing at. The single `<run-id>` form is unchanged: use it for a focused look at one run, and use `--runs` when you want the durable signal.
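
The idea of favoring evidence that recurs across runs can be illustrated in a few lines. This is only a sketch of the general technique: how the advisor actually weighs evidence is not described here, and `Observation`, `Evidence`, `groupEvidence`, and the example key format are all hypothetical.

```typescript
// Hypothetical shape: one thing noticed in one run.
interface Observation {
  runId: string
  key: string // e.g. "node.failed:bug_review" (illustrative format)
}

interface Evidence {
  key: string
  runs: string[]     // distinct run ids, in first-seen order
  recurring: boolean // true when seen in two or more runs
}

// Groups observations by key and ranks recurring evidence first.
function groupEvidence(observations: Observation[]): Evidence[] {
  const byKey: { [key: string]: Evidence } = Object.create(null)
  const ordered: Evidence[] = []
  for (const o of observations) {
    let entry = byKey[o.key]
    if (entry === undefined) {
      entry = { key: o.key, runs: [], recurring: false }
      byKey[o.key] = entry
      ordered.push(entry)
    }
    // Repeats inside the same run do not add weight.
    if (entry.runs.indexOf(o.runId) === -1) entry.runs.push(o.runId)
    entry.recurring = entry.runs.length >= 2
  }
  // Most runs first; ties keep first-seen order.
  return ordered
    .map((e, i) => ({ e, i }))
    .sort((a, b) => b.e.runs.length - a.e.runs.length || a.i - b.i)
    .map((x) => x.e)
}
```

Counting distinct runs rather than raw occurrences is the design choice that matters: a node that fails ten times inside one run is still a single-run signal.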

The three operations

The advisor only proposes three kinds of changes:

`add-node`

The most powerful operation: add a new node (or a new dimension item) to address a gap.

Example: "Add a verify_finding node between bug_review and score to filter false positives." The diff inserts a YAML block + a prompt file.

`tune-param`

Adjust an existing parameter. Most commonly:

- Change on_error: { policy: skip } to { policy: retry }
- Increase for_each.concurrency
- Tighten an Ajv schema (add a required field)

Example diff:

       writes: [findings]
-      timeout_ms: 60000
+      timeout_ms: 300000

`add-dataset-case`

Used when the experience seeds a list via a tool (like list_dimensions.mjs). The advisor proposes adding an item to the list — e.g., a new dimension for the review-branch example.

Confidence levels

Each proposal carries a confidence:

- high: the advisor has strong evidence. Usually triggered by clear failure patterns (a node always failing, a finding the verifier always rejects, etc.).
- medium: pattern-recognized, but the advisor isn't sure it generalizes.
- low: speculative. The advisor is reaching.

Treat high as "probably apply", medium as "read carefully", low as "interesting hypothesis, maybe not actionable yet". The CLI doesn't gate on confidence — that's your editorial role.
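
Since the CLI leaves the gating to you, a small script can at least order proposals for review. The sketch below encodes the guidance from the paragraph above; `triage`, `TRIAGE`, and `RANK` are hypothetical names, and nothing like this is claimed to ship with `oe`.

```typescript
type Confidence = 'high' | 'medium' | 'low'

// Review guidance, quoted from the paragraph above.
const TRIAGE: Record<Confidence, string> = {
  high: 'probably apply',
  medium: 'read carefully',
  low: 'interesting hypothesis, maybe not actionable yet',
}

const RANK: Record<Confidence, number> = { high: 0, medium: 1, low: 2 }

interface Titled {
  title: string
  confidence: Confidence
}

// Hypothetical helper: orders proposals high to low and labels each one.
function triage(proposals: Titled[]): string[] {
  return proposals
    .map((p, i) => ({ p, i }))
    // Sort by confidence; ties keep their original order.
    .sort((a, b) => RANK[a.p.confidence] - RANK[b.p.confidence] || a.i - b.i)
    .map(({ p }) => `[${p.confidence}] ${p.title}: ${TRIAGE[p.confidence]}`)
}
```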

When to evolve vs hand-edit

| Situation | Evolve or hand-edit? |
| --- | --- |
| "My graph missed an obvious thing in a run." | Evolve. The advisor sees the run; you'd be re-deriving its observations. |
| "I have a new requirement no run revealed." | Hand-edit. The advisor has no signal for unobserved needs. |
| "The prompt is producing weak findings." | Hand-edit the prompt. The advisor doesn't propose prompt rewrites in V1. |
| "I want to add OpenAI provider support." | Hand-edit YAML. Not a per-run pattern. |
| "I'm not sure why a node took 2 minutes." | Look at `oe inspect <run-id>`. Then either hand-edit or evolve. |

What the advisor does NOT do

- Auto-apply. The contract is explicit: every proposal lands in a markdown file and you decide. There is no `oe evolve --apply` flag in V1.
- Auto-discover runs. It analyzes only the runs you name: a single `<run-id>`, or the comma-separated list you pass to `--runs`. It won't crawl .openexpertise/runs/ to pick runs for you; you choose which traces to feed it.
- Suggest deletions. It proposes additions and tunings, never "delete this node". Deletions are a human concern.
- Edit prompts. The advisor doesn't propose changing the body of a .md prompt file. Prompt rewriting is a different LLM task (it could be a V2 capability).
- Validate its own proposals. The diffs it emits may not apply cleanly, and the CLI does not check them before writing the file. If a diff is malformed, `git apply` reports the error as soon as you try it.
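
Because the diffs are unvalidated, a dry run with `git apply --check` is the authoritative test before applying anything. As a cheaper first pass, one common failure in LLM-written diffs is a hunk header whose line counts do not match the hunk body. The sketch below checks only that one property; `checkHunks` and `HunkReport` are hypothetical names, and passing this check does not guarantee the patch applies.

```typescript
interface HunkReport {
  header: string
  declared: { old: number; new: number } // counts from the @@ header
  actual: { old: number; new: number }   // counts found in the body
  ok: boolean
}

const HUNK_HEADER = /^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@/

function isComplete(h: HunkReport): boolean {
  return h.actual.old === h.declared.old && h.actual.new === h.declared.new
}

// Hypothetical helper: compares each hunk header with the lines that follow it.
function checkHunks(diff: string): HunkReport[] {
  const reports: HunkReport[] = []
  let current: HunkReport | null = null
  for (const line of diff.split('\n')) {
    const m = HUNK_HEADER.exec(line)
    if (m) {
      current = {
        header: line,
        // An omitted count means 1 in unified diff format.
        declared: {
          old: m[1] === undefined ? 1 : Number(m[1]),
          new: m[2] === undefined ? 1 : Number(m[2]),
        },
        actual: { old: 0, new: 0 },
        ok: false,
      }
      reports.push(current)
      continue
    }
    if (current === null) continue // file headers before the first hunk
    // Once the counts are satisfied, a file header or blank line ends the hunk.
    if (isComplete(current) && /^(diff |--- |\+\+\+ |$)/.test(line)) {
      current = null
      continue
    }
    const c = line.charAt(0)
    if (c === ' ' || line === '') { current.actual.old++; current.actual.new++ }
    else if (c === '-') current.actual.old++
    else if (c === '+') current.actual.new++
    // "\ No newline at end of file" markers are not counted.
  }
  for (const r of reports) r.ok = isComplete(r)
  return reports
}
```

Any report with `ok: false` is worth fixing by hand or regenerating before you spend time on the rest of the proposal.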

Closing the loop

The full lifecycle:

        oe ultra "<task>"                                      [LLM authors the YAML]
              ↓
   experience.yaml + tools/* + prompts/*                       [git-tracked artifact]
              ↓
        oe run examples/<name>                                 [LLM runs the agent nodes]
              ↓
     .openexpertise/runs/<id>.jsonl + state.sqlite             [trace + state]
              ↓
        oe evolve <run-id>                                     [LLM proposes upgrades]
              ↓
     .openexpertise/evolution/<id>.md                          [markdown with diff]
              ↓ (you decide)
        git apply ...                                          [the graph improves]
              ↓
        oe run examples/<name>                                 [run the upgraded graph]
              ↓
        … and around again

The same LLM provider drives all three stages: authoring, runtime, and evolution. All three obtain it from the same llm-factory in packages/cli/src/llm-factory.ts. One closed loop, three roles for one model.

Worked example

See examples/review-branch for the canonical evolution-loop walkthrough: 3-dimension review → miss SQL injection → advisor proposes security dimension → re-run catches it.

→ Continue with the evolution advisor guide for the operational how-to.
