Evolution loop

After every run, oe evolve <run-id> asks the same LLM that ran the experience to propose graph upgrades based on what it observed. The proposals are diffs you git apply — never auto-applied. This is what closes the author → run → evolve loop.

What the advisor reads

EvolutionAdvisor.analyze() is given four inputs:

  1. The full experience.yaml source — so the LLM knows the current graph.
  2. The parsed ExperienceSpec — typed for analysis.
  3. The run's event log, covering every node.* and state.write event plus a sample of activity events.
  4. The state diff — per-field, the before/after values across the run.
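
As an illustration of the fourth input, here is a minimal sketch of a per-field state diff. The names `StateSnapshot`, `FieldChange`, and `diffState` are hypothetical and are not taken from the OpenExpertise codebase; the real advisor may compute its diff differently.

```typescript
// Hypothetical shapes: the real state and diff types are not shown in this page.
type StateSnapshot = Record<string, unknown>

interface FieldChange {
  field: string
  before: unknown
  after: unknown
}

// Compare two snapshots field by field and keep only the fields whose
// serialized value differs between the start and the end of the run.
function diffState(before: StateSnapshot, after: StateSnapshot): FieldChange[] {
  const fields = Array.from(
    new Set(Object.keys(before).concat(Object.keys(after)))
  ).sort()
  const changes: FieldChange[] = []
  for (const field of fields) {
    // JSON comparison is a simplification: it is sensitive to key order
    // and treats a missing field as `undefined`.
    if (JSON.stringify(before[field]) !== JSON.stringify(after[field])) {
      changes.push({ field, before: before[field], after: after[field] })
    }
  }
  return changes
}
```

A field that exists only in the final snapshot shows up with `before: undefined`, which is the kind of signal that tells a reader which nodes actually wrote state.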

From these inputs, the advisor produces up to 5 proposals, each with:

ts
interface EvolutionProposal {
  operation: 'add-node' | 'tune-param' | 'add-dataset-case'
  confidence: 'high' | 'medium' | 'low'
  title: string // "Add `security` dimension"
  rationale: string // why this would help
  diff: string // unified diff snippet ready for `git apply`
}
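
Because the proposals come from an LLM, code that consumes them may want to check their shape first. The sketch below is one way to do that, assuming the proposals arrive as parsed JSON. The helpers `isEvolutionProposal` and `acceptProposals` are hypothetical, and the cap of 5 mirrors the limit stated above rather than any confirmed constant in the codebase.

```typescript
interface EvolutionProposal {
  operation: 'add-node' | 'tune-param' | 'add-dataset-case'
  confidence: 'high' | 'medium' | 'low'
  title: string
  rationale: string
  diff: string
}

const OPERATIONS: readonly string[] = ['add-node', 'tune-param', 'add-dataset-case']
const CONFIDENCES: readonly string[] = ['high', 'medium', 'low']
const MAX_PROPOSALS = 5 // assumption: mirrors the "up to 5 proposals" limit

// Type guard: true only when every field is present with an allowed value.
function isEvolutionProposal(value: unknown): value is EvolutionProposal {
  if (typeof value !== 'object' || value === null) return false
  const p = value as Record<string, unknown>
  const operation = p['operation']
  const confidence = p['confidence']
  return (
    typeof operation === 'string' &&
    OPERATIONS.indexOf(operation) !== -1 &&
    typeof confidence === 'string' &&
    CONFIDENCES.indexOf(confidence) !== -1 &&
    typeof p['title'] === 'string' &&
    typeof p['rationale'] === 'string' &&
    typeof p['diff'] === 'string'
  )
}

// Drop malformed entries and keep at most MAX_PROPOSALS of the rest.
function acceptProposals(raw: unknown): EvolutionProposal[] {
  if (!Array.isArray(raw)) return []
  return raw.filter(isEvolutionProposal).slice(0, MAX_PROPOSALS)
}
```

A proposal with an operation outside the three listed values (for example a deletion) is filtered out rather than passed through.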

The output

oe evolve r-abc123 writes a markdown file to .openexpertise/evolution/r-abc123.md:

markdown
# Evolution Proposals for run `r-abc123`

Generated by OpenExpertise EvolutionAdvisor. Review each suggestion
and `git apply` the diff blocks you accept; runtime never auto-applies.

## 1. Add `security` dimension _(add-node, confidence: medium)_

Default reviewers focus on logic + tests. The injection-class bug
on /users/<id> went through all three dimensions without anyone
catching it. A dedicated security reviewer would notice f-strings
in raw SQL.

\`\`\`diff
--- a/examples/review-branch/tools/list_dimensions.mjs
+++ b/examples/review-branch/tools/list_dimensions.mjs
@@ -2,11 +2,12 @@
 export default async function () {
   return {
     state_delta: {
       dimensions: [
         { key: 'bugs', focus: 'logic errors, null derefs' },
         { key: 'perf', focus: 'performance regressions' },
         { key: 'tests', focus: 'missing tests' },
+        { key: 'security', focus: 'injection, authz, secrets' },
       ],
     },
   }
 }
\`\`\`

You read the proposal, decide whether you agree, and apply:

bash
# Extract the embedded diff and apply it
awk '/^```diff$/{f=1;next} /^```$/{f=0} f' \
  .openexpertise/evolution/r-abc123.md | git apply

# Re-run to see the effect
oe run examples/review-branch
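
The awk one-liner concatenates every diff block in the file, which applies all proposals at once. If you would rather accept them one at a time, the same extraction can be sketched in TypeScript. `extractDiffBlocks` is a hypothetical helper, not part of the oe CLI, and reading the markdown file and piping a block to git apply are left to the caller.

```typescript
// The fence marker is built with repeat() so this sketch contains no
// literal fence of its own.
const FENCE = '`'.repeat(3)

// Collect the body of every fenced diff block, in file order.
// Mirrors the awk script: start after a diff fence line, stop at the
// next bare fence line.
function extractDiffBlocks(markdown: string): string[] {
  const blocks: string[] = []
  let current: string[] | null = null
  for (const line of markdown.split(/\r?\n/)) {
    if (current === null) {
      if (line === FENCE + 'diff') current = []
    } else if (line === FENCE) {
      // Patches should end with a newline, so add one back.
      blocks.push(current.join('\n') + '\n')
      current = null
    } else {
      current.push(line)
    }
  }
  return blocks
}
```

Each returned string is one proposal's diff, so the block at index 0 corresponds to proposal 1 in the file. An unterminated block at the end of the file is dropped rather than returned half-finished.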

The three operations

The advisor only proposes three kinds of changes:

add-node

The most powerful operation. It adds a new node (or a new dimension item) to address a gap.

Example: "Add a verify_finding node between bug_review and score to filter false positives." The diff inserts a YAML block and adds a prompt file.

tune-param

Adjust an existing parameter. Most commonly:

  • Change on_error: { policy: skip } to retry
  • Increase for_each.concurrency
  • Tighten an AJV schema (add a required field)

Example diff:

diff
       writes: [findings]
-      timeout_ms: 60000
+      timeout_ms: 300000

add-dataset-case

Used when the experience seeds a list via a tool (like list_dimensions.mjs). The advisor proposes adding an item to the list — e.g., a new dimension for the review-branch example.

Confidence levels

Each proposal carries a confidence:

  • high — the advisor has strong evidence. Usually triggered by clear failure patterns (a node always failing, a finding the verifier always rejects, etc.).
  • medium — pattern-recognized but the advisor isn't sure it generalizes.
  • low — speculative. The advisor is reaching.

Treat high as "probably apply", medium as "read carefully", low as "interesting hypothesis, maybe not actionable yet". The CLI doesn't gate on confidence — that's your editorial role.
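
Since the CLI leaves the gating to you, a small triage step can put the strongest proposals first. This is a sketch under the assumption that you have the proposals as objects with a `confidence` field. `triage` and `RANK` are hypothetical names, not part of the CLI.

```typescript
type Confidence = 'high' | 'medium' | 'low'

interface RankedProposal {
  title: string
  confidence: Confidence
}

// Lower rank sorts first: high, then medium, then low.
const RANK: Record<Confidence, number> = { high: 0, medium: 1, low: 2 }

// Return a sorted copy; the input array is left untouched.
// Array.prototype.sort is stable in modern runtimes, so proposals with
// the same confidence keep their original order.
function triage<T extends { confidence: Confidence }>(proposals: T[]): T[] {
  return proposals.slice().sort((a, b) => RANK[a.confidence] - RANK[b.confidence])
}
```

Sorting does not replace the review itself; it only decides what you read first.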

When to evolve vs hand-edit

| Situation | Evolve or hand-edit? |
| --- | --- |
| "My graph missed an obvious thing in a run." | Evolve. The advisor sees the run; you'd be re-deriving its observations. |
| "I have a new requirement no run revealed." | Hand-edit. The advisor has no signal for unobserved needs. |
| "The prompt is producing weak findings." | Hand-edit the prompt. The advisor doesn't propose prompt rewrites in V1. |
| "I want to add OpenAI provider support." | Hand-edit YAML. Not a per-run pattern. |
| "I'm not sure why a node took 2 minutes." | Look at oe inspect <run-id>. Then either hand-edit or evolve. |

What the advisor does NOT do

  • Auto-apply. The contract is explicit: every proposal lands in a markdown file, you decide. No oe evolve --apply flag in V1.
  • Reach across runs. Each oe evolve <run-id> looks at one run's events + state diff. To find patterns across multiple runs, write a loop yourself.
  • Suggest deletions. It proposes additions and tunings, never "delete this node". Deletions are a human concern.
  • Edit prompts. The advisor doesn't propose changing the body of a .md prompt file. Prompt rewriting is a different LLM task (could be a V2 capability).
  • Validate its own proposals. The CLI writes the diffs without checking them, so a diff may not apply cleanly. git apply reports the failure immediately, and git apply --check lets you test a diff without touching the working tree.
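
For the cross-run case, one possible do-it-yourself approach is to run oe evolve once per run and then count which proposal titles recur. The sketch below assumes the heading format shown in the sample output above (`## N. Title _(operation, confidence: level)_`). `parseHeadings` and `countRecurring` are hypothetical helpers, and loading the markdown files is left to the caller.

```typescript
interface ProposalHeading {
  title: string
  operation: string
  confidence: string
}

// Matches headings such as:
//   ## 1. Add `security` dimension _(add-node, confidence: medium)_
const HEADING = /^## \d+\. (.+) _\(([a-z-]+), confidence: (high|medium|low)\)_$/

// Pull the title, operation, and confidence out of each proposal heading.
function parseHeadings(markdown: string): ProposalHeading[] {
  const headings: ProposalHeading[] = []
  for (const line of markdown.split(/\r?\n/)) {
    const match = HEADING.exec(line)
    if (match) {
      const [, title = '', operation = '', confidence = ''] = match
      headings.push({ title, operation, confidence })
    }
  }
  return headings
}

// Count in how many runs each proposal title appears.
// A title is counted at most once per run.
function countRecurring(files: string[]): Map<string, number> {
  const counts = new Map<string, number>()
  for (const markdown of files) {
    const titles = new Set(parseHeadings(markdown).map((h) => h.title))
    titles.forEach((title) => counts.set(title, (counts.get(title) ?? 0) + 1))
  }
  return counts
}
```

Matching on exact titles is crude, since the LLM may word the same idea differently from run to run, so treat the counts as a starting point for review rather than a reliable signal.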

Closing the loop

The full lifecycle:

        oe ultra "<task>"                                      [LLM authors the YAML]
              ↓
   experience.yaml + tools/* + prompts/*                       [git-tracked artifact]
              ↓
        oe run examples/<name>                                 [LLM runs the agent nodes]
              ↓
     .openexpertise/runs/<id>.jsonl + state.sqlite             [trace + state]
              ↓
        oe evolve <run-id>                                     [LLM proposes upgrades]
              ↓
     .openexpertise/evolution/<id>.md                          [markdown with diff]
              ↓ (you decide)
        git apply ...                                          [the graph improves]
              ↓
        oe run examples/<name>                                 [run the upgraded graph]
              ↓
        … and around again

The same LLM provider drives all three stages — authoring, runtime, evolution. They share the same llm-factory in packages/cli/src/llm-factory.ts. One closed loop, three roles for one model.

Worked example

See examples/review-branch for the canonical evolution-loop walkthrough: 3-dimension review → miss SQL injection → advisor proposes security dimension → re-run catches it.

→ Continue with the evolution advisor guide for the operational how-to.

Released under the MIT License.