The evolution advisor

Ask the LLM to read a completed run and propose concrete improvements to experience.yaml.

When you need this

You have run an experience at least once and want suggestions on what to add or tune.
A run completed but some dimensions produced weak findings — you want the advisor to notice and propose adding coverage.
You want to tune retry policies, model aliases, or threshold values based on observed behavior.
You are building a feedback-driven improvement cycle without manually reading every event log.

The minimal example

bash

oe run examples/review-branch
# → run_id: abc123

oe evolve examples/review-branch --run-id abc123
# → wrote .openexpertise/evolution/abc123.md  (3 proposals)

oe diff examples/review-branch
# → prints each proposal with its diff block

How it works

oe evolve calls EvolutionAdvisor.analyze() (packages/evolution/src/advisor.ts), which assembles an input payload and sends it to the LLM with a structured-output tool.

What the advisor reads:

Input	Source
`experience_yaml`	Full text of `experience.yaml`
`run_event_count`	Total events in the run log
`sample_events`	First 30 events from `.openexpertise/runs/<run-id>.jsonl`
`state_diff`	Per-field `{ before, after }` diffs from the SQLite history table, filtered to the given `run_id`
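As a rough illustration of the two event inputs in the table above, the sketch below derives `run_event_count` and `sample_events` from the text of a JSONL run log. The function name `summarizeEvents` and the `RunEvent` type are hypothetical, since the real event schema is not shown here. Only the cap of 30 comes from this page.

```typescript
// Hypothetical event shape; the real event schema is not documented on this page.
type RunEvent = Record<string, unknown>;

// The first-30 cap is the one stated in the inputs table.
const SAMPLE_EVENT_CAP = 30;

// Parse JSONL text (one JSON object per line) and derive the two event inputs.
function summarizeEvents(jsonl: string): {
  run_event_count: number;
  sample_events: RunEvent[];
} {
  const events = jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0) // ignore blank lines
    .map((line) => JSON.parse(line) as RunEvent);

  return {
    run_event_count: events.length,
    sample_events: events.slice(0, SAMPLE_EVENT_CAP), // head of the log only
  };
}

// Usage: a synthetic 45-event log stands in for .openexpertise/runs/<run-id>.jsonl
const log = Array.from({ length: 45 }, (_, i) => JSON.stringify({ seq: i })).join("\n");
const summary = summarizeEvents(log);
console.log(summary.run_event_count, summary.sample_events.length); // 45 30
```

In this sketch the advisor would know that 45 events exist but would only see the first 30 of them.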

The state_diff is computed by evolveCommand: it reads the state history for every schema field, filters to rows written during the target run, and emits { field, before: first_write.value_old, after: last_write.value_new }.
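The first-write/last-write rule described above can be sketched as follows. This is not the actual `evolveCommand` code: the `HistoryRow` shape and its column names other than `value_old` and `value_new` are assumptions, and rows are assumed to arrive in write order.

```typescript
// Hypothetical row shape for the state history table; real column names may differ.
interface HistoryRow {
  field: string;
  run_id: string;
  value_old: unknown;
  value_new: unknown;
}

interface FieldDiff {
  field: string;
  before: unknown;
  after: unknown;
}

// Assumes `rows` is ordered oldest write first.
function computeStateDiff(rows: HistoryRow[], runId: string): FieldDiff[] {
  const firstLast = new Map<string, { first: HistoryRow; last: HistoryRow }>();
  for (const row of rows) {
    if (row.run_id !== runId) continue; // keep only writes from the target run
    const seen = firstLast.get(row.field);
    if (seen) {
      seen.last = row;
    } else {
      firstLast.set(row.field, { first: row, last: row });
    }
  }
  // before = value_old of the first write, after = value_new of the last write
  return Array.from(firstLast, ([field, { first, last }]) => ({
    field,
    before: first.value_old,
    after: last.value_new,
  }));
}

const rows: HistoryRow[] = [
  { field: "findings", run_id: "abc123", value_old: [], value_new: ["a"] },
  { field: "findings", run_id: "abc123", value_old: ["a"], value_new: ["a", "b"] },
  { field: "findings", run_id: "def456", value_old: ["a", "b"], value_new: [] },
];
console.log(JSON.stringify(computeStateDiff(rows, "abc123")));
// [{"field":"findings","before":[],"after":["a","b"]}]
```

Intermediate writes collapse into a single entry per field, so the advisor sees the net change for the run rather than every step.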

Proposal operations (V1):

Operation	What it does
`add-node`	Insert a new node + connecting edges. Diff is a unified diff of `experience.yaml`.
`tune-param`	Adjust a literal — a threshold, a model alias, a prompt path, a phase label. Diff is a unified diff.
`add-dataset-case`	Append rows to a dataset source (e.g., add a missing dimension to a fan-out list). Diff is a JSON array of rows.

Forbidden operations (explicitly prohibited by the advisor's system prompt): removing nodes, rewiring edges, changing a node's kind, or modifying state.schema.

Confidence levels:

high — strong evidence from the run (e.g., a specific missing dimension referenced in the state diff).
medium — reasonable inference (e.g., a pattern in the event log suggesting a retry would help).
low — speculative (e.g., a general best-practice improvement not directly evidenced by this run).

The advisor returns up to 5 proposals, sorted by relevance. Each proposal has: operation, confidence, title, rationale (one paragraph citing evidence), and diff (the patch or rows to append).
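For callers that consume proposals as JSON, the fields listed above map naturally onto a small type. The sketch below is an assumption about shape, not the package's exported types: `Proposal`, `parseProposals`, and the choice to treat `diff` as a string are all illustrative. It drops anything outside the three V1 operations and enforces the five-proposal cap.

```typescript
type Operation = "add-node" | "tune-param" | "add-dataset-case";
type Confidence = "high" | "medium" | "low";

// Illustrative shape only; the real exported type may differ.
interface Proposal {
  operation: Operation;
  confidence: Confidence;
  title: string;
  rationale: string; // one paragraph citing evidence
  diff: string; // assumed to be text: a unified diff, or JSON rows for add-dataset-case
}

const MAX_PROPOSALS = 5;
const OPERATIONS = new Set<string>(["add-node", "tune-param", "add-dataset-case"]);
const CONFIDENCES = new Set<string>(["high", "medium", "low"]);

// Narrow untrusted JSON to Proposal[], skipping malformed or forbidden entries.
function parseProposals(raw: unknown): Proposal[] {
  if (!Array.isArray(raw)) return [];
  const valid: Proposal[] = [];
  for (const item of raw) {
    if (typeof item !== "object" || item === null) continue;
    const { operation, confidence, title, rationale, diff } = item as Record<string, unknown>;
    if (
      typeof operation === "string" && OPERATIONS.has(operation) &&
      typeof confidence === "string" && CONFIDENCES.has(confidence) &&
      typeof title === "string" &&
      typeof rationale === "string" &&
      typeof diff === "string"
    ) {
      valid.push({
        operation: operation as Operation,
        confidence: confidence as Confidence,
        title,
        rationale,
        diff,
      });
    }
  }
  return valid.slice(0, MAX_PROPOSALS); // keep at most five, preserving order
}

const parsed = parseProposals([
  { operation: "tune-param", confidence: "medium", title: "Raise retries", rationale: "r", diff: "d" },
  { operation: "remove-node", confidence: "high", title: "Drop node", rationale: "r", diff: "d" },
]);
console.log(parsed.length); // 1
```

A guard like this is a second line of defense: the system prompt forbids operations such as node removal, and the parser simply refuses to pass them on.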

Output: oe evolve writes the rendered Markdown to .openexpertise/evolution/<run-id>.md. The file is never auto-applied — git apply is always a manual step.

Cross-run analysis: stable patterns vs one-off blips

A single run can mislead: one bad LLM completion, one transient 429, one empty fan-out. To separate signal from noise, point the advisor at several runs at once with --runs:

bash

oe evolve --experience examples/review-branch --runs abc123,def456,ghi789
# → wrote .openexpertise/evolution/cross-run-3runs-abc123.md

Instead of analyzing one run, the advisor compares the runs against each other and prioritizes by recurrence:

A pattern that recurs in ≥2 runs (the same failure, the same missing focus area, the same drift) is a stable pattern. The advisor proposes an edit and rates it high (recurs in most/all runs) or medium (recurs in ≥2 but not all). The rationale names the specific runs it was seen in (e.g. "seen in abc123 and ghi789").
A pattern that appears in only one run is a one-off blip. The advisor either omits it or includes it only at low confidence, stating explicitly that it was a single-run anomaly that may not warrant action yet.

This means a high-confidence cross-run proposal is much stronger evidence than a high rating from a single run, because it has been corroborated across executions. The output lands in .openexpertise/evolution/cross-run-<n>runs-<first-id>.md (the single-run path is unchanged and still writes <run-id>.md).
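The recurrence rule above is applied by the LLM, not by deterministic code, but it can be approximated to show how the ratings relate to run counts. In the sketch below, `Observation`, `RatedPattern`, and `ratePatterns` are hypothetical names, and reading "most" as a strict majority of runs is an assumption made for illustration.

```typescript
type Confidence = "high" | "medium" | "low";

// Hypothetical: one record per (run, pattern) sighting.
interface Observation {
  runId: string;
  pattern: string;
}

interface RatedPattern {
  pattern: string;
  seenIn: string[];
  kind: "stable" | "one-off";
  confidence: Confidence;
}

function ratePatterns(observations: Observation[], totalRuns: number): RatedPattern[] {
  const runsByPattern = new Map<string, Set<string>>();
  for (const { runId, pattern } of observations) {
    const runs = runsByPattern.get(pattern) ?? new Set<string>();
    runs.add(runId); // a Set counts each run once, however often the pattern fired
    runsByPattern.set(pattern, runs);
  }
  return Array.from(runsByPattern, ([pattern, runs]): RatedPattern => {
    const seenIn = Array.from(runs);
    if (seenIn.length < 2) {
      // Seen in a single run: a one-off blip, kept only at low confidence.
      return { pattern, seenIn, kind: "one-off", confidence: "low" };
    }
    // Assumption: "most" means a strict majority of the analyzed runs.
    const confidence: Confidence = seenIn.length * 2 > totalRuns ? "high" : "medium";
    return { pattern, seenIn, kind: "stable", confidence };
  });
}

const rated = ratePatterns(
  [
    { runId: "a", pattern: "missing-security-dimension" },
    { runId: "b", pattern: "missing-security-dimension" },
    { runId: "c", pattern: "missing-security-dimension" },
    { runId: "a", pattern: "transient-429" },
    { runId: "d", pattern: "transient-429" },
    { runId: "b", pattern: "empty-fan-out" },
  ],
  4,
);
console.log(rated.map((r) => `${r.pattern}: ${r.confidence}`).join(", "));
// missing-security-dimension: high, transient-429: medium, empty-fan-out: low
```

The `seenIn` list corresponds to the run IDs the rationale is expected to name.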

When to use which

Use the single-run path while you're actively iterating (fast, one run to reason about). Switch to --runs a,b,c once you have a handful of real runs and want to invest only in changes the data corroborates more than once.

Variations

Force a specific LLM provider for the advisor:

bash

oe evolve examples/review-branch --run-id abc123 --llm openai

Run the advisor programmatically:

import { EvolutionAdvisor } from '@openexpertise/evolution'
import { AnthropicLLMClient } from '@openexpertise/node-kinds-agent'

// `spec`, `yamlText`, `events`, and `runId` are assumed to be loaded by the caller.
const advisor = new EvolutionAdvisor({
  client: new AnthropicLLMClient(),
  model: 'claude-opus-4-5',
})

const proposals = await advisor.analyze({
  experienceSpec: spec,
  experienceYamlSource: yamlText,
  runEvents: events,
  // One entry per changed field; the finding objects are elided here.
  stateDiff: [{ field: 'findings', before: [], after: [{ /* ... */ }] }],
})

// Render the proposals as Markdown and print them.
console.log(advisor.renderMarkdown(proposals, runId))

Fan-out dimension detection — the advisor has special handling for for_each-based fan-outs: if the state diff hints at a domain area (raw SQL → security, missing logs → observability) not in the current dimensions list, it prefers add-dataset-case proposals to add the missing focus area.
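Since an add-dataset-case diff is a JSON array of rows, applying one by hand amounts to appending those rows to the fan-out list. The sketch below shows one way to do that while skipping focus areas that already exist. The `DimensionRow` shape (a `name` key plus arbitrary extras) and the function `appendDatasetRows` are hypothetical, because the real dataset row format depends on the experience.

```typescript
// Hypothetical row shape: a `name` key plus any other fields the dataset uses.
interface DimensionRow {
  name: string;
  [key: string]: unknown;
}

// Append rows from an add-dataset-case diff, ignoring malformed or duplicate rows.
function appendDatasetRows(existing: DimensionRow[], diffJson: string): DimensionRow[] {
  const parsed: unknown = JSON.parse(diffJson);
  if (!Array.isArray(parsed)) {
    throw new Error("add-dataset-case diff must be a JSON array of rows");
  }
  const known = new Set(existing.map((row) => row.name));
  const merged = [...existing]; // leave the caller's array untouched
  for (const row of parsed) {
    if (typeof row !== "object" || row === null) continue;
    const name = (row as Record<string, unknown>)["name"];
    if (typeof name !== "string" || known.has(name)) continue;
    known.add(name);
    merged.push({ ...(row as Record<string, unknown>), name });
  }
  return merged;
}

const dimensions: DimensionRow[] = [{ name: "correctness" }, { name: "performance" }];
const updated = appendDatasetRows(
  dimensions,
  '[{"name":"security"},{"name":"performance"},{"name":"observability"}]',
);
console.log(updated.map((row) => row.name).join(", "));
// correctness, performance, security, observability
```

Only rows are appended in this sketch, which matches the V1 restriction that proposals never remove or rewire existing structure.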

Gotchas

The sample_events cap is 30. For long runs, the advisor only sees the first 30 events. Important mid-run or end-run events may not be included. Future versions may sample differently.
The advisor sees experience.yaml as raw text, not as a parsed AST. Diff line numbers in proposals reference that raw text, so reformatting the YAML after the run may cause git apply to fail. See Applying proposals.
No memory across multiple oe evolve calls. Each call is a fresh LLM invocation. The advisor does not track which proposals were applied in earlier runs.
oe diff only prints; it does not apply. Use git apply manually after reviewing each diff block.
