April 21, 2026 · Dylan Grech

The Meta-Harness, Two Weeks Later: From Interesting Experiment to Working Contributor

A follow-up on our self-improvement loop. We tore out most of the orchestration, collapsed the pipeline into a single skill, and watched it open multiple human-merged PRs in a single afternoon.

engineering testing ai meta-harness

Two weeks ago we wrote about the meta-harness — an autonomous loop that runs Chalie’s benchmarks, asks Opus what’s wrong, dispatches a coder to fix it, and ships a PR if the scores hold. We said it worked, but barely. Three lines of code for millions of tokens. The local coder was the weakest link. Small improvements were invisible under evaluation noise.

This is the follow-up. The system is now opening multiple human-merged PRs in an afternoon, and almost all of the change came from removing things, not adding them.

What actually shipped

In a single afternoon this week, the loop opened a stream of PRs against Chalie. Most were reviewed and merged by a human the same day. Nothing about the research paper we were emulating changed. What changed was our implementation.

Here is what we did, in rough order of impact.

1. Collapsed the pipeline into a skill

The old loop was a ~850-line Python pipeline: database tracking, opportunity scoring, retry logic, branch orchestration, targeted test runners, baseline capture, worktree management. It worked. It was also the thing hiding the part that actually mattered — Opus looking at a single failure and picking a single fix.

We replaced almost all of it with a slash command. /improve-chalie is now a skill inside Claude Code. The Python wrapper shrank to about 140 lines whose only job is to keep a persistent Opus session alive and re-invoke the skill forever. Every behavioral decision — how to pick an opportunity, when to bail, how to evaluate a result, when to open a PR — lives in the prompt.
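For concreteness, the kernel now has roughly this shape. This is an illustrative sketch rather than the actual file, and it assumes the Claude Code CLI's -p (non-interactive) and --resume flags:

```python
# kernel.py (sketch): the wrapper's whole job is "same session, same skill, again".
import subprocess
import sys
import time

def run_cycle(session_id: str) -> int:
    """One improvement cycle. Every behavioral decision lives in the skill prompt."""
    return subprocess.call(["claude", "-p", "/improve-chalie", "--resume", session_id])

def main(session_id: str) -> None:
    while True:                  # re-invoke the skill forever
        run_cycle(session_id)
        time.sleep(60)           # illustrative pause between cycles

if __name__ == "__main__":
    main(sys.argv[1])            # session ID comes from the persistence described in item 4
```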

The effect compounds. New behaviors now ship as prompt edits, not code changes. When a cycle goes wrong, you fix the skill, not the orchestrator. The loop improving the loop is the real flywheel — and a 140-line kernel is what makes that flywheel spin.

2. Hardened the evals

The previous post called out ±5% noise per run as a fundamental blocker — improvements smaller than the noise floor got randomly accepted or rejected. That wasn’t actually a property of LLM evaluation; it was a property of our evaluation. We rewrote the scenario runner and the judge to be far more deterministic, cut the per-scenario latency substantially, and pinned inputs that were drifting between runs.
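The changes were mundane rather than clever: fix the seed, freeze the clock, stop letting anything the judge sees vary between runs. Something like the sketch below, with every name illustrative and the greedy judge setting an assumption rather than a quote from our runner:

```python
# A sketch of the pinning, not our actual runner. The point: every input the
# scenario or the judge can see is fixed, so a score change means code changed.
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class PinnedEnv:
    seed: int = 1337                    # one seed for any sampled fixtures
    judge_temperature: float = 0.0      # judge decodes greedily (assumption)
    now: str = "2026-04-01T00:00:00Z"   # frozen clock so date logic can't drift

def pinned_rng(env: PinnedEnv) -> random.Random:
    """A local RNG instead of the global one, so concurrent runs can't interleave draws."""
    return random.Random(env.seed)

if __name__ == "__main__":
    rng = pinned_rng(PinnedEnv())
    print([rng.randint(0, 9) for _ in range(5)])   # same sequence every run
```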

The nightly suite is now both faster and far more stable. When a score moves, it’s because something moved.

3. Simplified the codebase

Once the orchestration logic was gone, a lot of supporting machinery followed it out the door. Opportunity scoring tables, retry queues, branch prep scripts, the separate baseline-capture path — all deleted. Every deletion removed a class of bug and a source of drift between “what the loop thinks happened” and “what actually happened.”

The codebase is smaller than it was when we started iterating on the meta-harness. The capability is larger.

4. Persistent session across restarts

The loop tracks a Claude Code session ID on disk. Crashes, machine reboots, context compaction — none of them reset Opus’s working memory. The self-improvement thread is genuinely continuous across days. The model can learn from its own history instead of starting cold every cycle.
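Mechanically it is a small thing: one file, one UUID, one flag. A sketch under the assumption that the claude CLI takes --session-id for new sessions and --resume for existing ones, with a hypothetical path:

```python
# session.py (sketch): first run mints a UUID and starts a session with it;
# every later run resumes the same ID, so crashes and reboots keep Opus's memory.
import uuid
from pathlib import Path

SESSION_FILE = Path(".chalie-improve/session-id")   # hypothetical location

def claude_session_args() -> list[str]:
    """Return the CLI flags that attach this cycle to the persistent session."""
    if SESSION_FILE.exists():
        return ["--resume", SESSION_FILE.read_text().strip()]
    session_id = str(uuid.uuid4())
    SESSION_FILE.parent.mkdir(parents=True, exist_ok=True)
    SESSION_FILE.write_text(session_id)
    return ["--session-id", session_id]
```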

5. Full traces instead of summaries

This is the single biggest quality lever, and it matches what the published meta-harness research found. We added get_scenario_trace — Opus can pull the full turn-by-turn transcript of any scenario: every tool call, every memory lookup, every agent text block, plus the judge’s verdict. Previously it was reasoning from a one-line diagnostic.
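The shape of a trace is roughly the following; the field names here are illustrative, not the tool's actual schema:

```python
# A sketch of what get_scenario_trace hands back: the whole transcript,
# not a summary, is what Opus reasons over.
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    turn: int
    kind: str           # "tool_call" | "memory_lookup" | "agent_text"
    name: str = ""      # tool name or memory key, when applicable
    content: str = ""   # arguments, results, or the text block itself

@dataclass
class ScenarioTrace:
    scenario_id: str
    events: list[TraceEvent] = field(default_factory=list)
    judge_verdict: str = ""    # the judge's full reasoning, not just pass/fail
    score: float = 0.0
```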

The difference is diagnosing from a headline versus diagnosing from the X-ray. Opportunities proposed from traces are specific. They name the function, the branch, the missing check. They stop being “try adding more context” and start being “this tool is called but its output is dropped on line 214.”

6. Deterministic attempt ledger

A JSONL record of every cycle — opportunity, root-cause category, files touched, scores before and after, anomalies before and after, outcome, PR URL. MCP memory handles semantic context; the ledger handles “has this exact angle already been tried?” Together they stop the loop from re-exploring the same dead ends. Over N cycles, this is the difference between drifting and converging.
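A sketch of both halves of the ledger's job, with hypothetical helper names and path:

```python
# ledger.py (sketch): one JSON object per cycle, appended, never rewritten.
import json
from pathlib import Path

LEDGER = Path("attempts.jsonl")   # hypothetical location

def record_attempt(entry: dict) -> None:
    """Append one cycle's record: opportunity, root-cause category, files touched,
    scores and anomalies before/after, outcome, PR URL."""
    with LEDGER.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def already_tried(opportunity: str) -> bool:
    """The deterministic half of 'don't re-explore dead ends': exact-match lookup
    over past opportunities (semantic similarity is MCP memory's job)."""
    if not LEDGER.exists():
        return False
    with LEDGER.open() as f:
        return any(json.loads(line)["opportunity"] == opportunity
                   for line in f if line.strip())
```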

7. CI-aware PR flow

After opening a PR, Opus now waits for all CI checks to finish, reads linter output (ruff, vulture, sonarqube), and keeps pushing fixes until the pipeline is green. A submitted PR is a merge-ready PR, not a draft that needs human babysitting. This is why the review-and-merge rate went up so sharply — the reviewer isn’t spending their first five minutes asking for lint fixes.
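The waiting itself is thin. A sketch assuming the gh CLI's `pr checks --watch` behavior, which blocks until the checks finish and exits non-zero if any failed; the helper name and the retry policy around it are illustrative:

```python
# ci_wait.py (sketch): block until CI settles, report whether it's green.
import subprocess

def wait_for_green(pr: int) -> bool:
    """True once every check on the PR has finished and passed."""
    result = subprocess.run(["gh", "pr", "checks", str(pr), "--watch"])
    return result.returncode == 0

# In the skill's terms: if wait_for_green() is False, read the linter output
# (ruff, vulture, SonarQube), push a fix commit, and wait again.
```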

8. Base-branch flexibility

Nothing hardcodes main anymore. The loop follows whatever branch is active, or $BASE_BRANCH when supplied. We can now point it at an experimental branch, let it run overnight, and inspect the results in the morning without it stomping on the main line of development.
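The resolution rule is exactly what it sounds like; the helper name is illustrative:

```python
# base_branch.py (sketch): $BASE_BRANCH wins, otherwise whatever is checked out.
import os
import subprocess

def base_branch() -> str:
    env = os.environ.get("BASE_BRANCH")
    if env:
        return env
    return subprocess.check_output(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"], text=True
    ).strip()
```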

9. Dynamic container ports and a QA env

The scenario runner can bind to a Docker-assigned port instead of a fixed one, which unblocks concurrent environments. On top of that we added a QA-env API and UI so a fresh Chalie container can be spun up on any branch for manual inspection. It closes the loop between automated scoring and human eyes — when a PR lands, a reviewer can poke at the running system it came from in one click.
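For the ports, the trick is just to let Docker pick the host side of the mapping. A sketch with a hypothetical image tag and container port:

```python
# qa_env.py (sketch): spin up a throwaway Chalie container for a branch and
# return the address Docker assigned it.
import subprocess

def start_qa_env(branch: str) -> str:
    suffix = branch.replace("/", "-")
    name = f"chalie-qa-{suffix}"
    subprocess.check_call([
        "docker", "run", "-d", "--name", name,
        "-p", "8000",                 # host port omitted, so Docker assigns a free one
        f"chalie:{suffix}",           # hypothetical per-branch image tag
    ])
    mapping = subprocess.check_output(["docker", "port", name, "8000"], text=True)
    return mapping.strip().splitlines()[0]    # e.g. "0.0.0.0:49218"
```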

Why the outcome improved so much

Three forces compounded.

Quality of evidence. Opus went from diagnosing failures off the judge’s one-line summary to reading the actual tool calls and memory lookups that produced them. Proposals got specific. Specific proposals are easier to implement correctly and easier to verify.

Memory across cycles. Persistent session plus semantic memory plus deterministic ledger means the loop stops repeating itself. Every failed approach permanently narrows the search space.

Simplicity of the kernel. When orchestration is 140 lines and everything else is a prompt, a single skill edit changes behavior across the entire self-improvement process. Improvements ship as fast as we can think them.

The old post ended with a line about the loop needing “better targeting, fewer wasted passes, and a coder agent that can actually execute on what the research agent discovers.” We got there by deleting code, not writing more.

The system went from an interesting experiment that mostly produced noise to a continuously running contributor whose PRs land. It’s still early — we have a lot of ideas about how to stretch it further — but the shape of the thing is now clear, and the shape is small.