The Case for Lightweight Coding Agents: Performance Matters in RSI

Why recursive self-improvement changes the performance calculus for coding agents

I'm building payment rails for agent-to-agent payments

The Hacker News front page this week turned into a benchmark war. Claude Code uses 1GB of RAM. Codex runs on Rust at 80MB. OpenCode hit 921 points. Developers argued about whether TypeScript is disqualifying for a coding agent, whether Rust is overkill, whether memory footprint even matters when your laptop has 32GB of RAM.

For one-shot coding tasks, most of this debate is noise. Your agent runs, generates code, exits. Whether it used 80MB or 1GB during that 30-second window doesn't change the output.

But for recursive self-improvement (RSI), the performance debate isn't noise. It's the core design constraint.

Why RSI Agents Can't Be Heavy

An RSI agent doesn't run once. It runs continuously. It measures its own performance, hypothesizes improvements, mutates its own code, tests the mutations, and either applies or discards them. Then it does it again.

Each cycle involves:

  • Reading and analyzing its own source code
  • Generating candidate modifications
  • Running the modified version in a sandboxed environment
  • Comparing performance metrics between original and modified versions
  • Persisting the winner and starting the next cycle
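The steps above can be sketched as a simple mutate-test-select loop. This is a minimal illustration, not the author's engine: `rsi_loop`, `nudge`, and `toy_score` are hypothetical names, the "source code" is reduced to a dictionary of parameters, and the sandboxed test run is stood in for by a plain function call.

```python
import random
from typing import Callable

Params = dict[str, float]


def rsi_loop(
    params: Params,
    score: Callable[[Params], float],
    mutate: Callable[[Params, random.Random], Params],
    cycles: int,
    seed: int = 0,
) -> tuple[Params, float]:
    """Run `cycles` mutate-test-select rounds; return the best params and score."""
    rng = random.Random(seed)  # seeded so a run can be replayed exactly
    best, best_score = dict(params), score(params)
    for _ in range(cycles):
        candidate = mutate(best, rng)        # generate a candidate modification
        candidate_score = score(candidate)   # stand-in for the sandboxed test run
        if candidate_score > best_score:     # persist the winner, discard the rest
            best, best_score = candidate, candidate_score
    return best, best_score


def nudge(params: Params, rng: random.Random) -> Params:
    """Hypothetical mutation: perturb one parameter by a small random step."""
    key = rng.choice(sorted(params))
    return {**params, key: params[key] + rng.uniform(-0.5, 0.5)}


def toy_score(params: Params) -> float:
    """Hypothetical objective whose maximum is at x = 3."""
    return -((params["x"] - 3.0) ** 2)


if __name__ == "__main__":
    print(rsi_loop({"x": 0.0}, toy_score, nudge, cycles=200, seed=1))
```

Because a candidate replaces the incumbent only when it scores strictly higher, the returned score can never be worse than the starting point.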

If each cycle consumes 1GB of RAM and takes 45 seconds to cold-start, you hit practical limits fast. Running two versions side-by-side for comparison testing means 2GB just for the agent instances. Add the actual workload (trading signals, API calls, data processing) and you're competing with the production system for resources.

A lighter agent means more mutation cycles per hour. More cycles mean faster improvement. Faster improvement is the entire point of RSI.
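To make the throughput argument concrete, here is a rough back-of-the-envelope calculation. The function name `cycles_per_hour`, the 4GB RAM budget, and the 15-second test duration are assumptions chosen for illustration, not measurements; the model also assumes instances run fully in parallel and that RAM is the only limit.

```python
def cycles_per_hour(
    cold_start_s: float,
    test_s: float,
    ram_budget_mb: int,
    footprint_mb: int,
) -> float:
    """Estimate test cycles per hour when RAM caps how many instances run at once."""
    parallel = ram_budget_mb // footprint_mb  # whole instances that fit in the budget
    return parallel * 3600 / (cold_start_s + test_s)


if __name__ == "__main__":
    # Illustrative inputs only: a 1GB agent with a 45-second cold start
    # versus an 80MB runner with a 2-second cold start.
    print(cycles_per_hour(45, 15, ram_budget_mb=4096, footprint_mb=1024))  # 240.0
    print(cycles_per_hour(2, 15, ram_budget_mb=4096, footprint_mb=80))     # 10800.0
```

Real throughput will be lower once CPU contention and the production workload are taken into account, but the gap between the two configurations is the point.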

Determinism Matters More Than Speed

Raw execution speed is less important than deterministic behavior. Here's why:

When an RSI agent tests a mutation, it needs to compare "before" and "after" performance. If the agent's behavior varies between runs due to non-deterministic factors - garbage collection pauses, async scheduling differences, memory pressure from other processes - the comparison is unreliable.

A mutation that looks like a 3% improvement might just be a lucky GC cycle. A mutation that looks harmful might have been tested during a memory-pressure event. Non-determinism poisons the feedback loop.
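One way to guard against this is to repeat each measurement and accept a mutation only when its gain clearly exceeds the run-to-run spread. The sketch below is one possible rule, not the author's method: `is_real_improvement` and the threshold `k` are hypothetical, and the sample numbers are made up to mirror the 3% case above.

```python
import statistics
from collections.abc import Sequence


def is_real_improvement(
    baseline_runs: Sequence[float],
    candidate_runs: Sequence[float],
    k: float = 2.0,
) -> bool:
    """Accept only if the median gain exceeds k times the larger run-to-run spread."""
    gain = statistics.median(candidate_runs) - statistics.median(baseline_runs)
    noise = max(statistics.stdev(baseline_runs), statistics.stdev(candidate_runs))
    return gain > k * noise


if __name__ == "__main__":
    baseline = [100, 101, 99, 100, 102]
    steady = [103, 104, 102, 103, 105]  # about 3% better, low variance
    noisy = [97, 109, 100, 103, 106]    # same median, high variance
    print(is_real_improvement(baseline, steady))  # True
    print(is_real_improvement(baseline, noisy))   # False
```

Both candidates show the same median gain, but only the low-variance one clears the noise floor. A stricter setup might use a proper statistical test instead of a fixed multiple of the standard deviation.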

Languages and runtimes that offer more predictable performance characteristics produce cleaner RSI signals:

  • Predictable memory allocation (no GC pauses during critical measurement windows)
  • Consistent startup time (cold-start variance confounds cycle-time metrics)
  • Low overhead for process spawning (each test cycle may spawn a fresh instance)

This doesn't automatically mean "use Rust." It means: understand your runtime's variance profile and design around it. A well-configured Node.js process with fixed heap allocation can be more deterministic than a naive Rust binary doing excessive dynamic allocation.
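Measuring a runtime's variance profile can start very simply: spawn the same command several times and look at the spread, not just the mean. This sketch assumes a hypothetical helper named `cold_start_profile` and uses the Python interpreter itself as the command under test; substitute whatever binary the agent actually launches.

```python
import statistics
import subprocess
import sys
import time


def cold_start_profile(cmd: list[str], runs: int = 10) -> dict[str, float]:
    """Spawn `cmd` repeatedly and summarize wall-clock start-to-exit time.

    `runs` must be at least 2, because the standard deviation needs two samples.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    # Coefficient of variation: spread relative to the mean, comparable across runtimes.
    return {"mean_s": mean, "stdev_s": stdev, "cv": stdev / mean}


if __name__ == "__main__":
    print(cold_start_profile([sys.executable, "-c", "pass"]))
```

A high coefficient of variation is a warning that cycle-time metrics will be noisy regardless of how fast the average start is.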

What We Learned Building RSI Into a Trading Agent

Our BTC perpetual trading agent runs RSI cycles against live market data. The self-improver evaluates strategy parameters, proposes mutations, and tests them against historical performance before applying them to live trading.

Three lessons from doing this in production:

1. Cold-start time dominates cycle latency. Each RSI cycle spawns a fresh agent instance to test mutations in isolation. If cold-start takes 10 seconds, and you want 100 test cycles per evaluation window, that's 1,000 seconds, nearly 17 minutes, spent on startup alone. We cut cold-start to under 2 seconds by pre-loading only the modules each test needs, which brings the same 100 cycles to under 200 seconds of startup.
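The idea of loading only what each test needs could look roughly like the following. The manifest `TEST_MODULES`, the test names in it, and the helpers `preload_for` and `timed_preload` are all hypothetical; the standard-library modules stand in for whatever an agent's tests really import.

```python
import importlib
import time
from types import ModuleType

# Hypothetical manifest: which modules each test actually needs.
TEST_MODULES: dict[str, list[str]] = {
    "parse_signal": ["json"],
    "score_strategy": ["math", "statistics"],
}


def preload_for(test_name: str) -> dict[str, ModuleType]:
    """Import only the modules listed for this test and return them by name."""
    return {name: importlib.import_module(name) for name in TEST_MODULES[test_name]}


def timed_preload(test_name: str) -> tuple[dict[str, ModuleType], float]:
    """Preload a test's modules and report how long the imports took in seconds."""
    start = time.perf_counter()
    modules = preload_for(test_name)
    return modules, time.perf_counter() - start


if __name__ == "__main__":
    modules, elapsed = timed_preload("score_strategy")
    print(sorted(modules), f"{elapsed:.4f}s")
```

Timing the preload per test makes it easy to spot which dependency is responsible for most of the startup cost.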

2. Memory leaks are RSI killers. A small leak that doesn't matter in a 30-second one-shot task becomes critical when the agent runs for days. Our early versions leaked 2MB per RSI cycle. After 500 cycles, the agent was carrying about 1GB of leaked memory on top of its working set. We now take memory snapshots between cycles and hard-kill any instance that exceeds its baseline by more than 10%.
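The 10% rule reduces to a small policy check. In this sketch the function names and the 200MB baseline are assumptions for illustration; how the memory snapshot itself is taken is platform-specific and left out.

```python
from typing import Optional


def exceeds_baseline(baseline_mb: float, current_mb: float, tolerance: float = 0.10) -> bool:
    """True when current usage is more than `tolerance` above the baseline."""
    return current_mb > baseline_mb * (1.0 + tolerance)


def cycles_until_kill(
    baseline_mb: float,
    leak_per_cycle_mb: float,
    tolerance: float = 0.10,
    max_cycles: int = 10_000,
) -> Optional[int]:
    """Simulate a constant per-cycle leak; return the first cycle that trips the guard."""
    usage = baseline_mb
    for cycle in range(1, max_cycles + 1):
        usage += leak_per_cycle_mb  # snapshot taken after each cycle
        if exceeds_baseline(baseline_mb, usage, tolerance):
            return cycle
    return None  # the guard never tripped within max_cycles


if __name__ == "__main__":
    # Hypothetical 200MB baseline with the 2MB-per-cycle leak described above.
    print(cycles_until_kill(baseline_mb=200, leak_per_cycle_mb=2))  # 11
```

With these numbers the guard trips at cycle 11, long before the leak could grow to the 1GB seen after 500 unguarded cycles.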

3. The test harness needs to be lighter than the agent. If your test framework consumes more resources than the thing being tested, your measurements are contaminated. We stripped our RSI test runner down to pure metric collection - no logging, no tracing, no debugging output during measurement windows.
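A quiet measurement window can be approximated with a context manager that silences logging while the clock is running. This is a sketch under assumptions: `measurement_window` is a hypothetical name, and pausing the garbage collector is an extra step added here for illustration rather than something the setup above is stated to do.

```python
import contextlib
import gc
import logging
import time
from collections.abc import Iterator


@contextlib.contextmanager
def measurement_window() -> Iterator[dict[str, float]]:
    """Silence logging and pause the garbage collector while timing a block."""
    previous_disable = logging.root.manager.disable  # remember the prior setting
    gc_was_enabled = gc.isenabled()
    logging.disable(logging.CRITICAL)  # suppress every log call up to CRITICAL
    gc.disable()                       # assumption: avoid collector pauses mid-measurement
    result: dict[str, float] = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["elapsed_s"] = time.perf_counter() - start
        if gc_was_enabled:
            gc.enable()
        logging.disable(previous_disable)  # restore logging as it was


if __name__ == "__main__":
    with measurement_window() as metrics:
        total = sum(i * i for i in range(100_000))  # stand-in for the test workload
    print(metrics["elapsed_s"])
```

Everything is restored in the `finally` block, so a failing test cannot leave logging or the collector switched off for later cycles.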

The Architecture Tradeoff

There's a real tension between "lightweight for RSI" and "feature-rich for productivity." Claude Code's 1GB footprint buys you a rich development environment with deep context windows, sophisticated tool use, and extensive runtime capabilities. That's genuinely valuable for the work the agent does.

The question is whether you can separate the execution layer (which should be light and deterministic) from the development layer (which can be as heavy as it needs to be).

Our approach: the RSI engine is a separate module that drops into any agent. The agent itself can be as heavy as it wants. When an RSI cycle triggers, the engine extracts the relevant code, runs mutations in a minimal sandbox, and reports results back. The heavy agent never runs during measurement.

This separation means you don't have to choose between a powerful agent and efficient self-improvement. You can have the 1GB development environment AND the 80MB test runner. The RSI engine only cares about the test runner's footprint.
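The separation can be illustrated with a runner that executes a candidate in a fresh, minimal interpreter and hands back only its metrics. This is a hedged sketch, not the engine described above: `run_in_sandbox` is a hypothetical name, the candidate is assumed to print exactly one JSON object, and a plain subprocess gives process isolation only, not a security boundary.

```python
import json
import subprocess
import sys


def run_in_sandbox(candidate_source: str, timeout_s: float = 10.0) -> dict:
    """Run candidate code in a fresh interpreter and parse its JSON metrics."""
    proc = subprocess.run(
        # -I is Python's isolated mode: ignores environment variables and user site-packages.
        [sys.executable, "-I", "-c", candidate_source],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired on a runaway candidate
        check=True,         # raises subprocess.CalledProcessError on a crash
    )
    return json.loads(proc.stdout)


if __name__ == "__main__":
    candidate = 'import json; print(json.dumps({"score": 1.5}))'
    print(run_in_sandbox(candidate))  # {'score': 1.5}
```

The parent process, however heavy, only launches the child and parses one line of output, so its own footprint stays out of the measurement.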

What the HN Thread Got Right

The performance debate isn't wrong - it's incomplete. Memory footprint, startup time, and language choice all matter. They just matter differently depending on what the agent is doing.

For one-shot coding: optimize for capability. Use the heaviest, most feature-rich agent you can. The resource cost is paid once, for a single short task.

For continuous operation: optimize for reliability. The agent needs to run for days without degradation. Memory management and process isolation matter more than raw speed.

For RSI: optimize for determinism and test throughput. The agent needs to evaluate hundreds of mutations efficiently, with clean signals. Lightweight, predictable execution of test cycles is the bottleneck.

The teams building coding agents aren't wrong to use different languages and architectures. They're optimizing for different points on this spectrum. The mistake is assuming one point is correct for all use cases.

This article was written with AI assistance. All technical claims, code, and architectural decisions were validated by the author.
