Designing a Long-Running Autonomous Agent Harness

May 23, 2026 · 15 min read · Neo Project

Modern LLMs can reason, write code, search the web, control browsers, and execute shell commands. But orchestrating all of these capabilities into a reliable, long-running autonomous loop remains the hard problem. Most demos show a single prompt → tool-call cycle. Production harnesses must run for hours or days, adapt to context limits, recover from tool failures, and complete complex multi-step tasks with minimal human intervention.

This post distills our experience building Neo, an open-source Rust harness designed for exactly this problem. We cover: context discovery and budget management, CLI-first web search, browser control via CDP, shell integration, and the goal-oriented supervisor loop that ties everything together with minimal token waste.

Optimal Context Discovery
CLI-First Web Search
Browser Control
Shell Access
Goal-Oriented Harness & Minimal Toolset
The Supervisor Loop
Remaining Challenges

1. Optimal Context Discovery

The single biggest bottleneck in agentic systems is the context window. Every token you feed the model costs latency, money, and reasoning quality. Context discovery is the discipline of deciding what to include — and more importantly, what to exclude.

The Layered Context Model

We use a three-tier approach. The core layer is always present: the system prompt (agent identity, tool schemas, operating rules) plus a concise task summary from the previous checkpoint. The working layer changes every cycle: recent tool outputs, the last N messages of conversation, and a compressed diff of file changes. The archive layer is a sliding window of checkpoints — full conversation snapshots stored on disk, injected only on explicit demand or during supervisor re-engagement.
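As a rough illustration of the three tiers, the sketch below models context items tagged by layer and assembles them for one cycle. The type and function names (`Layer`, `ContextItem`, `assemble`) are hypothetical and are not taken from the Neo codebase.

```rust
/// The three tiers described above. Names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Layer {
    Core,
    Working,
    Archive,
}

/// One piece of context tagged with the layer it belongs to.
#[derive(Debug, Clone)]
pub struct ContextItem {
    pub layer: Layer,
    pub label: String,
    pub text: String,
}

/// Assemble the context for one cycle: core items first, then working
/// items, and archive items only when `include_archive` is true
/// (explicit demand or supervisor re-engagement).
pub fn assemble(items: &[ContextItem], include_archive: bool) -> Vec<&ContextItem> {
    let mut out: Vec<&ContextItem> = Vec::new();
    for layer in [Layer::Core, Layer::Working, Layer::Archive] {
        if layer == Layer::Archive && !include_archive {
            continue;
        }
        out.extend(items.iter().filter(|item| item.layer == layer));
    }
    out
}
```

Calling `assemble(&items, false)` yields only the core and working layers; passing `true` appends archived checkpoints at the end.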

Token Budgeting

Each agent cycle begins with a budget: max_tokens - current_usage - safety_margin. Candidate context is ordered by expected information density: file reads and grep results go first; verbose command outputs and browser screenshots go last. If the budget is exceeded, the system truncates the lowest-utility content (typically repeated tool call histories) rather than cutting the task prompt.
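The budget formula and the drop-lowest-utility rule can be sketched as follows. The `utility` score and the `pinned` flag are assumptions made for illustration; the post does not say how Neo scores content.

```rust
/// One candidate piece of context. `utility` is a hypothetical score
/// (higher means more information per token).
#[derive(Debug, Clone)]
pub struct Candidate {
    pub name: String,
    pub tokens: usize,
    pub utility: u32,
    /// Pinned items (for example the task prompt) are never dropped.
    pub pinned: bool,
}

/// Budget for this cycle: max_tokens - current_usage - safety_margin,
/// clamped at zero so an overdrawn budget cannot underflow.
pub fn cycle_budget(max_tokens: usize, current_usage: usize, safety_margin: usize) -> usize {
    max_tokens
        .saturating_sub(current_usage)
        .saturating_sub(safety_margin)
}

/// Keep pinned items unconditionally, then add the rest in descending
/// utility order, skipping any item that no longer fits in the budget.
pub fn fit_to_budget(mut items: Vec<Candidate>, budget: usize) -> Vec<Candidate> {
    // Pinned first, then highest utility first (the sort is stable).
    items.sort_by(|a, b| b.pinned.cmp(&a.pinned).then(b.utility.cmp(&a.utility)));
    let mut used = 0usize;
    let mut kept = Vec::new();
    for item in items {
        if item.pinned || used + item.tokens <= budget {
            used += item.tokens;
            kept.push(item);
        }
    }
    kept
}
```

With a budget of 200 tokens, a pinned 50-token task prompt and a 100-token grep result survive, while a 150-token repeated tool history is dropped.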

Context Window Projection

Before sending a prompt, we estimate whether the model's context window will overflow. This is surprisingly tricky because models count tokens differently. Our solution: maintain a lightweight token counter per provider (cl100k_base for OpenAI, a rough character heuristic for others), and keep a 10% margin below the declared limit. If overflow is predicted, we checkpoint — serialize the full state to disk, start a fresh cycle with a compressed summary, and let the new cycle continue where the last left off.
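A minimal sketch of the overflow check, assuming a character heuristic of roughly four characters per token. That ratio and the `reserved_output` parameter are assumptions for illustration; a real counter would use the provider's tokenizer where one is available.

```rust
/// Rough token estimate for providers without a known tokenizer.
/// Assumes about 4 characters per token and rounds up.
pub fn estimate_tokens(text: &str) -> usize {
    let chars = text.chars().count();
    (chars + 3) / 4
}

/// True if the prompt plus the space reserved for the reply is projected
/// to exceed 90% of the declared limit (the 10% margin from the text).
pub fn will_overflow(prompt_tokens: usize, reserved_output: usize, declared_limit: usize) -> bool {
    let usable = declared_limit - declared_limit / 10;
    prompt_tokens + reserved_output > usable
}
```

When `will_overflow` returns true, the harness would checkpoint and start a fresh cycle instead of sending the prompt.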

Semantic Conflict Resolution

When state is reloaded from checkpoint, different parts of the context may contradict each other — a file was modified after it was read, a tool result describes a different directory state, etc. The harness runs a lightweight conflict scan before injection: timestamps on file reads vs. writes, tool call order vs. result order, and checksum verification for any file the agent has touched. Conflicts are surfaced to the model as "state divergence" warnings rather than silently patched.
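The timestamp and checksum parts of the scan could look roughly like this. Logical timestamps, the `FileRecord` shape, and the use of the standard library's `DefaultHasher` are all assumptions; a production harness would more likely use a stable cryptographic hash such as SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Illustrative checksum. DefaultHasher is not stable across Rust
/// versions, so it is only suitable for a sketch like this one.
pub fn checksum(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// What the checkpoint remembers about a file the agent touched.
#[derive(Debug, Clone)]
pub struct FileRecord {
    pub path: String,
    pub last_read: u64,          // logical time of the agent's last read
    pub last_write: Option<u64>, // logical time of the last known write
    pub checksum_at_read: u64,
}

/// Compare checkpointed records with the current on-disk checksums and
/// return "state divergence" warnings instead of silently patching.
pub fn scan(records: &[FileRecord], current: &HashMap<String, u64>) -> Vec<String> {
    let mut warnings = Vec::new();
    for r in records {
        if let Some(written) = r.last_write {
            if written > r.last_read {
                warnings.push(format!(
                    "state divergence: {} was written after it was last read",
                    r.path
                ));
                continue;
            }
        }
        match current.get(&r.path) {
            Some(sum) if *sum != r.checksum_at_read => warnings.push(format!(
                "state divergence: {} changed on disk since it was read",
                r.path
            )),
            None => warnings.push(format!("state divergence: {} no longer exists", r.path)),
            _ => {}
        }
    }
    warnings
}
```

The returned strings would be injected into the context as warnings, leaving the decision about how to react to the model.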

Key insight: Context discovery is not a one-time prompt optimization. It's a continuously running garbage collector that must execute between every agent cycle, pruning what the model doesn't need and compacting what it does.

2. CLI-First Web Search

Web research is the most common long-running task for autonomous agents. The challenge: most "web search" integrations in agent frameworks route through a single search_web function that hides all the complexity. That works for demos but fails in production, where you need multi-page crawling, structured data extraction, or domain-specific searches.

CLI-First Architecture

Rather than wrapping a search API inside a Rust function, we expose search as a CLI tool that the agent invokes via shell. The tool is a thin wrapper around a configurable search provider (DuckDuckGo by default, with fallback to a local index). This has three advantages:

Composability: The agent can pipe search results through grep, jq, or a summarizer script without additional tool definitions.
Auditability: Every search call is logged as a shell command with its full output, enabling replay and debugging.
Extensibility: Adding a new search provider is a config change, not a code change. The agent selects a provider with flags such as --provider google --api-key $KEY "query".

Multi-Turn Search Results

A single search query rarely answers a complex question. The agent issues an initial search, reads the top results via URL fetching, identifies gaps, and issues follow-up searches. Each search result includes: title, snippet, URL, and the fetchable content from that URL (up to a configurable limit). The context budget determines how many URLs are actually fetched — the agent sees snippet-only results if the budget is tight.
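The budget-driven choice between full fetches and snippet-only results might be sketched as below. `SearchHit`, `urls_to_fetch`, and the idea of a fixed per-page token limit are illustrative assumptions, not Neo's actual interface.

```rust
/// One search result as described above (fetched content omitted).
#[derive(Debug, Clone)]
pub struct SearchHit {
    pub title: String,
    pub snippet: String,
    pub url: String,
}

/// Decide which of the top-ranked hits get their page fetched.
/// Each fetched page is assumed to cost up to `per_page_limit` tokens;
/// hits beyond the budget stay snippet-only.
pub fn urls_to_fetch(hits: &[SearchHit], budget_tokens: usize, per_page_limit: usize) -> Vec<&str> {
    if per_page_limit == 0 {
        return Vec::new();
    }
    let affordable = budget_tokens / per_page_limit;
    hits.iter().take(affordable).map(|h| h.url.as_str()).collect()
}
```

With a 2,500-token budget and a 1,000-token page limit, only the top two URLs are fetched; with 500 tokens the agent sees snippets only.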

Search Result Extraction

Raw HTML is terrible context. We convert all fetched pages to markdown before injection, stripping navigation, ads, and boilerplate. The conversion is destructive but intentional: the model gets clean text, not angle brackets. For pages where structure matters (tables, code blocks), we preserve semantic formatting in the markdown output.

3. Browser Control

Web search covers what is on the page; browser control covers what happens when you interact with it. For agentic tasks, this means: form filling, JavaScript-rendered content, multi-step workflows (login → search → extract), and visual verification via screenshots.

CDP-Driven Architecture

We drive the browser directly through the Chrome DevTools Protocol (CDP) rather than through a higher-level automation library such as Playwright or Puppeteer. CDP is a minimal wire protocol: JSON messages over a WebSocket connection to a running Chrome instance. The harness exposes high-level actions as named tools: browser_navigate, browser_click, browser_type, browser_screenshot, browser_evaluate (JavaScript).

The key design choice: every browser action is synchronous and deterministic from the agent's perspective. The harness waits for the page to settle (network idle + no pending mutations) before returning control. This eliminates the "wait" problem where agents blindly call sleep() between actions.
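One way to express the "page has settled" condition is a pure function over activity counters that the harness would derive from CDP network and DOM events. The field names, the quiet-period threshold, and the timeout are assumptions for illustration.

```rust
/// Snapshot of page activity (hypothetical shape).
pub struct PageActivity {
    pub in_flight_requests: u32,
    pub ms_since_last_network_event: u64,
    pub ms_since_last_dom_mutation: u64,
}

/// How long the page must stay quiet, and when to give up waiting.
pub struct SettlePolicy {
    pub quiet_ms: u64,
    pub max_wait_ms: u64,
}

#[derive(Debug, PartialEq)]
pub enum SettleState {
    Settled,
    Waiting,
    TimedOut,
}

/// Settled means network idle and no recent DOM mutations. The timeout
/// keeps the tool call bounded when a page never goes quiet.
pub fn settle_state(activity: &PageActivity, policy: &SettlePolicy, waited_ms: u64) -> SettleState {
    let network_idle = activity.in_flight_requests == 0
        && activity.ms_since_last_network_event >= policy.quiet_ms;
    let dom_quiet = activity.ms_since_last_dom_mutation >= policy.quiet_ms;
    if network_idle && dom_quiet {
        SettleState::Settled
    } else if waited_ms >= policy.max_wait_ms {
        SettleState::TimedOut
    } else {
        SettleState::Waiting
    }
}
```

The harness would poll this after each action and return control to the agent only on `Settled` or `TimedOut`, so the agent never needs its own sleep calls.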

Visual Context via Screenshot Compression

Screenshots are the highest-bandwidth context channel — and the most expensive. A single 1920×1080 screenshot as a base64 PNG can consume 50,000+ tokens. Our approach:

Capture at a fixed viewport (1280×720) to keep images consistent.
Compress to JPEG at 60% quality before base64 encoding.
Allow the agent to request element-specific screenshots (e.g., crop to a specific CSS selector) which are smaller and higher-value.
Cache screenshots server-side; the model gets a data-ref token unless it explicitly asks for the image data.
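The data-ref idea from the last item can be sketched as a small in-memory cache that hands the model a short reference instead of the image bytes. The `shot-N` reference format and the type names are hypothetical; the helper shows why base64 payloads are costly (every 3 bytes become 4 characters).

```rust
use std::collections::HashMap;

/// Holds compressed screenshots and hands out short references.
pub struct ScreenshotCache {
    next_id: u64,
    images: HashMap<String, Vec<u8>>,
}

impl ScreenshotCache {
    pub fn new() -> Self {
        ScreenshotCache { next_id: 1, images: HashMap::new() }
    }

    /// Store already-compressed image bytes and return a data-ref token.
    pub fn store(&mut self, jpeg_bytes: Vec<u8>) -> String {
        let data_ref = format!("shot-{}", self.next_id);
        self.next_id += 1;
        self.images.insert(data_ref.clone(), jpeg_bytes);
        data_ref
    }

    /// Resolve a data-ref only when the model explicitly asks for pixels.
    pub fn fetch(&self, data_ref: &str) -> Option<&[u8]> {
        self.images.get(data_ref).map(|v| v.as_slice())
    }
}

/// Length of the padded base64 encoding of `n` bytes.
pub fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}
```

The model normally sees only the token returned by `store`; the bytes are injected into context only after an explicit request that resolves through `fetch`.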

Reconnection and State Recovery

Browser sessions are long-lived. Chrome crashes, network drops, and OOM kills happen. The harness reconnects transparently: on detecting a broken WebSocket, it spawns a new Chrome instance, restores session cookies from disk, and replays a stabilization script to return the browser to its prior state. The agent sees a "connection recovered" message but does not lose its context.

4. Shell Access

Shell access is the most powerful and most dangerous tool in the harness. It gives the agent unbounded capability — and unbounded surface area for mistakes. The design must maximize utility while minimizing blast radius.

Isolated Execution Environment

Every shell command runs inside a persistent session (a dedicated container or worktree) with strict resource limits. The agent sees a clean filesystem, a non-root user, no network access to internal services, and a time limit on each command. A command that exceeds its time limit is killed; the agent receives the partial output along with a notice that the command was terminated.

Output Truncation and Streaming

Shell command output can be megabytes. We use a three-tier output model:

Inline (≤2 KB): Returned directly in the tool response.
Paged (2 KB – 100 KB): Returned with a ViewMore link. The agent can request additional pages via a pagination tool.
Archived (>100 KB): Written to disk; the agent gets a file:// path and must read chunks via the file read tool.
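The three tiers map naturally onto a small classifier. The thresholds come from the list above; the page size parameter and the names are assumptions for illustration.

```rust
/// Thresholds from the three-tier model above.
pub const INLINE_MAX: usize = 2 * 1024;
pub const PAGED_MAX: usize = 100 * 1024;

#[derive(Debug, PartialEq)]
pub enum OutputTier {
    /// Returned directly in the tool response.
    Inline,
    /// Returned one page at a time through a pagination tool.
    Paged { pages: usize },
    /// Written to disk and read back in chunks.
    Archived,
}

/// Classify command output by its size in bytes. `page_size` is an
/// assumed setting; a value of 0 is treated as 1 to avoid dividing by zero.
pub fn classify_output(len: usize, page_size: usize) -> OutputTier {
    let page_size = page_size.max(1);
    if len <= INLINE_MAX {
        OutputTier::Inline
    } else if len <= PAGED_MAX {
        OutputTier::Paged { pages: (len + page_size - 1) / page_size }
    } else {
        OutputTier::Archived
    }
}
```

For example, 2,049 bytes with a 2 KB page size falls into the paged tier with two pages, while anything over 100 KB is archived.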

Dangerous Command Detection

Rather than a fixed blocklist (which agents learn to bypass), we use a risk scoring system that considers: (1) the command being run, (2) the arguments, (3) the working directory, and (4) whether the task prompt permits destructive operations. High-risk commands (e.g., rm -rf /, DROP TABLE, wallet send) require explicit confirmation from the human operator. Medium-risk commands (e.g., git push, docker rm) trigger a warning that the agent sees but can override.
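A toy version of the four-factor score is shown below. Every weight, threshold, program list, and the `/workspace` sandbox path is invented for illustration; the post does not publish Neo's actual scoring rules.

```rust
#[derive(Debug, PartialEq)]
pub enum Risk {
    Low,
    Medium,
    High,
}

/// The four inputs named in the text.
pub struct CommandContext<'a> {
    pub program: &'a str,
    pub args: &'a [&'a str],
    pub working_dir: &'a str,
    pub destructive_allowed: bool,
}

/// Hypothetical additive score; higher means riskier.
pub fn risk_score(ctx: &CommandContext<'_>) -> u32 {
    let mut score: u32 = 0;
    // (1) The command being run.
    score += match ctx.program {
        "rm" | "dd" | "mkfs" => 40,
        "git" | "docker" => 20,
        _ => 0,
    };
    // (2) The arguments.
    if ctx.args.iter().any(|a| matches!(*a, "-rf" | "--force" | "push" | "rm")) {
        score += 20;
    }
    if ctx.args.iter().any(|a| *a == "/") {
        score += 40;
    }
    // (3) The working directory: outside the assumed sandbox root.
    if !ctx.working_dir.starts_with("/workspace") {
        score += 20;
    }
    // (4) Whether the task prompt permits destructive operations.
    if ctx.destructive_allowed {
        score = score.saturating_sub(20);
    }
    score
}

/// Map a score to the tiers from the text: High needs operator
/// confirmation, Medium produces a warning the agent can override.
pub fn risk_level(score: u32) -> Risk {
    if score >= 70 {
        Risk::High
    } else if score >= 40 {
        Risk::Medium
    } else {
        Risk::Low
    }
}
```

Under these made-up weights, `rm -rf /` scores 100 (High) and `git push` scores 40 (Medium), which matches the examples in the paragraph above.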

5. Goal-Oriented Harness & Minimal Toolset

The most important architectural decision: every tool must justify its existence by enabling a goal that cannot be achieved with existing tools alone. Tool proliferation is the enemy of reliable agents. Each additional tool increases the decision surface, complicates the system prompt, and creates edge cases.

The Minimal Tool Philosophy

Neo exposes fewer than 15 core tools. The principle: prefer composition over specialization. A generic bash tool removes the need for separate read_file, write_file, list_dir, grep, diff, and install_package tools, because all of these are shell commands the agent already knows. A single web_search tool likewise replaces search_docs, lookup_api, find_tutorial, and similar variants.

The exception is browser control, which cannot be composed from shell tools (you cannot curl a JavaScript SPA). Browser actions are a separate category with their own tools. Even here, we keep it minimal: navigate, click, type, screenshot, evaluate JS.

Tool Schema Contracts

Every tool has an explicit contract that the harness enforces:

Input schema: Strict JSON Schema with required/optional fields.
Output schema: The predictable shape of the result (e.g., always { success: bool, output: string, error?: string }).
Side effects: Declared up front — does this tool modify files? Send network requests? Require human approval?
Cost: Token cost estimate for the tool's output, so the budget manager can plan ahead.
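The four parts of a contract could be represented along these lines. This is a sketch with hypothetical type names; the required-field check is a stand-in for real JSON Schema validation, which would need a schema library.

```rust
/// Output shape from the text: { success, output, error? }.
#[derive(Debug, Clone, PartialEq)]
pub struct ToolResult {
    pub success: bool,
    pub output: String,
    pub error: Option<String>,
}

/// Side effects declared up front.
#[derive(Debug, Clone, Default)]
pub struct SideEffects {
    pub modifies_files: bool,
    pub network: bool,
    pub needs_human_approval: bool,
}

/// A tool's contract as the harness might store it.
pub struct ToolContract {
    pub name: &'static str,
    pub required_fields: &'static [&'static str],
    pub side_effects: SideEffects,
    /// Estimated output size, so the budget manager can plan ahead.
    pub est_output_tokens: usize,
}

/// Simplified input check: every required field must be present.
pub fn validate_input(contract: &ToolContract, provided: &[&str]) -> Result<(), String> {
    for field in contract.required_fields {
        if !provided.contains(field) {
            return Err(format!("{}: missing required field `{}`", contract.name, field));
        }
    }
    Ok(())
}
```

Rejecting a malformed call before execution lets the harness return a uniform `ToolResult` with `success: false` instead of a tool-specific failure.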

6. The Supervisor Loop

The supervisor loop (named Ralph in Neo) is the top-level orchestrator that decides when to invoke the agent and what to do with its output. It is not an LLM call — it's a deterministic state machine running outside the context window.

Cycle Lifecycle

Check: Verify agent state is consistent (checkpoints valid, tools responsive, context budget adequate).
Build: Assemble the context from the layered model (core + working + archive if needed). Run conflict resolution.
Invoke: Call the LLM with the assembled context and tool schemas. Apply the token budget.
Dispatch: Route the model's response — if it calls a tool, execute it and feed the result back as a new message. If it produces a final answer, save it.
Checkpoint: Serialize the full conversation state to disk. Store the task summary for the next cycle.
Decide: Should we continue (task not done), checkpoint-rotate (context window full), or escalate (stuck in a loop, too many errors)?
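The six steps form a fixed cycle, which is why the supervisor can be a plain state machine. The sketch below encodes the phase order and a possible Decide rule. The thresholds (two stalls, five consecutive errors, 90% context usage) are assumptions, and `Done` is added here only to give the sketch a terminal outcome.

```rust
/// The six phases of one supervisor cycle, in order.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Phase {
    Check,
    Build,
    Invoke,
    Dispatch,
    Checkpoint,
    Decide,
}

/// Outcomes of the Decide phase.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Decision {
    Continue,
    Rotate,
    Escalate,
    Done,
}

/// Facts the supervisor has gathered by the end of a cycle.
pub struct CycleStats {
    pub task_done: bool,
    pub context_used: usize,
    pub context_limit: usize,
    pub consecutive_errors: u32,
    pub stalls: u32,
}

/// Deterministic phase order; Decide loops back to Check.
pub fn next_phase(phase: Phase) -> Phase {
    match phase {
        Phase::Check => Phase::Build,
        Phase::Build => Phase::Invoke,
        Phase::Invoke => Phase::Dispatch,
        Phase::Dispatch => Phase::Checkpoint,
        Phase::Checkpoint => Phase::Decide,
        Phase::Decide => Phase::Check,
    }
}

/// Hypothetical decision rule with assumed thresholds.
pub fn decide(stats: &CycleStats) -> Decision {
    if stats.task_done {
        Decision::Done
    } else if stats.stalls >= 2 || stats.consecutive_errors >= 5 {
        Decision::Escalate
    } else if stats.context_used * 10 >= stats.context_limit * 9 {
        Decision::Rotate
    } else {
        Decision::Continue
    }
}
```

Because both functions are pure, the supervisor's behavior can be unit-tested without any model calls.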

Checkpoint Rotation

When the context window approaches its limit, the supervisor does not truncate — it rotates. The full conversation (including all tool outputs) is saved to a numbered checkpoint. A new cycle starts with just the system prompt + task summary + "You have been working on this task. Here's what you've done so far: [summary]." The agent continues as if it just picked up the work — it can read previous checkpoints via a read_checkpoint tool if needed.
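Rotation can be pictured as a function that archives the whole transcript and returns a much smaller starting context. The struct and function names are hypothetical, and writing the checkpoint to disk is left out.

```rust
/// A numbered snapshot of the full conversation.
pub struct Checkpoint {
    pub id: u32,
    pub transcript: Vec<String>,
}

/// Result of a rotation: what was archived and what the next cycle sees.
pub struct Rotation {
    pub archived: Checkpoint,
    pub fresh_context: Vec<String>,
}

/// Archive the transcript untouched (nothing is truncated) and seed the
/// next cycle with only the system prompt and the task summary.
pub fn rotate(next_id: u32, system_prompt: &str, task_summary: &str, transcript: Vec<String>) -> Rotation {
    Rotation {
        archived: Checkpoint { id: next_id, transcript },
        fresh_context: vec![
            system_prompt.to_string(),
            format!(
                "You have been working on this task. Here's what you've done so far: {}",
                task_summary
            ),
        ],
    }
}
```

The archived checkpoint keeps every tool output, so a later read of that checkpoint can recover details the summary left out.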

Loop Detection and Escalation

The supervisor tracks a hash of each agent response to catch exact repeats, and also checks for semantically similar responses. If the same (or a near-identical) response appears more than twice, it flags a stall. The first stall restarts the cycle with a modified system prompt: "You appear to be repeating yourself. Try a different approach." The second stall triggers an escalation: the supervisor pauses the agent and asks the human operator for guidance.
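A minimal sketch of the hash-based part of stall detection follows. It only catches repeats that are identical after whitespace and case normalization; semantic similarity would need a separate comparison that is not shown. The names and the reset-after-stall behavior are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq)]
pub enum StallAction {
    /// No stall detected; carry on.
    Proceed,
    /// First stall: restart with a "try a different approach" prompt.
    Nudge,
    /// Second stall: pause and ask the human operator.
    Escalate,
}

#[derive(Default)]
pub struct StallDetector {
    seen: HashMap<u64, u32>,
    stalls: u32,
}

impl StallDetector {
    /// Record one agent response and report what the supervisor should do.
    pub fn observe(&mut self, response: &str) -> StallAction {
        // Normalize so trivial whitespace or case changes hash the same.
        let normalized = response
            .split_whitespace()
            .collect::<Vec<_>>()
            .join(" ")
            .to_lowercase();
        let mut hasher = DefaultHasher::new();
        normalized.hash(&mut hasher);
        let count = self.seen.entry(hasher.finish()).or_insert(0);
        *count += 1;
        if *count <= 2 {
            return StallAction::Proceed;
        }
        // Seen more than twice: count a stall and reset this counter so
        // the next stall needs fresh repeats.
        *count = 0;
        self.stalls += 1;
        if self.stalls == 1 {
            StallAction::Nudge
        } else {
            StallAction::Escalate
        }
    }
}
```

The third identical response produces `Nudge`; if the agent then repeats itself three more times, the detector returns `Escalate`.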

7. Remaining Challenges

The harness works well for most software engineering tasks, but several problems remain unsolved:

Context Composition is Still Manual

The layered context model works, but the boundaries between layers are hand-tuned per task. We'd like a system that learns which context is important by observing which parts of the context the model actually references in its responses.

Tool Failure Recovery

When a tool call fails (network error, timeout, bad parameters), the current approach is to retry with exponential backoff. But some failures require re-planning — the agent should choose a different implementation strategy rather than retrying the same call. Distinguishing transient failures from strategy failures is an open problem.
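The retry path can be sketched with capped exponential backoff. The failure classification below is deliberately naive (bad parameters trigger re-planning, everything else is retried up to a limit) and is an assumption, since the text notes that telling the two cases apart is still unsolved. The base delay and cap are also illustrative.

```rust
use std::time::Duration;

/// Failure kinds named in the text.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Failure {
    NetworkError,
    Timeout,
    BadParameters,
}

#[derive(Debug, PartialEq)]
pub enum Recovery {
    RetryAfter(Duration),
    Replan,
}

/// Delay before retry number `attempt` (0-based): base * 2^attempt,
/// capped at `cap`. Saturating arithmetic avoids overflow.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Naive policy: retrying bad parameters cannot help, so re-plan;
/// other failures are retried until `max_attempts` is reached.
pub fn recover(failure: Failure, attempt: u32, max_attempts: u32) -> Recovery {
    match failure {
        Failure::BadParameters => Recovery::Replan,
        _ if attempt >= max_attempts => Recovery::Replan,
        _ => Recovery::RetryAfter(backoff_delay(
            attempt,
            Duration::from_millis(500),
            Duration::from_secs(30),
        )),
    }
}
```

Returning `Replan` once retries are exhausted is one simple way to hand the problem back to the agent rather than looping on the same call.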

Multi-Agent Coordination

The harness supports multiple workers via git worktrees, but coordination between them is primitive (shared task queue, manual handoff). True multi-agent collaboration — where agents debate, review each other's code, and merge results — requires a shared protocol that we haven't built yet.

Deterministic Replay

Because LLM outputs are non-deterministic, replaying a checkpoint does not guarantee the same result. We can replay tool execution deterministically where the tools themselves are deterministic (same inputs → same outputs), but the agent's next decision depends on the model's sampled output. Checkpoint replay is therefore useful for debugging but not for exact reproduction.

The full source code for the Neo harness is available on GitHub. It's written in Rust, licensed under MIT, and we welcome contributions. The docsite at docs.epsilondelta.tech has detailed API documentation and deployment guides.