Design Decisions
Stack
- Language: Rust
- TUI Framework: Ratatui + Crossterm
- Async Runtime: Tokio
Architecture
- Channel boundary between TUI and core (fully decoupled)
- Module decomposition: `app`, `tui`, `core`, `provider`, `tools`, `sandbox`, `session`
- Headless mode: core without TUI, driven by script (enables benchmarking and CI)
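A minimal sketch of the channel boundary, assuming `std::sync::mpsc` as a dependency-free stand-in for Tokio channels. `UIEvent` and `UserAction` are named later in this document, but the variants shown here, and the `run_core` / `run_script` helpers, are hypothetical.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Hypothetical message types; the real variants are not specified here.
#[derive(Debug, PartialEq)]
enum UserAction {
    Submit(String),
    Quit,
}

#[derive(Debug, PartialEq)]
enum UIEvent {
    AssistantText(String),
    Finished,
}

// The core only sees channel endpoints, so it runs the same way whether
// a TUI or a script is on the other side.
fn run_core(actions: Receiver<UserAction>, events: Sender<UIEvent>) {
    for action in actions {
        match action {
            UserAction::Submit(text) => {
                // Stand-in for the real orchestration loop.
                let _ = events.send(UIEvent::AssistantText(format!("echo: {text}")));
            }
            UserAction::Quit => break,
        }
    }
    let _ = events.send(UIEvent::Finished);
}

// Headless driver: a script replaces the TUI and collects emitted events.
fn run_script(inputs: &[&str]) -> Vec<UIEvent> {
    let (action_tx, action_rx) = channel();
    let (event_tx, event_rx) = channel();
    let core = thread::spawn(move || run_core(action_rx, event_tx));

    for input in inputs {
        action_tx.send(UserAction::Submit(input.to_string())).unwrap();
    }
    action_tx.send(UserAction::Quit).unwrap();
    core.join().unwrap();

    // The sender was dropped when the core thread exited, so this terminates.
    event_rx.iter().collect()
}

fn main() {
    for event in run_script(&["hello"]) {
        println!("{event:?}");
    }
}
```

Because the core never references TUI types directly, the same `run_core` can back both the interactive and the headless entry points.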
Model Integration
- Claude-first, multi-model via `ModelProvider` trait
- Common `StreamEvent` internal representation across providers
- Prompt caching-aware message construction
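One possible shape for the provider abstraction, as a sketch only. The `StreamEvent` variants, the method signatures, and `ReplayProvider` are assumptions; a real implementation would likely expose an async stream rather than a boxed iterator.

```rust
// Hypothetical provider-neutral event; the actual variant set is not given above.
#[derive(Debug, Clone, PartialEq)]
enum StreamEvent {
    TextDelta(String),
    ToolCall { name: String, input_json: String },
    Usage { input_tokens: u32, output_tokens: u32 },
    Done,
}

// Synchronous stand-in for what would probably be an async stream under Tokio.
trait ModelProvider {
    fn name(&self) -> &str;
    fn stream(&self, prompt: &str) -> Box<dyn Iterator<Item = StreamEvent>>;
}

// A provider that replays fixed events, in the spirit of a mock provider.
struct ReplayProvider {
    events: Vec<StreamEvent>,
}

impl ModelProvider for ReplayProvider {
    fn name(&self) -> &str {
        "replay"
    }

    fn stream(&self, _prompt: &str) -> Box<dyn Iterator<Item = StreamEvent>> {
        Box::new(self.events.clone().into_iter())
    }
}

// Core code depends only on the trait and the common event type.
fn collect_text(provider: &dyn ModelProvider, prompt: &str) -> String {
    let mut out = String::new();
    for event in provider.stream(prompt) {
        match event {
            StreamEvent::TextDelta(chunk) => out.push_str(&chunk),
            StreamEvent::Done => break,
            _ => {} // tool calls and usage are ignored in this sketch
        }
    }
    out
}

fn main() {
    let provider = ReplayProvider {
        events: vec![
            StreamEvent::TextDelta("Hel".to_string()),
            StreamEvent::TextDelta("lo".to_string()),
            StreamEvent::Usage { input_tokens: 10, output_tokens: 2 },
            StreamEvent::Done,
        ],
    };
    println!("{}: {}", provider.name(), collect_text(&provider, "hi"));
}
```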
UI
- Agent view: Tree-based hierarchy (not flat tabs) for sub-agent inspection
- Modes: Normal, Insert, Command (`:` prefix from Normal mode)
- Activity modes: Plan and Execute are visually distinct activities in the TUI
- Streaming: Barebones styled text initially, full markdown rendering deferred
- Token usage: Per-turn display (between user inputs), cumulative in status bar
- Status bar: Mode indicator, current activity (Plan/Execute), token totals, network policy state
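The input modes can be sketched as a small state machine. Only the `:` transition from Normal to Command comes from the notes above; the `i`, `Esc`, and `Enter` bindings, and the `Key` type, are vim-style assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Normal,
    Insert,
    Command,
}

// Hypothetical key type; a real TUI would use the terminal library's key events.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Key {
    Char(char),
    Esc,
    Enter,
}

fn next_mode(mode: Mode, key: Key) -> Mode {
    match (mode, key) {
        // From the design notes: `:` in Normal mode opens Command mode.
        (Mode::Normal, Key::Char(':')) => Mode::Command,
        // Assumed bindings below.
        (Mode::Normal, Key::Char('i')) => Mode::Insert,
        (Mode::Insert, Key::Esc) => Mode::Normal,
        (Mode::Command, Key::Esc) | (Mode::Command, Key::Enter) => Mode::Normal,
        // Every other key leaves the mode unchanged.
        (current, _) => current,
    }
}

fn main() {
    let keys = [Key::Char(':'), Key::Enter, Key::Char('i'), Key::Char(':'), Key::Esc];
    let mut mode = Mode::Normal;
    for &key in keys.iter() {
        mode = next_mode(mode, key);
        println!("{key:?} -> {mode:?}");
    }
}
```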
Planning Mode
- Distinct activity from execution — planner agent produces a plan file, does not execute
- Plan file is structured markdown: steps with descriptions, files involved, acceptance criteria
- Plan is reviewable and editable before execution (`:edit-plan` opens `$EDITOR`)
- User explicitly approves plan before execution begins
- Executor agent receives the plan file + project context, not the planning conversation
- Plan-step progress tracked during execution (complete/in-progress/failed)
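A sketch of how approval gating and plan-step progress might be tracked in memory. The `Plan` and `PlanStep` types, the `Pending` status, and all method names are hypothetical; the notes above only name complete, in-progress, and failed.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum StepStatus {
    Pending, // assumed initial state, not named in the notes
    InProgress,
    Complete,
    Failed,
}

#[derive(Debug)]
struct PlanStep {
    description: String,
    status: StepStatus,
}

#[derive(Debug)]
struct Plan {
    approved: bool,
    steps: Vec<PlanStep>,
}

impl Plan {
    fn new(descriptions: &[&str]) -> Self {
        Plan {
            approved: false,
            steps: descriptions
                .iter()
                .map(|d| PlanStep {
                    description: d.to_string(),
                    status: StepStatus::Pending,
                })
                .collect(),
        }
    }

    // Execution refuses to start until the user has approved the plan.
    fn start_next(&mut self) -> Result<usize, String> {
        if !self.approved {
            return Err("plan not approved".to_string());
        }
        let index = self
            .steps
            .iter()
            .position(|s| s.status == StepStatus::Pending)
            .ok_or_else(|| "no pending steps".to_string())?;
        self.steps[index].status = StepStatus::InProgress;
        Ok(index)
    }

    fn finish(&mut self, index: usize, ok: bool) {
        self.steps[index].status = if ok {
            StepStatus::Complete
        } else {
            StepStatus::Failed
        };
    }

    fn summary(&self) -> String {
        let done = self
            .steps
            .iter()
            .filter(|s| s.status == StepStatus::Complete)
            .count();
        format!("{done}/{} steps complete", self.steps.len())
    }
}

fn main() {
    let mut plan = Plan::new(&["add parser", "wire up CLI"]);
    plan.approved = true;
    if let Ok(i) = plan.start_next() {
        println!("running: {}", plan.steps[i].description);
        plan.finish(i, true);
    }
    println!("{}", plan.summary());
}
```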
Sub-Agents
- Independent context windows with summary passed back to parent
- Fully autonomous once spawned
- Hard deny on unpermitted actions
- Plan executor is a specialized sub-agent where the plan replaces the summary
- Direct user interaction with sub-agents deferred
Tool System
- Built-in tool system with `Tool` trait
- Core tools: `read_file`, `write_file`, `edit_file`, `shell_exec`, `list_directory`, `search_files`
- Approval gates by risk level: auto-approve (reads), confirm (writes/shell), deny
- MCP not implemented but interface designed to allow future adapter
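A rough sketch of a `Tool` trait with risk-based approval gates and registry dispatch. The method signatures, `RiskLevel`, `Registry`, and the `approved` flag (standing in for the user's answer to a confirmation prompt) are assumptions, and the tool bodies are placeholders that perform no I/O.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum RiskLevel {
    AutoApprove, // reads
    Confirm,     // writes and shell
    Deny,
}

trait Tool {
    fn name(&self) -> &'static str;
    fn risk(&self) -> RiskLevel;
    fn run(&self, input: &str) -> Result<String, String>;
}

struct ReadFile;
struct ShellExec;

impl Tool for ReadFile {
    fn name(&self) -> &'static str { "read_file" }
    fn risk(&self) -> RiskLevel { RiskLevel::AutoApprove }
    fn run(&self, input: &str) -> Result<String, String> {
        Ok(format!("(would read {input})")) // placeholder, no real I/O
    }
}

impl Tool for ShellExec {
    fn name(&self) -> &'static str { "shell_exec" }
    fn risk(&self) -> RiskLevel { RiskLevel::Confirm }
    fn run(&self, input: &str) -> Result<String, String> {
        Ok(format!("(would run {input})")) // placeholder, no real I/O
    }
}

struct Registry {
    tools: HashMap<&'static str, Box<dyn Tool>>,
}

impl Registry {
    fn new() -> Self {
        Registry { tools: HashMap::new() }
    }

    fn register(&mut self, tool: Box<dyn Tool>) {
        self.tools.insert(tool.name(), tool);
    }

    // The gate is applied here, so individual tools cannot bypass it.
    fn dispatch(&self, name: &str, input: &str, approved: bool) -> Result<String, String> {
        let tool = self
            .tools
            .get(name)
            .ok_or_else(|| format!("unknown tool: {name}"))?;
        match tool.risk() {
            RiskLevel::AutoApprove => tool.run(input),
            RiskLevel::Confirm if approved => tool.run(input),
            RiskLevel::Confirm => Err(format!("{name}: user declined")),
            RiskLevel::Deny => Err(format!("{name}: denied by policy")),
        }
    }
}

fn main() {
    let mut registry = Registry::new();
    registry.register(Box::new(ReadFile));
    registry.register(Box::new(ShellExec));
    println!("{:?}", registry.dispatch("read_file", "Cargo.toml", false));
    println!("{:?}", registry.dispatch("shell_exec", "cargo test", false));
}
```

Keeping dispatch behind a trait object is also what would let a future MCP adapter register itself as just another `Tool`.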
Sandboxing
- Landlock (Linux kernel-level):
  - Read-only: system-wide (`/`)
  - Read-write: project directory, temp directory
  - Network: blocked by default, toggleable via `:net on/off`
- Graceful degradation on older kernels
- All tool execution goes through `Sandbox`; tools never touch the filesystem directly
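The userspace half of the write policy can be sketched as a path check in front of every write, assuming Unix-style paths. This does not apply any Landlock rules; the `Sandbox` fields, `check_write`, and `normalize` are hypothetical names, and the normalization is purely lexical, so it does not resolve symlinks.

```rust
use std::path::{Component, Path, PathBuf};

// Lexically resolve `.` and `..` without touching the filesystem.
// Symlinks are NOT resolved here; kernel-level enforcement covers that case.
fn normalize(path: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for part in path.components() {
        match part {
            Component::CurDir => {}
            Component::ParentDir => {
                out.pop(); // popping at the root is a no-op
            }
            other => out.push(other.as_os_str()),
        }
    }
    out
}

struct Sandbox {
    project_dir: PathBuf,
    temp_dir: PathBuf,
}

impl Sandbox {
    // Returns the resolved path if the write stays inside a read-write root.
    fn check_write(&self, requested: &Path) -> Result<PathBuf, String> {
        let joined = if requested.is_absolute() {
            requested.to_path_buf()
        } else {
            self.project_dir.join(requested)
        };
        let resolved = normalize(&joined);
        // `starts_with` compares whole components, so `/work/project-evil`
        // does not count as being inside `/work/project`.
        if resolved.starts_with(&self.project_dir) || resolved.starts_with(&self.temp_dir) {
            Ok(resolved)
        } else {
            Err(format!("write outside sandbox: {}", resolved.display()))
        }
    }
}

fn main() {
    let sandbox = Sandbox {
        project_dir: PathBuf::from("/work/project"),
        temp_dir: PathBuf::from("/tmp"),
    };
    for candidate in ["src/main.rs", "../../etc/passwd", "/tmp/scratch.txt"].iter() {
        println!("{candidate}: {:?}", sandbox.check_write(Path::new(candidate)));
    }
}
```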
Session Logging
- JSONL format, one event per line
- Events: user message, assistant message, tool call, tool result, sub-agent spawn/result, plan created, plan step status
- Tree-addressable via parent IDs (enables conversation branching later)
- Token usage stored per event
- Linear UX for now, branching deferred
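A sketch of a tree-addressable log record and parent-chain reconstruction. The field names, the numeric IDs, and the hand-rolled JSON line are assumptions made to keep the example dependency-free; a real implementation would more likely use a serialization crate.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Event {
    id: u64,
    parent_id: Option<u64>,
    kind: &'static str, // e.g. "user_message", "tool_call"
    tokens: u32,
}

impl Event {
    // One JSON object per line. `kind` is a fixed identifier here,
    // so no string escaping is attempted in this sketch.
    fn to_jsonl(&self) -> String {
        let parent = match self.parent_id {
            Some(p) => p.to_string(),
            None => "null".to_string(),
        };
        format!(
            "{{\"id\":{},\"parent_id\":{},\"kind\":\"{}\",\"tokens\":{}}}",
            self.id, parent, self.kind, self.tokens
        )
    }
}

// Walk parent IDs from a leaf back to the root, then reverse.
// Assumes the parent links are acyclic.
fn chain_to_root(events: &[Event], leaf: u64) -> Vec<u64> {
    let by_id: HashMap<u64, &Event> = events.iter().map(|e| (e.id, e)).collect();
    let mut chain = Vec::new();
    let mut cursor = Some(leaf);
    while let Some(id) = cursor {
        match by_id.get(&id) {
            Some(event) => {
                chain.push(id);
                cursor = event.parent_id;
            }
            None => break, // dangling parent reference: stop here
        }
    }
    chain.reverse();
    chain
}

fn main() {
    let events = vec![
        Event { id: 1, parent_id: None, kind: "user_message", tokens: 12 },
        Event { id: 2, parent_id: Some(1), kind: "assistant_message", tokens: 40 },
        Event { id: 3, parent_id: Some(2), kind: "tool_call", tokens: 8 },
    ];
    for event in &events {
        println!("{}", event.to_jsonl());
    }
    println!("{:?}", chain_to_root(&events, 3));
}
```

Two events sharing a parent already form a branch in this representation, which is why the linear UX can be extended later without a log format change.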
Testing Strategy
Unit Tests
- `provider`: SSE stream parsing from byte fixtures, message/tool serialization, `StreamEvent` variant correctness
- `tools`: Path canonicalization, traversal prevention, risk level classification, registry dispatch
- `sandbox`: Landlock policy construction, path validation logic (without applying kernel rules)
- `core`: Conversation tree operations (insert, query by parent, turn computation, token totals), orchestrator state machine transitions against mock `StreamEvent` sequences
- `session`: JSONL serialization roundtrips, parent ID chain reconstruction
- `tui`: Widget rendering via Ratatui `TestBackend`
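As an illustration of the fixture-driven `provider` tests, here is a simplified whole-buffer SSE parser. `SseFrame` and `parse_sse` are hypothetical names, the event names in the fixture are made up, and only the `event` and `data` fields are handled; a production parser would work incrementally on partial chunks.

```rust
#[derive(Debug, PartialEq)]
struct SseFrame {
    event: String, // empty when the frame carries no `event:` field
    data: String,
}

fn parse_sse(bytes: &[u8]) -> Vec<SseFrame> {
    let text = String::from_utf8_lossy(bytes);
    let mut frames = Vec::new();
    let mut event = String::new();
    let mut data: Vec<&str> = Vec::new();

    for line in text.lines() {
        if line.is_empty() {
            // A blank line dispatches the frame, if it has any data.
            if !data.is_empty() {
                frames.push(SseFrame {
                    event: std::mem::take(&mut event),
                    data: data.join("\n"),
                });
                data.clear();
            } else {
                event.clear();
            }
            continue;
        }
        if line.starts_with(':') {
            continue; // comment line, often used as a keep-alive
        }
        // Split on the first colon and drop one optional leading space.
        let (field, value) = match line.split_once(':') {
            Some((f, v)) => (f, v.strip_prefix(' ').unwrap_or(v)),
            None => (line, ""),
        };
        match field {
            "event" => event = value.to_string(),
            "data" => data.push(value),
            _ => {} // `id` and `retry` are ignored in this sketch
        }
    }
    frames
}

fn main() {
    let fixture = b"event: delta\ndata: {\"text\":\"Hi\"}\n\n: keep-alive\n\nevent: stop\ndata: {}\n\n";
    for frame in parse_sse(fixture) {
        println!("{frame:?}");
    }
}
```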
Integration Tests — Component Boundaries
- Core ↔ Provider: Mock `ModelProvider` replaying recorded API sessions (full SSE streams with tool use). Tests the complete orchestration loop deterministically without network.
- Core ↔ TUI (channel boundary): Orchestrator with mock provider connected to channels. Assert correct `UIEvent` sequence, inject `UserAction` messages, verify approval/denial flow.
- Tools ↔ Sandbox: Real file operations and shell commands in temp directories. Verify write confinement, path traversal rejection, network blocking. Skip Landlock-specific tests on older kernels in CI.
Integration Tests — End to End
- Recorded session replay: Capture real Claude API HTTP request/response pairs, replay deterministically. Exercises full stack (core + channel + mock TUI) without cost or network dependency. Primary E2E test strategy.
- Live API tests: Small suite behind feature flag / env var. Verifies real API integration. Run manually before releases, not in CI.
Benchmarking — SWE-bench
- Target: SWE-bench Verified (500 curated problems) as primary benchmark
- Secondary: SWE-bench Pro for testing planning mode on longer-horizon tasks
- Approach: Headless mode (core without TUI) produces unified diff patches, evaluated via SWE-bench Docker harness
- Baseline: mini-swe-agent (~100 lines Python, >74% on Verified) as calibration: if we score significantly below it with the same model, the issue is our scaffolding
- Cadence: Milestone checks, not continuous CI (too expensive/slow)
- Requirements: x86_64, 120GB+ storage, 16GB RAM, 8 CPU cores
Test Sequencing
- Phase 1: Unit tests for SSE parser, event types, message serialization
- Phase 4: Recorded session replay infrastructure (core loop complex enough to warrant it)
- Phase 6-7: Headless mode + first SWE-bench Verified run
Configuration (Deferred)
- Single-user, hardcoded defaults for now
- Designed for later: global config, per-project `.agent.toml`, configurable keybindings
Deferred Features
- Conversation branching (tree structure in log, linear UX for now)
- Direct sub-agent interaction
- MCP adapter
- Full markdown/syntax-highlighted rendering
- Session log viewer
- Per-project configuration
- Structured plan editor in TUI (use `$EDITOR` for now)