# Design Decisions

## Stack

- **Language:** Rust
- **TUI Framework:** Ratatui + Crossterm
- **Async Runtime:** Tokio

## Architecture

- Channel boundary between TUI and core (fully decoupled)
- Module decomposition: `app`, `tui`, `core`, `provider`, `tools`, `sandbox`, `session`
- Headless mode: core without TUI, driven by script (enables benchmarking and CI)

## Model Integration

- Claude-first, multi-model via `ModelProvider` trait
- Common `StreamEvent` internal representation across providers
- Prompt caching-aware message construction

## UI

- **Agent view:** Tree-based hierarchy (not flat tabs) for sub-agent inspection
- **Modes:** Normal, Insert, Command (`:` prefix from Normal mode)
- **Activity modes:** Plan and Execute are visually distinct activities in the TUI
- **Streaming:** Barebones styled text initially, full markdown rendering deferred
- **Token usage:** Per-turn display (between user inputs), cumulative in status bar
- **Status bar:** Mode indicator, current activity (Plan/Execute), token totals, network policy state

## Planning Mode

- Distinct activity from execution: the planner agent produces a plan file and does not execute
- Plan file is structured markdown: steps with descriptions, files involved, acceptance criteria
- Plan is reviewable and editable before execution (`:edit-plan` opens `$EDITOR`)
- User explicitly approves plan before execution begins
- Executor agent receives the plan file + project context, not the planning conversation
- Plan-step progress tracked during execution (complete/in-progress/failed)

## Sub-Agents

- Independent context windows with summary passed back to parent
- Fully autonomous once spawned
- Hard deny on unpermitted actions
- Plan executor is a specialized sub-agent where the plan replaces the summary
- Direct user interaction with sub-agents deferred

## Tool System

- Built-in tool system with `Tool` trait
- Core tools: `read_file`, `write_file`, `edit_file`, `shell_exec`, `list_directory`, `search_files`
- Approval gates by risk level: auto-approve (reads), confirm (writes/shell), deny
- MCP not implemented but interface designed to allow future adapter

## Sandboxing

- **Landlock** (Linux kernel-level):
  - Read-only: system-wide (`/`)
  - Read-write: project directory, temp directory
  - Network: blocked by default, toggleable via `:net on/off`
  - Graceful degradation on older kernels
- All tool execution goes through `Sandbox`; tools never touch the filesystem directly

## Session Logging

- JSONL format, one event per line
- Events: user message, assistant message, tool call, tool result, sub-agent spawn/result, plan created, plan step status
- Tree-addressable via parent IDs (enables conversation branching later)
- Token usage stored per event
- Linear UX for now, branching deferred

## Testing Strategy

### Unit Tests

- **`provider`:** SSE stream parsing from byte fixtures, message/tool serialization, `StreamEvent` variant correctness
- **`tools`:** Path canonicalization, traversal prevention, risk level classification, registry dispatch
- **`sandbox`:** Landlock policy construction, path validation logic (without applying kernel rules)
- **`core`:** Conversation tree operations (insert, query by parent, turn computation, token totals), orchestrator state machine transitions against mock `StreamEvent` sequences
- **`session`:** JSONL serialization roundtrips, parent ID chain reconstruction
- **`tui`:** Widget rendering via Ratatui `TestBackend`

### Integration Tests: Component Boundaries

- **Core ↔ Provider:** Mock `ModelProvider` replaying recorded API sessions (full SSE streams with tool use). Tests the complete orchestration loop deterministically without network.
- **Core ↔ TUI (channel boundary):** Orchestrator with mock provider connected to channels. Assert correct `UIEvent` sequence, inject `UserAction` messages, verify approval/denial flow.
- **Tools ↔ Sandbox:** Real file operations and shell commands in temp directories. Verify write confinement, path traversal rejection, network blocking. Skip Landlock-specific tests on older kernels in CI.

### Integration Tests: End to End

- **Recorded session replay:** Capture real Claude API HTTP request/response pairs, replay deterministically. Exercises full stack (core + channel + mock TUI) without cost or network dependency. Primary E2E test strategy.
- **Live API tests:** Small suite behind feature flag / env var. Verifies real API integration. Run manually before releases, not in CI.

### Benchmarking: SWE-bench

- **Target:** SWE-bench Verified (500 curated problems) as primary benchmark
- **Secondary:** SWE-bench Pro for testing planning mode on longer-horizon tasks
- **Approach:** Headless mode (core without TUI) produces unified diff patches, evaluated via SWE-bench Docker harness
- **Baseline:** mini-swe-agent (~100 lines Python, >74% on Verified) as calibration; if we score significantly below it with the same model, the issue is scaffolding
- **Cadence:** Milestone checks, not continuous CI (too expensive/slow)
- **Requirements:** x86_64, 120GB+ storage, 16GB RAM, 8 CPU cores

### Test Sequencing

- Phase 1: Unit tests for SSE parser, event types, message serialization
- Phase 4: Recorded session replay infrastructure (core loop complex enough to warrant it)
- Phases 6-7: Headless mode + first SWE-bench Verified run

## Configuration (Deferred)

- Single-user, hardcoded defaults for now
- Designed for later: global config, per-project `.agent.toml`, configurable keybindings

## Deferred Features

- Conversation branching (tree structure in log, linear UX for now)
- Direct sub-agent interaction
- MCP adapter
- Full markdown/syntax-highlighted rendering
- Session log viewer
- Per-project configuration
- Structured plan editor in TUI (use `$EDITOR` for now)
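
## Example: Approval Gates

The approval gates listed under Tool System (auto-approve for reads, confirm for writes and shell, deny otherwise) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's actual code: only the `Tool` trait name and the tool names `read_file` and `shell_exec` come from this document. The trait methods, `RiskLevel`, `Gate`, `gate_for`, `Registry`, and the choice to deny unknown tools are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical risk classification; one variant per gate in the design.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RiskLevel {
    Read,      // read-only tools
    Mutating,  // writes and shell commands
    Forbidden, // actions that are never permitted
}

/// The decision the orchestrator acts on before running a tool.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Gate {
    AutoApprove,
    Confirm,
    Deny,
}

/// Assumed shape of the `Tool` trait; the real signature is not given here.
pub trait Tool {
    fn name(&self) -> &'static str;
    fn risk(&self) -> RiskLevel;
}

pub struct ReadFile;
pub struct ShellExec;

impl Tool for ReadFile {
    fn name(&self) -> &'static str { "read_file" }
    fn risk(&self) -> RiskLevel { RiskLevel::Read }
}

impl Tool for ShellExec {
    fn name(&self) -> &'static str { "shell_exec" }
    fn risk(&self) -> RiskLevel { RiskLevel::Mutating }
}

/// Map a risk level onto its approval gate.
pub fn gate_for(risk: RiskLevel) -> Gate {
    match risk {
        RiskLevel::Read => Gate::AutoApprove,
        RiskLevel::Mutating => Gate::Confirm,
        RiskLevel::Forbidden => Gate::Deny,
    }
}

/// Hypothetical registry that dispatches by tool name.
#[derive(Default)]
pub struct Registry {
    tools: HashMap<String, Box<dyn Tool>>,
}

impl Registry {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn register(&mut self, tool: Box<dyn Tool>) {
        self.tools.insert(tool.name().to_string(), tool);
    }

    /// Look up the gate for a tool by name.
    /// Assumption: a tool that is not registered is denied outright.
    pub fn gate(&self, name: &str) -> Gate {
        self.tools
            .get(name)
            .map(|tool| gate_for(tool.risk()))
            .unwrap_or(Gate::Deny)
    }
}
```

With `ReadFile` and `ShellExec` registered, `registry.gate("read_file")` yields `Gate::AutoApprove` and `registry.gate("shell_exec")` yields `Gate::Confirm`. Keeping the risk level on the tool itself means classification can be unit-tested without an orchestrator, which fits the `tools` unit-test goals above.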