Design/Plan/Claude.md from Claude.

2026-02-23 21:39:31 -08:00 · 2026-02-23 21:39:31 -08:00 · 42e3ddacc2
5 changed files with 310 additions and 0 deletions
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -0,0 +1,111 @@
+# Design Decisions
+
+## Stack
+- **Language:** Rust
+- **TUI Framework:** Ratatui + Crossterm
+- **Async Runtime:** Tokio
+
+## Architecture
+- Channel boundary between TUI and core (fully decoupled)
+- Module decomposition: `app`, `tui`, `core`, `provider`, `tools`, `sandbox`, `session`
+- Headless mode: core without TUI, driven by script (enables benchmarking and CI)
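Annotation: a minimal sketch of the channel boundary and headless mode described above, not the project's actual code. It uses `std::sync::mpsc` and a thread as a stand-in for Tokio channels and tasks so that it compiles with the standard library alone. `UIEvent` and `UserAction` are names used later in this document; their variants, `run_core`, and `run_headless` are hypothetical.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Events the core emits toward the UI (variants are illustrative).
#[derive(Debug, PartialEq)]
enum UIEvent {
    TextDelta(String),
    TurnFinished,
}

/// Actions the UI, or a headless script, sends to the core.
#[derive(Debug, PartialEq)]
enum UserAction {
    Submit(String),
    Quit,
}

/// Toy core loop: it only sees channel endpoints, never a terminal.
fn run_core(actions: Receiver<UserAction>, events: Sender<UIEvent>) {
    for action in actions {
        match action {
            UserAction::Submit(text) => {
                // A real core would call the model here; this one echoes.
                let _ = events.send(UIEvent::TextDelta(format!("echo: {text}")));
                let _ = events.send(UIEvent::TurnFinished);
            }
            UserAction::Quit => break,
        }
    }
}

/// Headless driver: a script plays the role of the TUI.
fn run_headless(script: Vec<UserAction>) -> Vec<UIEvent> {
    let (action_tx, action_rx) = channel();
    let (event_tx, event_rx) = channel();
    let core = thread::spawn(move || run_core(action_rx, event_tx));
    for action in script {
        // Stop feeding the script once the core has shut down.
        if action_tx.send(action).is_err() {
            break;
        }
    }
    drop(action_tx); // closing the channel also ends the core loop
    let events: Vec<UIEvent> = event_rx.iter().collect();
    core.join().expect("core thread panicked");
    events
}
```

Because the core depends only on the two channel endpoints, the same `run_core` could be driven by a TUI event loop, a benchmark script, or a test that asserts on the collected `UIEvent` sequence.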
+
+## Model Integration
+- Claude-first, multi-model via `ModelProvider` trait
+- Common `StreamEvent` internal representation across providers
+- Prompt caching-aware message construction
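Annotation: one possible shape for the `ModelProvider` trait and the shared `StreamEvent` representation, offered as a sketch under assumptions. The variants of `StreamEvent`, the method signatures, `ReplayProvider`, and `collect_text` are all hypothetical, and a synchronous iterator stands in for what would presumably be an async stream under Tokio.

```rust
/// Provider-neutral stream events (variants are illustrative).
#[derive(Debug, Clone, PartialEq)]
enum StreamEvent {
    TextDelta(String),
    ToolCall { name: String, input_json: String },
    Usage { input_tokens: u32, output_tokens: u32 },
    Done,
}

/// Each model backend translates its wire format into `StreamEvent`s.
trait ModelProvider {
    fn name(&self) -> &str;
    /// Synchronous stand-in for an async event stream.
    fn stream(&self, prompt: &str) -> Box<dyn Iterator<Item = StreamEvent>>;
}

/// A provider that replays a fixed event list, useful for deterministic tests.
struct ReplayProvider {
    events: Vec<StreamEvent>,
}

impl ModelProvider for ReplayProvider {
    fn name(&self) -> &str {
        "replay"
    }

    fn stream(&self, _prompt: &str) -> Box<dyn Iterator<Item = StreamEvent>> {
        Box::new(self.events.clone().into_iter())
    }
}

/// Consumer code is written once against `StreamEvent`, not per provider.
/// Returns the concatenated text and the output tokens reported so far.
fn collect_text(provider: &dyn ModelProvider, prompt: &str) -> (String, u32) {
    let mut text = String::new();
    let mut output_tokens = 0;
    for event in provider.stream(prompt) {
        match event {
            StreamEvent::TextDelta(delta) => text.push_str(&delta),
            StreamEvent::Usage { output_tokens: n, .. } => output_tokens += n,
            StreamEvent::ToolCall { .. } => {} // tool dispatch omitted here
            StreamEvent::Done => break,
        }
    }
    (text, output_tokens)
}
```

A replaying implementation like this is also the kind of mock the integration tests further down rely on.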
+
+## UI
+- **Agent view:** Tree-based hierarchy (not flat tabs) for sub-agent inspection
+- **Modes:** Normal, Insert, Command (`:` prefix from Normal mode)
+- **Activity modes:** Plan and Execute are visually distinct activities in the TUI
+- **Streaming:** Barebones styled text initially, full markdown rendering deferred
+- **Token usage:** Per-turn display (between user inputs), cumulative in status bar
+- **Status bar:** Mode indicator, current activity (Plan/Execute), token totals, network policy state
+
+## Planning Mode
+- Distinct activity from execution: the planner agent produces a plan file and does not execute it
+- Plan file is structured markdown: steps with descriptions, files involved, acceptance criteria
+- Plan is reviewable and editable before execution (`:edit-plan` opens `$EDITOR`)
+- User explicitly approves plan before execution begins
+- Executor agent receives the plan file + project context, not the planning conversation
+- Plan-step progress tracked during execution (complete/in-progress/failed)
+
+## Sub-Agents
+- Independent context windows with summary passed back to parent
+- Fully autonomous once spawned
+- Hard deny on unpermitted actions
+- Plan executor is a specialized sub-agent where the plan replaces the summary
+- Direct user interaction with sub-agents deferred
+
+## Tool System
+- Built-in tool system with `Tool` trait
+- Core tools: `read_file`, `write_file`, `edit_file`, `shell_exec`, `list_directory`, `search_files`
+- Approval gates by risk level: auto-approve (reads), confirm (writes/shell), deny
+- MCP is not implemented, but the tool interface is designed to allow a future adapter
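Annotation: a sketch of how a `Tool` trait could carry a risk level that maps onto the three approval gates listed above. The tool names come from this document; `RiskLevel`, `Gate`, `gate_for`, and the explicit deny list are assumptions, since the document does not say how the deny gate is decided.

```rust
/// Risk classes for tools (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum RiskLevel {
    Read,
    Write,
    Shell,
}

/// The three approval outcomes named in the design.
#[derive(Debug, PartialEq)]
enum Gate {
    AutoApprove,
    Confirm,
    Deny,
}

/// Minimal tool interface: execution is omitted, only metadata is shown.
trait Tool {
    fn name(&self) -> &'static str;
    fn risk(&self) -> RiskLevel;
}

struct ReadFile;
struct WriteFile;
struct ShellExec;

impl Tool for ReadFile {
    fn name(&self) -> &'static str { "read_file" }
    fn risk(&self) -> RiskLevel { RiskLevel::Read }
}

impl Tool for WriteFile {
    fn name(&self) -> &'static str { "write_file" }
    fn risk(&self) -> RiskLevel { RiskLevel::Write }
}

impl Tool for ShellExec {
    fn name(&self) -> &'static str { "shell_exec" }
    fn risk(&self) -> RiskLevel { RiskLevel::Shell }
}

/// Decide the gate for a tool call. `denied` is an assumed per-agent
/// deny list; anything on it is rejected regardless of risk level.
fn gate_for(tool: &dyn Tool, denied: &[&str]) -> Gate {
    if denied.iter().any(|d| *d == tool.name()) {
        return Gate::Deny;
    }
    match tool.risk() {
        RiskLevel::Read => Gate::AutoApprove,
        RiskLevel::Write | RiskLevel::Shell => Gate::Confirm,
    }
}
```

Keeping the decision in one function means a sub-agent's "hard deny on unpermitted actions" could reuse the same path with a stricter deny list.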
+
+## Sandboxing
+- **Landlock** (Linux kernel-level):
+  - Read-only: system-wide (`/`)
+  - Read-write: project directory, temp directory
+  - Network: blocked by default, toggleable via `:net on/off`
+- Graceful degradation on older kernels
+- All tool execution goes through `Sandbox`; tools never touch the filesystem directly
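Annotation: a sketch of the path validation logic that can be tested without applying kernel rules. It classifies an absolute path as read-write or read-only according to the policy above. `Access`, `normalize`, and `access_for` are hypothetical names, and the normalization is purely lexical: it does not resolve symlinks, so a real implementation would still need canonicalization and the Landlock rules themselves.

```rust
use std::path::{Component, Path, PathBuf};

#[derive(Debug, PartialEq)]
enum Access {
    ReadOnly,
    ReadWrite,
}

/// Resolve `.` and `..` lexically, without touching the filesystem.
/// Assumes an absolute input path; `..` at the root stays at the root.
fn normalize(path: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for component in path.components() {
        match component {
            Component::ParentDir => {
                out.pop();
            }
            Component::CurDir => {}
            other => out.push(other.as_os_str()),
        }
    }
    out
}

/// Read-write inside the project or temp directory, read-only elsewhere.
fn access_for(path: &Path, project: &Path, tmp: &Path) -> Access {
    let resolved = normalize(path);
    // `starts_with` compares whole components, so `/work/proj-other`
    // is not treated as being inside `/work/proj`.
    if resolved.starts_with(project) || resolved.starts_with(tmp) {
        Access::ReadWrite
    } else {
        Access::ReadOnly
    }
}
```

For instance, with a project directory of `/work/proj`, the path `/work/proj/../../etc/passwd` normalizes to `/etc/passwd` and is classified read-only.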
+
+## Session Logging
+- JSONL format, one event per line
+- Events: user message, assistant message, tool call, tool result, sub-agent spawn/result, plan created, plan step status
+- Tree-addressable via parent IDs (enables conversation branching later)
+- Token usage stored per event
+- Linear UX for now, branching deferred
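Annotation: a sketch of why parent IDs make the log tree-addressable. Each event stores its parent, so any leaf identifies one linear conversation even if siblings exist. `LogEvent`, its fields, `chain_to`, and `tokens_on_chain` are hypothetical; JSONL serialization is left out to keep the example dependency-free.

```rust
use std::collections::HashMap;

/// One logged event. A real record would also carry a kind and payload.
#[derive(Debug, Clone)]
struct LogEvent {
    id: u64,
    parent: Option<u64>,
    tokens: u32,
}

/// Walk parent links from `leaf` up to the root and return the ids
/// root-first. Unknown ids yield an empty chain.
fn chain_to(events: &[LogEvent], leaf: u64) -> Vec<u64> {
    let by_id: HashMap<u64, &LogEvent> = events.iter().map(|e| (e.id, e)).collect();
    let mut chain = Vec::new();
    let mut cursor = by_id.get(&leaf).copied();
    while let Some(event) = cursor {
        // Guard against malformed logs whose parent links form a cycle.
        if chain.len() == events.len() {
            break;
        }
        chain.push(event.id);
        cursor = event.parent.and_then(|p| by_id.get(&p).copied());
    }
    chain.reverse();
    chain
}

/// Sum the per-event token counts along one branch.
fn tokens_on_chain(events: &[LogEvent], leaf: u64) -> u32 {
    let on_chain = chain_to(events, leaf);
    events
        .iter()
        .filter(|e| on_chain.contains(&e.id))
        .map(|e| e.tokens)
        .sum()
}
```

With linear UX every event has exactly one child, so the chain is the whole log; once branching is enabled, the same functions would select a single branch by its leaf.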
+
+## Testing Strategy
+
+### Unit Tests
+- **`provider`:** SSE stream parsing from byte fixtures, message/tool serialization, `StreamEvent` variant correctness
+- **`tools`:** Path canonicalization, traversal prevention, risk level classification, registry dispatch
+- **`sandbox`:** Landlock policy construction, path validation logic (without applying kernel rules)
+- **`core`:** Conversation tree operations (insert, query by parent, turn computation, token totals), orchestrator state machine transitions against mock `StreamEvent` sequences
+- **`session`:** JSONL serialization roundtrips, parent ID chain reconstruction
+- **`tui`:** Widget rendering via Ratatui `TestBackend`, snapshot tests with `insta` crate for layout/mode indicator/token display
+
+### Integration Tests — Component Boundaries
+- **Core ↔ Provider:** Mock `ModelProvider` replaying recorded API sessions (full SSE streams with tool use). Tests the complete orchestration loop deterministically without network.
+- **Core ↔ TUI (channel boundary):** Orchestrator with mock provider connected to channels. Assert correct `UIEvent` sequence, inject `UserAction` messages, verify approval/denial flow.
+- **Tools ↔ Sandbox:** Real file operations and shell commands in temp directories. Verify write confinement, path traversal rejection, network blocking. Skip Landlock-specific tests on older kernels in CI.
+
+### Integration Tests — End to End
+- **Recorded session replay:** Capture real Claude API HTTP request/response pairs, replay deterministically. Exercises full stack (core + channel + mock TUI) without cost or network dependency. Primary E2E test strategy.
+- **Live API tests:** Small suite behind feature flag / env var. Verifies real API integration. Run manually before releases, not in CI.
+
+### Snapshot Testing
+- `insta` crate for TUI visual regression testing from Phase 2 onward
+- Capture rendered `TestBackend` buffers as string snapshots
+- Catches layout, mode indicator, and token display regressions
+
+### Benchmarking — SWE-bench
+- **Target:** SWE-bench Verified (500 curated problems) as primary benchmark
+- **Secondary:** SWE-bench Pro for testing planning mode on longer-horizon tasks
+- **Approach:** Headless mode (core without TUI) produces unified diff patches, evaluated via SWE-bench Docker harness
+- **Baseline:** mini-swe-agent (~100 lines of Python, >74% on Verified) as calibration: if we score significantly below it with the same model, the gap is in our scaffolding, not the model
+- **Cadence:** Milestone checks, not continuous CI (too expensive/slow)
+- **Requirements:** x86_64, 120GB+ storage, 16GB RAM, 8 CPU cores
+
+### Test Sequencing
+- Phase 1: Unit tests for SSE parser, event types, message serialization
+- Phase 2: Snapshot tests for TUI with `insta`
+- Phase 4: Recorded session replay infrastructure (by this phase the core loop is complex enough to warrant it)
+- Phase 6-7: Headless mode + first SWE-bench Verified run
+
+## Configuration (Deferred)
+- Single-user, hardcoded defaults for now
+- Designed for later: global config, per-project `.agent.toml`, configurable keybindings
+
+## Deferred Features
+- Conversation branching (tree structure in log, linear UX for now)
+- Direct sub-agent interaction
+- MCP adapter
+- Full markdown/syntax-highlighted rendering
+- Session log viewer
+- Per-project configuration
+- Structured plan editor in TUI (use `$EDITOR` for now)