Adds a status line indicating which mode the user is in. Adds a "normal" mode with keyboard shortcuts (including a chorded shortcut 'gg'). Adds a command mode with several basic commands that can be entered into an overlay.

Chores:
- Cleans up design/claude/plan.md to avoid confusing Claude.
- Adds some TODOs based on Claude feedback.

Reviewed-on: #2
Co-authored-by: Drew Galbraith <drew@tiramisu.one>
Co-committed-by: Drew Galbraith <drew@tiramisu.one>

# Design Decisions

## Stack
- **Language:** Rust
- **TUI Framework:** Ratatui + Crossterm
- **Async Runtime:** Tokio

## Architecture
- Channel boundary between TUI and core (fully decoupled)
- Module decomposition: `app`, `tui`, `core`, `provider`, `tools`, `sandbox`, `session`
- Headless mode: core without TUI, driven by script (enables benchmarking and CI)

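As a rough illustration of the channel boundary, the sketch below runs a stand-in core loop on its own thread and drives it from a scripted front end, which is roughly what headless mode implies. It uses `std::sync::mpsc` rather than Tokio channels only to stay dependency-free. The names `UIEvent` and `UserAction` come from the testing notes in this document; their variants, `run_core`, and `run_headless` are assumptions made for the example.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Messages the core sends to the front end (variants are hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum UIEvent {
    AssistantText(String),
    TurnComplete,
}

/// Messages the front end sends to the core (variants are hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum UserAction {
    SendMessage(String),
    Quit,
}

/// Runs the core loop until the front end quits or hangs up.
/// The core only sees channel endpoints, never a TUI type.
pub fn run_core(actions: Receiver<UserAction>, events: Sender<UIEvent>) {
    for action in actions {
        match action {
            UserAction::SendMessage(text) => {
                // Stand-in for a real model call: echo the input back.
                let reply = UIEvent::AssistantText(format!("echo: {text}"));
                if events.send(reply).is_err() {
                    return; // the front end is gone
                }
                if events.send(UIEvent::TurnComplete).is_err() {
                    return;
                }
            }
            UserAction::Quit => return,
        }
    }
}

/// Drives the core from a fixed script, as a headless front end might.
pub fn run_headless(script: Vec<UserAction>) -> Vec<UIEvent> {
    let (action_tx, action_rx) = channel();
    let (event_tx, event_rx) = channel();
    let core = thread::spawn(move || run_core(action_rx, event_tx));
    for action in script {
        // Ignore send errors: the core may already have quit.
        let _ = action_tx.send(action);
    }
    drop(action_tx); // hang up so the core loop ends even without Quit
    let events: Vec<UIEvent> = event_rx.iter().collect();
    core.join().expect("core thread panicked");
    events
}
```

Because the core depends only on the two message types, a TUI, a script runner, or a test harness can sit on the other side of the channels without changes to the core.
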
## Model Integration
- Claude-first, multi-model via `ModelProvider` trait
- Common `StreamEvent` internal representation across providers
- Prompt caching-aware message construction

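One possible shape for the provider abstraction is sketched below. Only the names `ModelProvider` and `StreamEvent` come from this document. The variants, the synchronous `complete` method, `ScriptedProvider`, and `collect_text` are assumptions for illustration; a real trait would most likely be async and yield events incrementally.

```rust
/// Provider-neutral stream events (variants are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub enum StreamEvent {
    TextDelta(String),
    ToolUse { name: String, input_json: String },
    Done { input_tokens: u32, output_tokens: u32 },
}

/// Hypothetical, simplified trait: synchronous and returning all events at once.
pub trait ModelProvider {
    fn name(&self) -> &str;
    fn complete(&mut self, prompt: &str) -> Vec<StreamEvent>;
}

/// Test double that replays a fixed event sequence regardless of the prompt.
pub struct ScriptedProvider {
    events: Vec<StreamEvent>,
}

impl ScriptedProvider {
    pub fn new(events: Vec<StreamEvent>) -> Self {
        Self { events }
    }
}

impl ModelProvider for ScriptedProvider {
    fn name(&self) -> &str {
        "scripted"
    }

    fn complete(&mut self, _prompt: &str) -> Vec<StreamEvent> {
        self.events.clone()
    }
}

/// Core-side consumer that depends only on the trait, not on a vendor API.
pub fn collect_text(provider: &mut dyn ModelProvider, prompt: &str) -> String {
    provider
        .complete(prompt)
        .into_iter()
        .filter_map(|event| match event {
            StreamEvent::TextDelta(text) => Some(text),
            _ => None,
        })
        .collect()
}
```

A scripted provider of this kind is also what the mock-provider integration tests described later would plug into.
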
## UI
- **Agent view:** Tree-based hierarchy (not flat tabs) for sub-agent inspection
- **Modes:** Normal, Insert, Command (`:` prefix from Normal mode)
- **Activity modes:** Plan and Execute are visually distinct activities in the TUI
- **Streaming:** Barebones styled text initially, full markdown rendering deferred
- **Token usage:** Per-turn display (between user inputs), cumulative in status bar
- **Status bar:** Mode indicator, current activity (Plan/Execute), token totals, network policy state

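The modal input handling can be pictured as a small state machine. The sketch below is a guess at its shape, not the project's code: the three modes, the `:` prefix, and the `gg` chord are mentioned in this document, while the `i` and `Esc` bindings, the meaning of `gg` (scroll to top), and all type names are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Normal,
    Insert,
    Command,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Key {
    Char(char),
    Esc,
    Enter,
}

/// What the UI should do in response to a key (hypothetical set of effects).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Effect {
    None,
    ScrollToTop, // assumed meaning of the `gg` chord
    InsertChar(char),
    RunCommand(String),
}

pub struct Input {
    pub mode: Mode,
    pending_g: bool, // true after a first `g` in Normal mode
    command: String, // text typed after `:`
}

impl Input {
    pub fn new() -> Self {
        Self { mode: Mode::Normal, pending_g: false, command: String::new() }
    }

    pub fn handle(&mut self, key: Key) -> Effect {
        match self.mode {
            Mode::Normal => {
                // Any key consumes a pending chord prefix.
                let was_pending = std::mem::take(&mut self.pending_g);
                match key {
                    Key::Char('g') if was_pending => Effect::ScrollToTop,
                    Key::Char('g') => {
                        self.pending_g = true;
                        Effect::None
                    }
                    Key::Char('i') => {
                        self.mode = Mode::Insert;
                        Effect::None
                    }
                    Key::Char(':') => {
                        self.mode = Mode::Command;
                        self.command.clear();
                        Effect::None
                    }
                    _ => Effect::None,
                }
            }
            Mode::Insert => match key {
                Key::Esc => {
                    self.mode = Mode::Normal;
                    Effect::None
                }
                Key::Char(c) => Effect::InsertChar(c),
                Key::Enter => Effect::InsertChar('\n'),
            },
            Mode::Command => match key {
                Key::Esc => {
                    self.mode = Mode::Normal;
                    self.command.clear();
                    Effect::None
                }
                Key::Enter => {
                    self.mode = Mode::Normal;
                    Effect::RunCommand(std::mem::take(&mut self.command))
                }
                Key::Char(c) => {
                    self.command.push(c);
                    Effect::None
                }
            },
        }
    }
}
```

Returning an `Effect` instead of mutating UI state directly keeps the key handling testable without a terminal.
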
## Planning Mode
- Distinct activity from execution: the planner agent produces a plan file and does not execute it
- Plan file is structured markdown: steps with descriptions, files involved, acceptance criteria
- Plan is reviewable and editable before execution (`:edit-plan` opens `$EDITOR`)
- User explicitly approves plan before execution begins
- Executor agent receives the plan file + project context, not the planning conversation
- Plan-step progress tracked during execution (complete/in-progress/failed)

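The approval gate and step tracking described above might be modeled along the following lines. This is only a sketch: every type and method name is hypothetical, and the `Pending` status is an assumption added so that unstarted steps have a state.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StepStatus {
    Pending, // assumed initial state, not named in the design notes
    InProgress,
    Complete,
    Failed,
}

/// One plan step, mirroring the fields listed for the plan file.
#[derive(Debug, Clone)]
pub struct PlanStep {
    pub description: String,
    pub files: Vec<String>,
    pub acceptance: String,
    pub status: StepStatus,
}

#[derive(Debug, Clone)]
pub struct Plan {
    pub approved: bool,
    pub steps: Vec<PlanStep>,
}

impl Plan {
    /// Marks the first pending step as in progress and returns its index.
    /// Refuses to start anything until the user has approved the plan.
    pub fn start_next(&mut self) -> Result<usize, &'static str> {
        if !self.approved {
            return Err("plan not approved");
        }
        let index = self
            .steps
            .iter()
            .position(|step| step.status == StepStatus::Pending)
            .ok_or("no pending steps")?;
        self.steps[index].status = StepStatus::InProgress;
        Ok(index)
    }

    /// Returns (completed steps, total steps) for a progress display.
    pub fn progress(&self) -> (usize, usize) {
        let done = self
            .steps
            .iter()
            .filter(|step| step.status == StepStatus::Complete)
            .count();
        (done, self.steps.len())
    }
}
```
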
## Sub-Agents
- Independent context windows with summary passed back to parent
- Fully autonomous once spawned
- Hard deny on unpermitted actions
- Plan executor is a specialized sub-agent where the plan replaces the summary
- Direct user interaction with sub-agents deferred

## Tool System
- Built-in tool system with `Tool` trait
- Core tools: `read_file`, `write_file`, `edit_file`, `shell_exec`, `list_directory`, `search_files`
- Approval gates by risk level: auto-approve (reads), confirm (writes/shell), deny
- MCP not implemented but interface designed to allow future adapter

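To make the approval gates concrete, here is a minimal sketch of a `Tool` trait with a risk level and a registry lookup. The trait name and the tool names come from this document. `RiskLevel`, `Approval`, `gate`, `Registry`, and the choice to deny unknown tools are assumptions for the example.

```rust
/// Hypothetical risk classification for a tool.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RiskLevel {
    ReadOnly,
    Mutating,
    Forbidden,
}

/// The three gate outcomes listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Approval {
    Auto,
    Confirm,
    Deny,
}

/// Simplified trait: a real one would also carry a schema and an execute method.
pub trait Tool {
    fn name(&self) -> &'static str;
    fn risk(&self) -> RiskLevel;
}

/// Maps a risk level to its approval gate.
pub fn gate(risk: RiskLevel) -> Approval {
    match risk {
        RiskLevel::ReadOnly => Approval::Auto,
        RiskLevel::Mutating => Approval::Confirm,
        RiskLevel::Forbidden => Approval::Deny,
    }
}

pub struct ReadFile;
pub struct ShellExec;

impl Tool for ReadFile {
    fn name(&self) -> &'static str {
        "read_file"
    }
    fn risk(&self) -> RiskLevel {
        RiskLevel::ReadOnly
    }
}

impl Tool for ShellExec {
    fn name(&self) -> &'static str {
        "shell_exec"
    }
    fn risk(&self) -> RiskLevel {
        RiskLevel::Mutating
    }
}

pub struct Registry {
    tools: Vec<Box<dyn Tool>>,
}

impl Registry {
    pub fn new(tools: Vec<Box<dyn Tool>>) -> Self {
        Self { tools }
    }

    /// Looks up a tool by name; unknown tools are denied (an assumption).
    pub fn approval_for(&self, name: &str) -> Approval {
        self.tools
            .iter()
            .find(|tool| tool.name() == name)
            .map(|tool| gate(tool.risk()))
            .unwrap_or(Approval::Deny)
    }
}
```

Keeping the gate as a pure function of the risk level means the classification can be unit-tested without executing any tool.
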
## Sandboxing
- **Landlock** (Linux kernel-level):
  - Read-only: system-wide (`/`)
  - Read-write: project directory, temp directory
  - Network: blocked by default, toggleable via `:net on/off`
  - Graceful degradation on older kernels
- All tool execution goes through `Sandbox`; tools never touch the filesystem directly

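The read-only and read-write split can be checked in user space before any kernel rule is applied. The sketch below shows one way such path validation might look, assuming Unix-style absolute paths. `Policy`, `Access`, and `normalize` are hypothetical names, and the normalization is purely lexical: it does not resolve symlinks, which a real implementation would need to handle (for example by canonicalizing paths), and it does not call Landlock at all.

```rust
use std::path::{Component, Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Access {
    ReadOnly,
    ReadWrite,
}

/// Hypothetical user-space mirror of the filesystem policy.
pub struct Policy {
    writable: Vec<PathBuf>,
}

/// Resolves `.` and `..` lexically, without touching the filesystem.
fn normalize(path: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for part in path.components() {
        match part {
            Component::ParentDir => {
                out.pop(); // stays at the root if already there
            }
            Component::CurDir => {}
            other => out.push(other.as_os_str()),
        }
    }
    out
}

impl Policy {
    pub fn new(project_dir: &Path, temp_dir: &Path) -> Self {
        Self { writable: vec![normalize(project_dir), normalize(temp_dir)] }
    }

    /// Returns the access level for an absolute path, or None for a relative one.
    pub fn access(&self, path: &Path) -> Option<Access> {
        if !path.is_absolute() {
            return None;
        }
        let path = normalize(path);
        // `starts_with` compares whole components, so `/work/proj-evil`
        // does not match the root `/work/proj`.
        if self.writable.iter().any(|root| path.starts_with(root)) {
            Some(Access::ReadWrite)
        } else {
            Some(Access::ReadOnly)
        }
    }

    pub fn can_write(&self, path: &Path) -> bool {
        self.access(path) == Some(Access::ReadWrite)
    }
}
```
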
## Session Logging
- JSONL format, one event per line
- Events: user message, assistant message, tool call, tool result, sub-agent spawn/result, plan created, plan step status
- Tree-addressable via parent IDs (enables conversation branching later)
- Token usage stored per event
- Linear UX for now, branching deferred

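The parent-ID scheme means any event identifies a single branch back to the root. The sketch below reconstructs that branch from already-parsed events and sums its token usage. The `LogEvent` fields and both function names are assumptions, and JSONL parsing is left out (a JSON library would normally handle it).

```rust
use std::collections::HashMap;

/// Hypothetical in-memory form of one log line.
#[derive(Debug, Clone, PartialEq)]
pub struct LogEvent {
    pub id: u64,
    pub parent: Option<u64>,
    pub kind: String,
    pub tokens: u32,
}

/// Walks parent links from `leaf` to the root and returns the branch root-first.
/// Returns None if an ID is missing or the links form a cycle.
pub fn branch(events: &[LogEvent], leaf: u64) -> Option<Vec<&LogEvent>> {
    let by_id: HashMap<u64, &LogEvent> = events.iter().map(|e| (e.id, e)).collect();
    let mut chain = Vec::new();
    let mut cursor = Some(leaf);
    while let Some(id) = cursor {
        // A valid chain can never be longer than the log itself.
        if chain.len() >= events.len() {
            return None;
        }
        let event = *by_id.get(&id)?;
        chain.push(event);
        cursor = event.parent;
    }
    chain.reverse();
    Some(chain)
}

/// Sums the per-event token usage along the branch ending at `leaf`.
pub fn branch_tokens(events: &[LogEvent], leaf: u64) -> Option<u32> {
    Some(branch(events, leaf)?.iter().map(|e| e.tokens).sum::<u32>())
}
```

With a linear UX every event has exactly one child, so the branch is the whole conversation; the same lookup keeps working if branching is added later.
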
## Testing Strategy

### Unit Tests
- **`provider`:** SSE stream parsing from byte fixtures, message/tool serialization, `StreamEvent` variant correctness
- **`tools`:** Path canonicalization, traversal prevention, risk level classification, registry dispatch
- **`sandbox`:** Landlock policy construction, path validation logic (without applying kernel rules)
- **`core`:** Conversation tree operations (insert, query by parent, turn computation, token totals), orchestrator state machine transitions against mock `StreamEvent` sequences
- **`session`:** JSONL serialization roundtrips, parent ID chain reconstruction
- **`tui`:** Widget rendering via Ratatui `TestBackend`

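As an example of what "SSE stream parsing from byte fixtures" could exercise, the sketch below is a small incremental parser that accepts arbitrary byte chunks and emits complete events. It is an illustration, not the project's parser: `SseParser` and `SseEvent` are hypothetical names, only the `event:` and `data:` fields are handled, an unnamed event keeps an empty `event` string, and lines are assumed to end in `\n` or `\r\n`.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SseEvent {
    pub event: String,
    pub data: String,
}

/// Incremental parser: bytes may arrive split at any position.
#[derive(Default)]
pub struct SseParser {
    buffer: Vec<u8>,   // bytes of the current, still incomplete line
    event: String,     // pending `event:` field
    data: Vec<String>, // pending `data:` lines
}

impl SseParser {
    /// Feeds one chunk and returns every event completed by it.
    pub fn feed(&mut self, chunk: &[u8]) -> Vec<SseEvent> {
        self.buffer.extend_from_slice(chunk);
        let mut out = Vec::new();
        while let Some(pos) = self.buffer.iter().position(|&b| b == b'\n') {
            let raw: Vec<u8> = self.buffer.drain(..=pos).collect();
            let line = String::from_utf8_lossy(&raw);
            let line = line.trim_end_matches(|c: char| c == '\n' || c == '\r');
            if line.is_empty() {
                // A blank line dispatches the pending event, if it has data.
                if self.data.is_empty() {
                    self.event.clear();
                } else {
                    out.push(SseEvent {
                        event: std::mem::take(&mut self.event),
                        data: self.data.join("\n"),
                    });
                    self.data.clear();
                }
            } else if let Some(value) = line.strip_prefix("data:") {
                // One optional space after the colon is not part of the value.
                self.data.push(value.strip_prefix(' ').unwrap_or(value).to_string());
            } else if let Some(value) = line.strip_prefix("event:") {
                self.event = value.strip_prefix(' ').unwrap_or(value).to_string();
            }
            // Comment lines (starting with ':') and other fields are ignored.
        }
        out
    }
}
```

A fixture-based test can then split a recorded byte stream at awkward offsets and assert that the same events come out regardless of chunking.
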
### Integration Tests: Component Boundaries
- **Core ↔ Provider:** Mock `ModelProvider` replaying recorded API sessions (full SSE streams with tool use). Tests the complete orchestration loop deterministically without network.
- **Core ↔ TUI (channel boundary):** Orchestrator with mock provider connected to channels. Assert correct `UIEvent` sequence, inject `UserAction` messages, verify approval/denial flow.
- **Tools ↔ Sandbox:** Real file operations and shell commands in temp directories. Verify write confinement, path traversal rejection, network blocking. Skip Landlock-specific tests on older kernels in CI.

### Integration Tests: End to End
- **Recorded session replay:** Capture real Claude API HTTP request/response pairs, replay deterministically. Exercises full stack (core + channel + mock TUI) without cost or network dependency. Primary E2E test strategy.
- **Live API tests:** Small suite behind feature flag / env var. Verifies real API integration. Run manually before releases, not in CI.

### Benchmarking: SWE-bench
- **Target:** SWE-bench Verified (500 curated problems) as primary benchmark
- **Secondary:** SWE-bench Pro for testing planning mode on longer-horizon tasks
- **Approach:** Headless mode (core without TUI) produces unified diff patches, evaluated via SWE-bench Docker harness
- **Baseline:** mini-swe-agent (~100 lines Python, >74% on Verified) as calibration: if we score significantly below it with the same model, the issue is our scaffolding
- **Cadence:** Milestone checks, not continuous CI (too expensive/slow)
- **Requirements:** x86_64, 120GB+ storage, 16GB RAM, 8 CPU cores

### Test Sequencing
- Phase 1: Unit tests for SSE parser, event types, message serialization
- Phase 4: Recorded session replay infrastructure (core loop complex enough to warrant it)
- Phase 6-7: Headless mode + first SWE-bench Verified run

## Configuration (Deferred)
- Single-user, hardcoded defaults for now
- Designed for later: global config, per-project `.agent.toml`, configurable keybindings

## Deferred Features
- Conversation branching (tree structure in log, linear UX for now)
- Direct sub-agent interaction
- MCP adapter
- Full markdown/syntax-highlighted rendering
- Session log viewer
- Per-project configuration
- Structured plan editor in TUI (use `$EDITOR` for now)