Update the design and PLAN.md (#11)

Reviewed-on: #11
Co-authored-by: Drew Galbraith <drew@tiramisu.one>
Co-committed-by: Drew Galbraith <drew@tiramisu.one>
2026-03-14 21:52:38 +00:00 · 2026-03-14 21:52:38 +00:00 · 7420755800
commit 7420755800
parent 669e05b716
2 changed files with 707 additions and 170 deletions
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -1,19 +1,96 @@
-# Design Decisions
+# Skate Design
+
+This is a TUI coding agent harness built for one user. The unique design goals compared
+to other coding agents are:
+
+1) Allow autonomous execution with no permission prompts, without fully sacrificing security.
+The user configures the coding agent's permissions before execution, and these
+are enforced using kernel-level sandboxing.
+
+2) The UI supports introspection to better understand how the harness is performing.
+Information may start collapsed, but it is possible to introspect things like tool uses
+and thinking chains. Additionally, token usage is surfaced prominently to show where the harness
+is performing inefficiently.
+
+3) The UI is modal and supports Neovim-like hotkeys for navigation and configuration
+(e.g. using the space bar as a leader key). We prefer having hotkeys over adding custom
+slash commands (/model) to the text chat interface. The text chat should be reserved for
+things that go straight to the underlying model.

-## Stack
- **Language:** Rust
- **TUI Framework:** Ratatui + Crossterm
- **Async Runtime:** Tokio

 ## Architecture
- Channel boundary between TUI and core (fully decoupled)
- Module decomposition: `app`, `tui`, `core`, `provider`, `tools`, `sandbox`, `session`
- Headless mode: core without TUI, driven by script (enables benchmarking and CI)

-## Model Integration
- Claude-first, multi-model via `ModelProvider` trait
- Common `StreamEvent` internal representation across providers
- Prompt caching-aware message construction
+The coding agent is broken into three main components: the TUI, the harness, and the tool executor.
+
+The harness communicates with the tool executor via a tarpc interface.
+
+The TUI and harness communicate over a channel boundary and are fully decoupled,
+which supports running the harness without the TUI (i.e. in scripting mode).
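
As a rough illustration of the channel boundary described above, the sketch below drives a stand-in harness from a script instead of a TUI. It uses `std::sync::mpsc` and a thread to stay dependency-free; the `UserAction` and `UiEvent` types and both function names are hypothetical and chosen only for this example.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Hypothetical messages sent from the front end (TUI or script) to the harness.
#[derive(Debug, PartialEq)]
pub enum UserAction {
    Submit(String),
    Quit,
}

/// Hypothetical messages sent from the harness back to the front end.
#[derive(Debug, PartialEq)]
pub enum UiEvent {
    AssistantText(String),
    Done,
}

/// Stand-in for the harness: it only ever sees the two channel endpoints.
pub fn run_harness(actions: Receiver<UserAction>, events: Sender<UiEvent>) {
    for action in actions {
        match action {
            UserAction::Submit(text) => {
                // A real harness would call the model here; this sketch echoes.
                let _ = events.send(UiEvent::AssistantText(format!("echo: {text}")));
            }
            UserAction::Quit => break,
        }
    }
    let _ = events.send(UiEvent::Done);
}

/// Scripting mode: drive the harness from a fixed list of prompts, with no TUI.
pub fn run_script(prompts: &[&str]) -> Vec<UiEvent> {
    let (action_tx, action_rx) = channel();
    let (event_tx, event_rx) = channel();
    let harness = thread::spawn(move || run_harness(action_rx, event_tx));

    for prompt in prompts {
        action_tx.send(UserAction::Submit(prompt.to_string())).unwrap();
    }
    action_tx.send(UserAction::Quit).unwrap();

    // The iterator ends once the harness drops its sender.
    let events: Vec<UiEvent> = event_rx.iter().collect();
    harness.join().unwrap();
    events
}
```

Because `run_harness` depends only on the channel endpoints, a TUI and a script runner are interchangeable from its point of view.
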
+
+## Harness Design
+
+The harness runs a fairly straightforward loop:
+
+1. Send message to underlying model.
+2. If model requests a tool use, execute it (via a call to the executor) and return to 1.
+3. Else, wait for further user input.
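
The three steps above can be sketched as a single function. This is a minimal illustration, not the actual implementation: the model and executor are passed in as closures, and `ModelReply`, `Message`, and `run_turn` are hypothetical names.

```rust
/// What the model asks for after each request (hypothetical shape).
pub enum ModelReply {
    ToolUse { name: String, args: String },
    Text(String),
}

/// One entry in the conversation sent to the model.
#[derive(Debug, Clone, PartialEq)]
pub enum Message {
    User(String),
    Assistant(String),
    ToolResult(String),
}

/// Runs one user turn: loops until the model stops requesting tools.
pub fn run_turn(
    history: &mut Vec<Message>,
    model: &mut dyn FnMut(&[Message]) -> ModelReply,
    executor: &mut dyn FnMut(&str, &str) -> Result<String, String>,
) -> String {
    loop {
        // Step 1: send the conversation so far to the model.
        match model(history.as_slice()) {
            // Step 2: run the tool, record the outcome, and go back to step 1.
            ModelReply::ToolUse { name, args } => {
                let outcome = match executor(&name, &args) {
                    Ok(output) => output,
                    // Failures are reported to the model rather than to the user.
                    Err(error) => format!("error: {error}"),
                };
                history.push(Message::ToolResult(outcome));
            }
            // Step 3: no tool requested, so hand control back to the user.
            ModelReply::Text(text) => {
                history.push(Message::Assistant(text.clone()));
                return text;
            }
        }
    }
}
```
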
+
+### Harness Instantiation
+
+The harness is instantiated with a system prompt and a tarpc client to the tool executor.
+(In the first iteration, we use an in-process channel for the tarpc client.)
+
+### Model Integration
+
+The harness uses a trait to stay agnostic to the underlying model provider.
+
+This trait unifies a variety of APIs behind a common `StreamEvent` interface for streaming
+responses.
+
+Currently, only Anthropic's Claude API is supported.
+
+Messages are constructed so that prompt caching can be used when available.
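
To make the trait-plus-`StreamEvent` idea concrete, here is a small sketch. Only the names `StreamEvent` and `ModelProvider` (from the previous revision of this document) come from the design; the variants, the `stream` signature, `ScriptedProvider`, and `collect_response` are assumptions made for illustration, and a real implementation would likely be async.

```rust
/// Provider-agnostic streaming events (variant names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub enum StreamEvent {
    TextDelta(String),
    ToolUse { name: String, input_json: String },
    Usage { input_tokens: u64, output_tokens: u64 },
    Done,
}

/// Each backend translates its own wire format into `StreamEvent`s.
pub trait ModelProvider {
    fn stream(&mut self, prompt: &str) -> Box<dyn Iterator<Item = StreamEvent>>;
}

/// Test double that replays a fixed event sequence.
pub struct ScriptedProvider {
    pub events: Vec<StreamEvent>,
}

impl ModelProvider for ScriptedProvider {
    fn stream(&mut self, _prompt: &str) -> Box<dyn Iterator<Item = StreamEvent>> {
        Box::new(self.events.clone().into_iter())
    }
}

/// Accumulated result of one streamed response.
#[derive(Debug, Default, PartialEq)]
pub struct Response {
    pub text: String,
    pub tool_uses: Vec<(String, String)>,
    pub output_tokens: u64,
}

/// Consumes events until `Done`, regardless of which provider produced them.
pub fn collect_response(provider: &mut dyn ModelProvider, prompt: &str) -> Response {
    let mut response = Response::default();
    for event in provider.stream(prompt) {
        match event {
            StreamEvent::TextDelta(chunk) => response.text.push_str(&chunk),
            StreamEvent::ToolUse { name, input_json } => {
                response.tool_uses.push((name, input_json))
            }
            StreamEvent::Usage { output_tokens, .. } => response.output_tokens += output_tokens,
            StreamEvent::Done => break,
        }
    }
    response
}
```

The harness code only ever matches on `StreamEvent`, so adding a second provider would mean writing one more `ModelProvider` implementation and nothing else.
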
+
+### Session Logging
+- JSONL format, one event per line
+- Events: user message, assistant message, tool call, tool result
+- Tree-addressable via parent IDs (enables conversation branching later)
+- Token usage stored per event
+- Linear UX for now, branching deferred
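
The sketch below shows one way the parent-ID scheme could work: each event serializes to one JSON line, and a branch is recovered by walking parent IDs from a leaf back to the root. The field names and the `LogEvent` and `branch` identifiers are hypothetical; a real implementation would probably use a JSON library rather than `format!`.

```rust
use std::collections::HashMap;

/// One logged event (field names are assumptions for this example).
#[derive(Debug, Clone, PartialEq)]
pub struct LogEvent {
    pub id: u64,
    pub parent: Option<u64>,
    /// Fixed identifier such as "user_message" or "tool_call"; never user text,
    /// so no JSON escaping is needed in this sketch.
    pub kind: &'static str,
    pub tokens: u64,
}

impl LogEvent {
    /// Renders the event as one JSON object on a single line.
    pub fn to_json_line(&self) -> String {
        let parent = match self.parent {
            Some(id) => id.to_string(),
            None => "null".to_string(),
        };
        format!(
            "{{\"id\":{},\"parent\":{},\"kind\":\"{}\",\"tokens\":{}}}",
            self.id, parent, self.kind, self.tokens
        )
    }
}

/// Follows parent IDs from `leaf` to the root and returns the branch root-first.
pub fn branch(events: &[LogEvent], leaf: u64) -> Vec<&LogEvent> {
    let by_id: HashMap<u64, &LogEvent> = events.iter().map(|e| (e.id, e)).collect();
    let mut path = Vec::new();
    let mut cursor = by_id.get(&leaf).copied();
    while let Some(event) = cursor {
        // Guard against malformed logs whose parent links form a cycle.
        if path.len() == events.len() {
            break;
        }
        path.push(event);
        cursor = event.parent.and_then(|id| by_id.get(&id).copied());
    }
    path.reverse();
    path
}
```

With a linear UX, every event has exactly one child and `branch` returns the whole log; once branching exists, two leaves that share a parent yield two conversations with a common prefix.
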
+
+## Executor Design
+
+The key aspect of the executor design is that it is configured with sandbox permissions
+that allow tool use without any user prompting. A tool use either succeeds within the
+sandbox and its result is returned to the model, or it fails and a permission error is
+returned to the model.
+
+The sandboxing allows running arbitrary shell commands without prompting.
+
+### Executor Interface
+
+The executor interface exposed to the harness has the following methods:
+
+- `list_available_tools`: takes no arguments and returns tool names, descriptions, and argument schemas.
+- `call_tool`: takes a tool name and its arguments and returns either a result or an error.
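
As a plain-Rust sketch of the two methods above (the real interface is served over tarpc, which is not shown here), the trait below uses the method names from the list; the `ToolInfo` and `ToolError` types, the string-typed arguments, and the toy `EchoExecutor` are assumptions for illustration.

```rust
/// Metadata returned by `list_available_tools` (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct ToolInfo {
    pub name: String,
    pub description: String,
    /// Argument schema, e.g. a JSON Schema document.
    pub argument_schema: String,
}

/// Errors surfaced to the model instead of prompting the user.
#[derive(Debug, PartialEq)]
pub enum ToolError {
    UnknownTool(String),
    PermissionDenied(String),
}

pub trait Executor {
    fn list_available_tools(&self) -> Vec<ToolInfo>;
    fn call_tool(&mut self, name: &str, arguments: &str) -> Result<String, ToolError>;
}

/// Toy executor with a single `echo` tool, used only to exercise the trait.
pub struct EchoExecutor;

impl Executor for EchoExecutor {
    fn list_available_tools(&self) -> Vec<ToolInfo> {
        vec![ToolInfo {
            name: "echo".to_string(),
            description: "Returns its argument unchanged.".to_string(),
            argument_schema: r#"{"type":"string"}"#.to_string(),
        }]
    }

    fn call_tool(&mut self, name: &str, arguments: &str) -> Result<String, ToolError> {
        match name {
            "echo" => Ok(arguments.to_string()),
            other => Err(ToolError::UnknownTool(other.to_string())),
        }
    }
}
```

Keeping the surface this small means the harness needs no knowledge of individual tools: it forwards whatever `list_available_tools` reports to the model and relays `call_tool` results back.
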
+
+### Sandboxing
+
+Sandboxing is done using the Linux kernel feature "Landlock".
+
+This allows restricting file system access (either read-only, read/write, or no access)
+as well as network access (either on or off).
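
The following sketch models such a policy as plain data so it can be reasoned about and unit-tested without applying any kernel rules. It does not call Landlock itself; `FsAccess`, `SandboxPolicy`, and `access_for` are hypothetical names. It assumes allow-list semantics in which nothing is permitted by default and every matching rule adds access, so the broadest grant that covers a path applies.

```rust
use std::path::{Path, PathBuf};

/// Access levels, ordered from least to most permissive.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum FsAccess {
    None,
    ReadOnly,
    ReadWrite,
}

/// Declarative policy: file system grants plus a network on/off switch.
pub struct SandboxPolicy {
    pub rules: Vec<(PathBuf, FsAccess)>,
    pub network: bool,
}

impl SandboxPolicy {
    /// Returns the most permissive access granted by any rule covering `path`.
    /// Paths covered by no rule get `FsAccess::None`.
    pub fn access_for(&self, path: &Path) -> FsAccess {
        self.rules
            .iter()
            // `Path::starts_with` compares whole components, so `/proj2`
            // is not treated as being inside `/proj`.
            .filter(|(root, _)| path.starts_with(root))
            .map(|(_, access)| *access)
            .max()
            .unwrap_or(FsAccess::None)
    }

    pub fn can_write(&self, path: &Path) -> bool {
        self.access_for(path) == FsAccess::ReadWrite
    }
}
```

For example, a policy with a read-only rule for a project directory and a read/write rule for a single file inside it lets the agent read everything in the project but modify only that one file.
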
+
+## TUI Design
+
+The bulk of the complexity of this coding agent is pushed into the TUI in this design.
+
+The driving goals of the TUI are:
+
+- Support (neo)vim style keyboard navigation and modal editing.
+- Full progressive disclosure of information: high-level information is grokkable at a glance,
+  but full tool use and thinking traces can be expanded.
+- Support for instantiating multiple instances of the core harness (e.g. separate
+  instances for code review, planning, and implementation).

 ## UI
 - **Agent view:** Tree-based hierarchy (not flat tabs) for sub-agent inspection
@@ -24,12 +101,17 @@
 - **Status bar:** Mode indicator, current activity (Plan/Execute), token totals, network policy state

 ## Planning Mode
- Distinct activity from execution — planner agent produces a plan file, does not execute
- Plan file is structured markdown: steps with descriptions, files involved, acceptance criteria
- Plan is reviewable and editable before execution (`:edit-plan` opens `$EDITOR`)
- User explicitly approves plan before execution begins
- Executor agent receives the plan file + project context, not the planning conversation
- Plan-step progress tracked during execution (complete/in-progress/failed)
+
+In planning mode the TUI instantiates a harness with read access to the project directory
+and write access to a single plan markdown file.
+
+The TUI then provides a glue mechanism that pipes that plan into a new instance of the
+harness in execute mode.
+
+Additionally, we specify a schema for "surveys" that allow the model to ask the user questions about
+the plan.
+
+We also provide a hotkey (Ctrl+G) and a command (`:edit-plan`) that open the plan in the user's `$EDITOR`.

 ## Sub-Agents
 - Independent context windows with summary passed back to parent
@@ -38,68 +120,3 @@
 - Plan executor is a specialized sub-agent where the plan replaces the summary
 - Direct user interaction with sub-agents deferred

-## Tool System
- Built-in tool system with `Tool` trait
- Core tools: `read_file`, `write_file`, `edit_file`, `shell_exec`, `list_directory`, `search_files`
- Approval gates by risk level: auto-approve (reads), confirm (writes/shell), deny
- MCP not implemented but interface designed to allow future adapter
-
-## Sandboxing
- **Landlock** (Linux kernel-level):
-  - Read-only: system-wide (`/`)
-  - Read-write: project directory, temp directory
-  - Network: blocked by default, toggleable via `:net on/off`
- Graceful degradation on older kernels
- All tool execution goes through `Sandbox` — tools never touch filesystem directly
-
-## Session Logging
- JSONL format, one event per line
- Events: user message, assistant message, tool call, tool result, sub-agent spawn/result, plan created, plan step status
- Tree-addressable via parent IDs (enables conversation branching later)
- Token usage stored per event
- Linear UX for now, branching deferred
-
-## Testing Strategy
-
-### Unit Tests
- **`provider`:** SSE stream parsing from byte fixtures, message/tool serialization, `StreamEvent` variant correctness
- **`tools`:** Path canonicalization, traversal prevention, risk level classification, registry dispatch
- **`sandbox`:** Landlock policy construction, path validation logic (without applying kernel rules)
- **`core`:** Conversation tree operations (insert, query by parent, turn computation, token totals), orchestrator state machine transitions against mock `StreamEvent` sequences
- **`session`:** JSONL serialization roundtrips, parent ID chain reconstruction
- **`tui`:** Widget rendering via Ratatui `TestBackend`
-
-### Integration Tests — Component Boundaries
- **Core ↔ Provider:** Mock `ModelProvider` replaying recorded API sessions (full SSE streams with tool use). Tests the complete orchestration loop deterministically without network.
- **Core ↔ TUI (channel boundary):** Orchestrator with mock provider connected to channels. Assert correct `UIEvent` sequence, inject `UserAction` messages, verify approval/denial flow.
- **Tools ↔ Sandbox:** Real file operations and shell commands in temp directories. Verify write confinement, path traversal rejection, network blocking. Skip Landlock-specific tests on older kernels in CI.
-
-### Integration Tests — End to End
- **Recorded session replay:** Capture real Claude API HTTP request/response pairs, replay deterministically. Exercises full stack (core + channel + mock TUI) without cost or network dependency. Primary E2E test strategy.
- **Live API tests:** Small suite behind feature flag / env var. Verifies real API integration. Run manually before releases, not in CI.
-
-### Benchmarking — SWE-bench
- **Target:** SWE-bench Verified (500 curated problems) as primary benchmark
- **Secondary:** SWE-bench Pro for testing planning mode on longer-horizon tasks
- **Approach:** Headless mode (core without TUI) produces unified diff patches, evaluated via SWE-bench Docker harness
- **Baseline:** mini-swe-agent (~100 lines Python, >74% on Verified) as calibration — if we score significantly below with same model, the issue is scaffolding
- **Cadence:** Milestone checks, not continuous CI (too expensive/slow)
- **Requirements:** x86_64, 120GB+ storage, 16GB RAM, 8 CPU cores
-
-### Test Sequencing
- Phase 1: Unit tests for SSE parser, event types, message serialization
- Phase 4: Recorded session replay infrastructure (core loop complex enough to warrant it)
- Phase 6-7: Headless mode + first SWE-bench Verified run
-
-## Configuration (Deferred)
- Single-user, hardcoded defaults for now
- Designed for later: global config, per-project `.agent.toml`, configurable keybindings
-
-## Deferred Features
- Conversation branching (tree structure in log, linear UX for now)
- Direct sub-agent interaction
- MCP adapter
- Full markdown/syntax-highlighted rendering
- Session log viewer
- Per-project configuration
- Structured plan editor in TUI (use `$EDITOR` for now)