From 742075580093fd96b9155f4a23cbac448c975317 Mon Sep 17 00:00:00 2001
From: Drew Galbraith
Date: Sat, 14 Mar 2026 21:52:38 +0000
Subject: [PATCH] Update the design and PLAN.md (#11)

Reviewed-on: https://git.tiramisu.one/drew/skate/pulls/11
Co-authored-by: Drew Galbraith
Co-committed-by: Drew Galbraith
---
 DESIGN.md | 183 +++++++-------
 PLAN.md   | 694 +++++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 707 insertions(+), 170 deletions(-)

diff --git a/DESIGN.md b/DESIGN.md
index bdbe061..c98165d 100644
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -1,19 +1,96 @@
-# Design Decisions
+# Skate Design
+
+This is a TUI coding agent harness built for one user. The unique design goals compared
+to other coding agents are:
+
+1) Allow autonomous execution without permission prompts, without fully sacrificing
+security. The user can configure what permissions the coding agent has before execution,
+and these are enforced using kernel-level sandboxing.
+
+2) The UI supports introspection to better understand how the harness is performing.
+Information may start collapsed, but it is possible to inspect things like tool uses
+and thinking chains. Additionally, token usage is surfaced to show where the harness
+is performing inefficiently.
+
+3) The UI is modal and supports neovim-like hotkeys for navigation and configuration
+(i.e. using the space bar as a leader key). We prefer hotkeys over adding custom
+slash commands (/model) to the text chat interface. The text chat should be reserved
+for things that go straight to the underlying model.
-## Stack
-- **Language:** Rust
-- **TUI Framework:** Ratatui + Crossterm
-- **Async Runtime:** Tokio
 ## Architecture
-- Channel boundary between TUI and core (fully decoupled)
-- Module decomposition: `app`, `tui`, `core`, `provider`, `tools`, `sandbox`, `session`
-- Headless mode: core without TUI, driven by script (enables benchmarking and CI)
-## Model Integration
-- Claude-first, multi-model via `ModelProvider` trait
-- Common `StreamEvent` internal representation across providers
-- Prompt caching-aware message construction
+The coding agent is broken into three main components: the TUI, the harness, and the tool executor.
+
+The harness communicates with the tool executor via a tarpc interface.
+
+The TUI and harness communicate over a channel boundary and are fully decoupled
+in a way that supports running the harness without the TUI (i.e. in scripting mode).
+
+## Harness Design
+
+The harness follows a straightforward design loop:
+
+1. Send a message to the underlying model.
+2. If the model requests a tool use, execute it (via a call to the executor) and return to 1.
+3. Otherwise, wait for further user input.
+
+### Harness Instantiation
+
+The harness is instantiated with a system prompt and a tarpc client to the tool executor.
+(In the first iteration we use an in-process channel for the tarpc client.)
+
+### Model Integration
+
+The harness uses a trait to make it agnostic to the underlying model API.
+
+This trait unifies a variety of APIs behind a `StreamEvent` interface for streaming
+responses from the API.
+
+Currently, only Anthropic's Claude API is supported.
+
+Messages are constructed so as to support prompt caching when available.
+
+### Session Logging
+- JSONL format, one event per line
+- Events: user message, assistant message, tool call, tool result.
+- Tree-addressable via parent IDs (enables conversation branching later)
+- Token usage stored per event
+- Linear UX for now, branching deferred
+
+## Executor Design
+
+The key aspect of the executor design is that it is configured with sandbox permissions
+that allow tool use without any user prompting. Either the tool use succeeds within the
+sandbox and its result is returned to the model, or it fails and a permission error is
+returned to the model.
+
+The sandboxing allows running arbitrary shell commands without prompting.
+
+### Executor Interface
+
+The executor interface exposed to the harness has the following methods:
+
+- list_available_tools: takes no arguments and returns tool names, descriptions, and argument schemas.
+- call_tool: takes a tool name and its arguments and returns either a result or an error.
+
+### Sandboxing
+
+Sandboxing is done using the Linux kernel feature "Landlock".
+
+This allows restricting filesystem access (read-only, read/write, or no access)
+as well as network access (on/off).
+
+## TUI Design
+
+The bulk of the complexity of this coding agent is pushed to the TUI in this design.
+
+The driving goals of the TUI are:
+
+- Support (neo)vim-style keyboard navigation and modal editing.
+- Full progressive disclosure of information: high-level information is grokable at a glance,
+  but full tool use and thinking traces can be expanded.
+- Support for instantiating multiple different instances of the core harness (i.e. different
+  instantiations for code review vs planning vs implementation).
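The executor interface described above can be sketched as a plain trait. This is a simplified, synchronous stand-in for the real async tarpc service; the stub echo tool and all type names here are illustrative, not the actual implementation:

```rust
// Simplified sketch of the executor interface (synchronous and in-process;
// the real version is an async tarpc service). All names are illustrative.

#[derive(Debug, Clone)]
struct ToolDefinition {
    name: String,
    description: String,
    // argument schema elided in this sketch
}

trait Executor {
    fn list_available_tools(&self) -> Vec<ToolDefinition>;
    fn call_tool(&self, name: &str, args: &str) -> Result<String, String>;
}

// A stub executor exposing a single echo tool.
struct StubExecutor;

impl Executor for StubExecutor {
    fn list_available_tools(&self) -> Vec<ToolDefinition> {
        vec![ToolDefinition {
            name: "echo".to_string(),
            description: "Echo the arguments back".to_string(),
        }]
    }

    fn call_tool(&self, name: &str, args: &str) -> Result<String, String> {
        match name {
            "echo" => Ok(args.to_string()),
            other => Err(format!("unknown tool: {other}")),
        }
    }
}
```

Note that errors come back as plain strings so they can be fed to the model as tool results rather than aborting the loop.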
 ## UI
 - **Agent view:** Tree-based hierarchy (not flat tabs) for sub-agent inspection
@@ -24,12 +101,17 @@
 - **Status bar:** Mode indicator, current activity (Plan/Execute), token totals, network policy state
 
 ## Planning Mode
-- Distinct activity from execution — planner agent produces a plan file, does not execute
-- Plan file is structured markdown: steps with descriptions, files involved, acceptance criteria
-- Plan is reviewable and editable before execution (`:edit-plan` opens `$EDITOR`)
-- User explicitly approves plan before execution begins
-- Executor agent receives the plan file + project context, not the planning conversation
-- Plan-step progress tracked during execution (complete/in-progress/failed)
+
+In planning mode the TUI instantiates a harness with read access to the project directory
+and write access to a single plan markdown file.
+
+The TUI then provides a glue mechanism that pipes that plan into a new instantiation of the
+harness in execute mode.
+
+Additionally, we specify a schema for "surveys" that allows the model to ask the user
+questions about the plan.
+
+We also provide a hotkey (Ctrl+G or :edit-plan) that opens the plan in the user's `$EDITOR`.
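Resolving which editor to launch can be sketched as below. This is an illustrative, std-only fragment, not the harness's actual code; the function name is an assumption, and the real TUI must also suspend and restore the terminal around the editor invocation:

```rust
use std::process::Command;

/// Build the command used to open the plan file in the user's editor,
/// falling back to `vi` when $EDITOR is unset. (Illustrative sketch;
/// `editor` would be populated from `std::env::var("EDITOR").ok()`.)
fn editor_command(editor: Option<String>, plan_path: &str) -> Command {
    let program = editor.unwrap_or_else(|| "vi".to_string());
    let mut cmd = Command::new(program);
    cmd.arg(plan_path);
    cmd
}
```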
## Sub-Agents - Independent context windows with summary passed back to parent @@ -38,68 +120,3 @@ - Plan executor is a specialized sub-agent where the plan replaces the summary - Direct user interaction with sub-agents deferred -## Tool System -- Built-in tool system with `Tool` trait -- Core tools: `read_file`, `write_file`, `edit_file`, `shell_exec`, `list_directory`, `search_files` -- Approval gates by risk level: auto-approve (reads), confirm (writes/shell), deny -- MCP not implemented but interface designed to allow future adapter - -## Sandboxing -- **Landlock** (Linux kernel-level): - - Read-only: system-wide (`/`) - - Read-write: project directory, temp directory - - Network: blocked by default, toggleable via `:net on/off` -- Graceful degradation on older kernels -- All tool execution goes through `Sandbox` — tools never touch filesystem directly - -## Session Logging -- JSONL format, one event per line -- Events: user message, assistant message, tool call, tool result, sub-agent spawn/result, plan created, plan step status -- Tree-addressable via parent IDs (enables conversation branching later) -- Token usage stored per event -- Linear UX for now, branching deferred - -## Testing Strategy - -### Unit Tests -- **`provider`:** SSE stream parsing from byte fixtures, message/tool serialization, `StreamEvent` variant correctness -- **`tools`:** Path canonicalization, traversal prevention, risk level classification, registry dispatch -- **`sandbox`:** Landlock policy construction, path validation logic (without applying kernel rules) -- **`core`:** Conversation tree operations (insert, query by parent, turn computation, token totals), orchestrator state machine transitions against mock `StreamEvent` sequences -- **`session`:** JSONL serialization roundtrips, parent ID chain reconstruction -- **`tui`:** Widget rendering via Ratatui `TestBackend` - -### Integration Tests — Component Boundaries -- **Core ↔ Provider:** Mock `ModelProvider` replaying recorded API 
sessions (full SSE streams with tool use). Tests the complete orchestration loop deterministically without network. -- **Core ↔ TUI (channel boundary):** Orchestrator with mock provider connected to channels. Assert correct `UIEvent` sequence, inject `UserAction` messages, verify approval/denial flow. -- **Tools ↔ Sandbox:** Real file operations and shell commands in temp directories. Verify write confinement, path traversal rejection, network blocking. Skip Landlock-specific tests on older kernels in CI. - -### Integration Tests — End to End -- **Recorded session replay:** Capture real Claude API HTTP request/response pairs, replay deterministically. Exercises full stack (core + channel + mock TUI) without cost or network dependency. Primary E2E test strategy. -- **Live API tests:** Small suite behind feature flag / env var. Verifies real API integration. Run manually before releases, not in CI. - -### Benchmarking — SWE-bench -- **Target:** SWE-bench Verified (500 curated problems) as primary benchmark -- **Secondary:** SWE-bench Pro for testing planning mode on longer-horizon tasks -- **Approach:** Headless mode (core without TUI) produces unified diff patches, evaluated via SWE-bench Docker harness -- **Baseline:** mini-swe-agent (~100 lines Python, >74% on Verified) as calibration — if we score significantly below with same model, the issue is scaffolding -- **Cadence:** Milestone checks, not continuous CI (too expensive/slow) -- **Requirements:** x86_64, 120GB+ storage, 16GB RAM, 8 CPU cores - -### Test Sequencing -- Phase 1: Unit tests for SSE parser, event types, message serialization -- Phase 4: Recorded session replay infrastructure (core loop complex enough to warrant it) -- Phase 6-7: Headless mode + first SWE-bench Verified run - -## Configuration (Deferred) -- Single-user, hardcoded defaults for now -- Designed for later: global config, per-project `.agent.toml`, configurable keybindings - -## Deferred Features -- Conversation branching (tree 
structure in log, linear UX for now) -- Direct sub-agent interaction -- MCP adapter -- Full markdown/syntax-highlighted rendering -- Session log viewer -- Per-project configuration -- Structured plan editor in TUI (use `$EDITOR` for now) diff --git a/PLAN.md b/PLAN.md index 7ecb254..388d2da 100644 --- a/PLAN.md +++ b/PLAN.md @@ -1,96 +1,616 @@ -# Implementation Plan +# Skate Implementation Plan -## Phase 4: Sandboxing +This plan closes the gaps between the current codebase and the goals stated in DESIGN.md. +The phases are ordered by dependency -- each phase builds on the previous. -### Step 4.1: Create sandbox module with policy types and tracing foundation -- `SandboxPolicy` struct: read-only paths, read-write paths, network allowed bool -- `Sandbox` struct holding policy + working dir -- Add `tracing` spans and events throughout from the start: - - `#[instrument]` on all public `Sandbox` methods - - `debug!` on policy construction with path lists - - `info!` on sandbox creation with full policy summary -- No enforcement yet, just the type skeleton and module wiring -- **Files:** new `src/sandbox/mod.rs`, `src/sandbox/policy.rs` -- **Done when:** compiles, unit tests for policy construction, `RUST_LOG=debug cargo test` shows sandbox trace output +## Current State Summary -### Step 4.2: Landlock policy builder with startup gate and tracing -- Translate `SandboxPolicy` into Landlock ruleset using `landlock` crate -- Kernel requirements: - - **ABI v4 (kernel 6.7+):** minimum required -- provides both filesystem and network sandboxing - - ABI 1-3 have filesystem only, no network restriction -- tools could exfiltrate data freely -- Startup behavior -- on launch, check Landlock ABI version: - - ABI >= 4: proceed normally (full filesystem + network sandboxing) - - ABI < 4 (including unsupported): **refuse to start** with clear error: "Landlock ABI v4+ required (kernel 6.7+). Use --yolo to run without sandboxing." 
- - `--yolo` flag: skip all Landlock enforcement, log `warn!` at startup, show "UNSANDBOXED" in status bar permanently -- Landlock applied per-child-process via `pre_exec`, NOT to the main process - - Main process needs unrestricted network (Claude API) and filesystem (provider) - - Each `exec_command` child gets the current policy at spawn time - - `:net on/off` takes effect on the next spawned command -- Tracing: - - `info!` on kernel ABI version detected - - `debug!` for each rule added to ruleset (path, access flags) - - `warn!` on `--yolo` mode ("running without kernel sandboxing") - - `error!` if ruleset creation fails unexpectedly -- **Files:** `src/sandbox/landlock.rs`, add `landlock` dep to `Cargo.toml`, update CLI args in `src/app/` -- **Done when:** unit test constructs ruleset without panic; `--yolo` flag works on unsupported kernel; startup refuses without flag on unsupported kernel +Phase 0 (core loop) is functionally complete: the TUI renders conversations, the +orchestrator drives the Claude API, tools execute inside a Landlock sandbox, and the +channel boundary between TUI and core is properly maintained. 
-### Step 4.3: Sandbox file I/O API with operation tracing -- `Sandbox::read_file`, `Sandbox::write_file`, `Sandbox::list_directory` -- Move `validate_path` from `src/tools/mod.rs` into sandbox -- Tracing: - - `debug!` on every file operation: requested path, canonical path, allowed/denied - - `trace!` for path validation steps (join, canonicalize, starts_with check) - - `warn!` on path escape attempts (log the attempted path for debugging) - - `debug!` on successful operations with bytes read/written -- **Files:** `src/sandbox/mod.rs` -- **Done when:** unit tests in tempdir pass; path traversal rejected; `RUST_LOG=trace` shows full path resolution chain +The major gaps are: -### Step 4.4: Sandbox command execution with process tracing -- `Sandbox::exec_command(cmd, args, working_dir)` spawns child process with Landlock applied -- Captures stdout/stderr, enforces timeout -- Tracing: - - `info!` on command spawn: command, args, working_dir, timeout - - `debug!` on command completion: exit code, stdout/stderr byte lengths, duration - - `warn!` on non-zero exit codes - - `error!` on timeout or spawn failure with full context - - `trace!` for Landlock application to child process thread -- **Files:** `src/sandbox/mod.rs` or `src/sandbox/exec.rs` -- **Done when:** unit test runs `echo hello` in tempdir; write outside sandbox fails (on supported kernels) +1. Tool executor tarpc interface -- the orchestrator calls tools directly rather than + via a tarpc client/server split as DESIGN.md specifies. This is the biggest + structural gap and a prerequisite for sub-agents (each agent gets its own client). +2. Session logging (JSONL, tree-addressable) -- no `session/` module exists yet. +3. Token tracking -- counts are debug-logged but not surfaced to the user. +4. TUI introspection -- tool blocks and thinking traces cannot be expanded/collapsed. +5. Status bar is sparse -- no token totals, no activity mode, no network state badge. +6. 
Planning Mode -- no dedicated harness instantiation with restricted sandbox. +7. Sub-agents -- no spawning mechanism, no independent context windows. +8. Space-bar leader key and which-key help overlay are absent. -### Step 4.5: Wire tools through Sandbox -- Change `Tool::execute` signature to accept `&Sandbox` instead of (or in addition to) `&Path` -- Update all 4 built-in tools to call `Sandbox` methods instead of `std::fs`/`std::process::Command` -- Remove direct `std::fs` usage from tool implementations -- Update `ToolRegistry` and orchestrator to pass `Sandbox` -- Tracing: tools now inherit sandbox spans automatically via `#[instrument]` -- **Files:** `src/tools/*.rs`, `src/tools/mod.rs`, `src/core/orchestrator.rs` -- **Done when:** all existing tool tests pass through Sandbox; no direct `std::fs` in tool files; `RUST_LOG=debug cargo run` shows sandbox operations during tool execution +--- -### Step 4.6: Network toggle -- `network_allowed: bool` in `SandboxPolicy` -- `:net on/off` TUI command parsed in input handler, sent as `UserAction::SetNetworkPolicy(bool)` -- Orchestrator updates `Sandbox` policy. Status bar shows network state. 
-- Only available when Landlock ABI >= 4 (kernel 6.7+); command hidden otherwise -- Status bar shows: network state when available, "UNSANDBOXED" in `--yolo` mode -- Tracing: `info!` on network policy change -- **Files:** `src/tui/input.rs`, `src/tui/render.rs`, `src/core/types.rs`, `src/core/orchestrator.rs`, `src/sandbox/mod.rs` -- **Done when:** toggling `:net` updates status bar; Landlock network restriction applied on ABI >= 4 +## Phase 1 -- Tool Executor tarpc Interface -### Step 4.7: Integration tests -- Tools + Sandbox in tempdir: write confinement, path traversal rejection, shell command confinement -- Skip Landlock-specific assertions on ABI < 4 -- Test `--yolo` mode: sandbox constructed but no kernel enforcement -- Test startup gate: verify error on ABI < 4 without `--yolo` -- Tests should assert tracing output where relevant (use `tracing-test` crate or `tracing_subscriber::fmt::TestWriter`) -- **Files:** `tests/sandbox.rs` -- **Done when:** `cargo test --test sandbox` passes +**Goal:** Introduce the harness/executor split described in DESIGN.md. The executor +owns the `ToolRegistry` and `Sandbox`; the orchestrator (harness) communicates with +it exclusively through a tarpc client. In this phase the transport is in-process +(tarpc's unbounded channel pair), laying the groundwork for out-of-process execution +in a later phase. -### Phase 4 verification (end-to-end) -1. `cargo test` -- all tests pass -2. `cargo clippy -- -D warnings` -- zero warnings -3. `RUST_LOG=debug cargo run -- --project-dir .` -- ask Claude to read a file, observe sandbox trace logs showing path validation and Landlock policy -4. Ask Claude to write a file outside project dir -- sandbox denies with `warn!` log -5. Ask Claude to run a shell command -- observe command spawn/completion trace -6. `:net off` then ask for network access -- verify blocked -7. Without `--yolo` on ABI < 4: verify startup refuses with clear error -8. 
With `--yolo`: verify startup succeeds, "UNSANDBOXED" in status bar, `warn!` in logs
+This is the largest structural change in the plan. Every subsequent phase benefits
+from the cleaner boundary: sub-agents each get their own executor client (Phase 7),
+and the sandbox policy becomes a constructor argument to the executor rather than
+something threaded through the orchestrator.
+
+### 1.1 Define the tarpc service
+
+Create `src/executor/mod.rs`:
+
+```rust
+#[tarpc::service]
+pub trait Executor {
+    /// Return the full list of tools this executor exposes, including their
+    /// JSON Schema input descriptors. The harness calls this once at startup
+    /// and caches the result for the lifetime of the conversation.
+    async fn list_tools() -> Vec<ToolDefinition>;
+
+    /// Invoke a single tool by name with a JSON-encoded argument object.
+    /// Returns the text content to feed back to the model, or an error string
+    /// that is also fed back (so the model can self-correct).
+    async fn call_tool(name: String, input: serde_json::Value) -> Result<String, String>;
+}
+```
+
+`ToolDefinition` is already defined in `core/types.rs` and is provider-agnostic --
+no new types are needed on the wire.
+
+### 1.2 Implement `ExecutorServer`
+
+Still in `src/executor/mod.rs`, add:
+
+```rust
+pub struct ExecutorServer {
+    registry: ToolRegistry,
+    sandbox: Arc<Sandbox>,
+}
+
+impl ExecutorServer {
+    pub fn new(registry: ToolRegistry, sandbox: Sandbox) -> Self { ... }
+}
+
+impl Executor for ExecutorServer {
+    async fn list_tools(self, _: Context) -> Vec<ToolDefinition> {
+        self.registry.definitions()
+    }
+
+    async fn call_tool(self, _: Context, name: String, input: Value) -> Result<String, String> {
+        match self.registry.get(&name) {
+            None => Err(format!("unknown tool: {name}")),
+            Some(tool) => tool
+                .execute(input, &self.sandbox)
+                .await
+                .map_err(|e| e.to_string()),
+        }
+    }
+}
+```
+
+The `Arc<Sandbox>` is required because tarpc clones the server struct per request.
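The clone-per-request behavior is the reason shared state must sit behind an `Arc`. A minimal, std-only illustration (not tarpc itself; the `Server` type and counter are assumptions for demonstration):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU32, Ordering};

// The serving value is cloned per request; state that must be shared across
// requests therefore goes behind an Arc so every clone sees the same data.
#[derive(Clone)]
struct Server {
    calls: Arc<AtomicU32>, // shared across clones
}

impl Server {
    fn handle(&self) -> u32 {
        // Increment the shared counter and return the new value.
        self.calls.fetch_add(1, Ordering::SeqCst) + 1
    }
}
```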
+
+### 1.3 In-process transport helper
+
+Add a function to `src/executor/mod.rs` (and re-export from `src/app/mod.rs`) that
+wires an `ExecutorServer` to a client over tarpc's in-memory channel:
+
+```rust
+/// Spawn an ExecutorServer on the current tokio runtime and return a client
+/// connected to it via an in-process channel. The server task runs until
+/// the client is dropped.
+pub fn spawn_local(server: ExecutorServer) -> ExecutorClient {
+    let (client_transport, server_transport) = tarpc::transport::channel::unbounded();
+    let channel = tarpc::server::BaseChannel::with_defaults(server_transport);
+    tokio::spawn(channel.execute(server.serve()));
+    ExecutorClient::new(tarpc::client::Config::default(), client_transport).spawn()
+}
+```
+
+### 1.4 Refactor `Orchestrator` to use the client
+
+Currently `Orchestrator
` holds `ToolRegistry` and `Sandbox` directly and calls
+`tool.execute(input, &sandbox)` in `run_turn`. Replace these fields with:
+
+```rust
+executor: ExecutorClient,
+tool_definitions: Vec<ToolDefinition>, // fetched once at construction
+```
+
+`run_turn` changes from direct tool dispatch to:
+
+```rust
+let result = self.executor
+    .call_tool(context::current(), name, input)
+    .await;
+```
+
+The `tool_definitions` vec is passed to `provider.stream()` instead of being built
+from the registry on each call.
+
+### 1.5 Update `app/mod.rs`
+
+Replace the inline construction of `ToolRegistry + Sandbox` in `app::run` with:
+
+```rust
+let registry = build_tool_registry();
+let sandbox = Sandbox::new(policy, project_dir, enforcement)?;
+let executor = executor::spawn_local(ExecutorServer::new(registry, sandbox));
+let orchestrator = Orchestrator::new(provider, executor, system_prompt);
+```
+
+### 1.6 Tests
+
+- Unit: `ExecutorServer::call_tool` with a mock `ToolRegistry` returns correct
+  output and maps errors to `Err(String)`.
+- Integration: `spawn_local` -> `client.call_tool` round-trip through the in-process
+  channel executes a real `read_file` against a temp dir.
+- Integration: existing orchestrator integration tests continue to pass after the
+  refactor (the mock provider path is unchanged; only tool dispatch changes).
+
+### 1.7 Files touched
+
+| Action | File |
+|--------|------|
+| New | `src/executor/mod.rs` |
+| Modified | `src/core/orchestrator.rs` -- remove registry/sandbox, add executor client |
+| Modified | `src/app/mod.rs` -- construct executor, pass client to orchestrator |
+| Modified | `Cargo.toml` -- add `tarpc` with `tokio1` feature |
+
+New dependency: `tarpc` (with `tokio1` and `serde-transport` features).
+
+---
+
+## Phase 2 -- Session Logging
+
+**Goal:** Persist every event to a JSONL file. This is the foundation for token
+accounting, session resume, and future conversation branching.
+
+### 2.1 Add `src/session/` module
+
+Create `src/session/mod.rs` with the following public surface:
+
+```rust
+pub struct SessionWriter { ... }
+
+impl SessionWriter {
+    /// Open (or create) a JSONL log at the given path in append mode.
+    pub async fn open(path: &Path) -> Result<Self, SessionError>;
+
+    /// Append one event. Never rewrites history.
+    pub async fn append(&self, event: &LogEvent) -> Result<(), SessionError>;
+}
+
+pub struct SessionReader { ... }
+
+impl SessionReader {
+    pub async fn load(path: &Path) -> Result<Vec<LogEvent>, SessionError>;
+}
+```
+
+### 2.2 Define `LogEvent`
+
+```rust
+pub struct LogEvent {
+    pub id: Uuid,
+    pub parent_id: Option<Uuid>,
+    pub timestamp: DateTime<Utc>,
+    pub payload: LogPayload,
+    pub token_usage: Option<TokenUsage>,
+}
+
+pub enum LogPayload {
+    UserMessage { content: String },
+    AssistantMessage { content: Vec<ContentBlock> },
+    ToolCall { tool_name: String, input: serde_json::Value },
+    ToolResult { tool_use_id: String, content: String, is_error: bool },
+}
+
+pub struct TokenUsage {
+    pub input: u32,
+    pub output: u32,
+    pub cache_read: Option<u32>,
+    pub cache_write: Option<u32>,
+}
+```
+
+`id` and `parent_id` form a tree that enables future branching. For now the
+conversation is linear, so `parent_id` is always the id of the previous event.
+
+### 2.3 Wire into Orchestrator
+
+- `Orchestrator` holds an `Option<SessionWriter>`.
+- Every time the orchestrator pushes to `ConversationHistory` it also appends a
+  `LogEvent`. Token counts from `StreamEvent::InputTokens` / `OutputTokens` are
+  stored on the final assistant event of each turn.
+- Session file lives at `.skate/sessions/.jsonl`.
+
+### 2.4 Tests
+
+- Unit: `SessionWriter::append` then `SessionReader::load` round-trips all payload
+  variants.
+- Unit: the parent_id chain is correct across a simulated multi-turn exchange.
+- Integration: run the orchestrator with a mock provider against a temp dir; assert
+  the JSONL file is written.
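Reconstructing the linear chain from `(id, parent_id)` pairs can be sketched std-only. Integer ids stand in for Uuids here, and the function assumes the linear case (one child per parent) described above; a real `SessionReader` would deserialize JSONL lines first:

```rust
use std::collections::HashMap;

/// Reconstruct the linear event chain from (id, parent_id) pairs, starting at
/// the root (parent_id == None). Assumes at most one child per parent, which
/// holds while the conversation is linear.
fn chain(events: &[(u64, Option<u64>)]) -> Vec<u64> {
    // Index children by their parent so we can walk root -> leaf.
    let by_parent: HashMap<Option<u64>, u64> =
        events.iter().map(|&(id, parent)| (parent, id)).collect();
    let mut out = Vec::new();
    let mut cursor = None;
    while let Some(&id) = by_parent.get(&cursor) {
        out.push(id);
        cursor = Some(id);
    }
    out
}
```

The events can arrive in any order; the parent links alone recover the conversation sequence.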
+ +--- + +## Phase 3 -- Token Tracking & Status Bar + +**Goal:** Surface token usage in the TUI per-turn and cumulatively. + +### 3.1 Per-turn token counts in UIEvent + +Add a variant to `UIEvent`: + +```rust +UIEvent::TurnComplete { input_tokens: u32, output_tokens: u32 } +``` + +The orchestrator already receives `StreamEvent::InputTokens` and `OutputTokens`; +it should accumulate them during a turn and emit them in `TurnComplete`. + +### 3.2 AppState token counters + +Add to `AppState`: + +```rust +pub turn_input_tokens: u32, +pub turn_output_tokens: u32, +pub total_input_tokens: u64, +pub total_output_tokens: u64, +``` + +`events.rs` updates these on `TurnComplete`. + +### 3.3 Status bar redesign + +The status bar currently shows only the mode indicator. Expand it to four sections: + +``` +[ MODE ] [ ACTIVITY ] [ i:1234 o:567 | total i:9999 o:2345 ] [ NET: off ] +``` + +- **MODE** -- Normal / Insert / Command +- **ACTIVITY** -- Plan / Execute (Phase 4 adds Plan; for now always "Execute") +- **Tokens** -- per-turn input/output, then session cumulative +- **NET** -- `on` (green) or `off` (red) reflecting `network_allowed` + +Update `render.rs` to implement this layout using Ratatui `Layout::horizontal`. + +### 3.4 Tests + +- Unit: `AppState` accumulates totals correctly across multiple `TurnComplete` events. +- TUI snapshot test (TestBackend): status bar renders all four sections with correct + content after a synthetic `TurnComplete`. + +--- + +## Phase 4 -- TUI Introspection (Expand/Collapse) + +**Goal:** Support progressive disclosure -- tool calls and thinking traces start +collapsed; the user can expand them. 
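The collapsed-vs-expanded behavior boils down to a per-block height calculation. A minimal sketch (the block shape and field names here are assumptions for illustration, not the final types):

```rust
/// Sketch of per-block height calculation for progressive disclosure.
/// Collapsed tool calls render as a one-line summary; expanded ones show
/// their full content.
enum DisplayBlock {
    AssistantText { content: String },
    ToolCall { summary: String, full: String, expanded: bool },
}

fn height(block: &DisplayBlock) -> usize {
    match block {
        DisplayBlock::AssistantText { content } => content.lines().count(),
        DisplayBlock::ToolCall { full, expanded: true, .. } => full.lines().count(),
        DisplayBlock::ToolCall { expanded: false, .. } => 1,
    }
}
```

Scrolling then sums these heights rather than counting messages.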
+
+### 4.1 Block model
+
+Replace the flat message `Vec` in `AppState` with a `Vec<DisplayBlock>`:
+
+```rust
+pub enum DisplayBlock {
+    UserMessage { content: String },
+    AssistantText { content: String },
+    ToolCall {
+        display: ToolDisplay,
+        result: Option<String>,
+        expanded: bool,
+    },
+    Error { message: String },
+}
+```
+
+### 4.2 Navigation in Normal mode
+
+Add a block-level cursor to `AppState`:
+
+```rust
+pub focused_block: Option<usize>,
+```
+
+Keybindings (Normal mode):
+
+| Key | Action |
+|-----|--------|
+| `[` | Move focus to previous block |
+| `]` | Move focus to next block |
+| `Enter` or `Space` | Toggle `expanded` on focused ToolCall block |
+| `j` / `k` | Line scroll (unchanged) |
+
+The focused block is highlighted with a distinct border color.
+
+### 4.3 Render changes
+
+`render.rs` must calculate the height of each `DisplayBlock` depending on whether
+it is collapsed (1-2 summary lines) or expanded (full content). The scroll offset
+operates on rendered terminal rows, not message indices.
+
+Collapsed tool call shows: `> tool_name(arg_summary) -- result_summary`
+Expanded tool call shows: full input and output as formatted by `tool_display.rs`.
+
+### 4.4 Tests
+
+- Unit: toggling `expanded` on a `ToolCall` block changes height calculation.
+- TUI snapshot: collapsed vs expanded render output for `WriteFile` and `ShellExec`.
+
+---
+
+## Phase 5 -- Space-bar Leader Key & Which-Key Overlay
+
+**Goal:** Support vim-style `` leader chords for configuration actions. This
+replaces the `:net on` / `:net off` text commands with discoverable hotkeys.
+
+### 5.1 Leader key state machine
+
+Extend `AppState` with:
+
+```rust
+pub leader_active: bool,
+pub leader_timeout: Option<Instant>,
+```
+
+In Normal mode, pressing `Space` sets `leader_active = true` and starts a 1-second
+timeout. The next key is dispatched through the chord table. If the timeout fires
+or an unbound key is pressed, leader mode is cancelled with a brief status message.
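The activate/chord/timeout/cancel transitions can be sketched as a small state machine. This is an illustrative std-only sketch with assumed names and only two chords wired up:

```rust
use std::time::{Duration, Instant};

const LEADER_TIMEOUT: Duration = Duration::from_secs(1);

/// Sketch of the leader-key state machine (names are assumptions).
struct LeaderState {
    activated_at: Option<Instant>,
}

enum LeaderResult {
    Chord(&'static str), // a bound chord fired
    Cancelled,           // timeout or unbound key
    Inactive,            // key handled normally
}

impl LeaderState {
    fn new() -> Self {
        Self { activated_at: None }
    }

    fn on_key(&mut self, key: char, now: Instant) -> LeaderResult {
        match self.activated_at {
            // Space arms the leader; any other key passes through.
            None if key == ' ' => {
                self.activated_at = Some(now);
                LeaderResult::Inactive
            }
            None => LeaderResult::Inactive,
            Some(armed_at) => {
                self.activated_at = None; // leader consumes exactly one key
                if now.duration_since(armed_at) > LEADER_TIMEOUT {
                    return LeaderResult::Cancelled;
                }
                match key {
                    'n' => LeaderResult::Chord("toggle-network"),
                    'c' => LeaderResult::Chord("clear-history"),
                    _ => LeaderResult::Cancelled,
                }
            }
        }
    }
}
```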
+
+### 5.2 Initial chord table
+
+| Chord | Action |
+|-------|--------|
+| ` n` | Toggle network policy |
+| ` c` | Clear history (`:clear`) |
+| ` p` | Switch to Plan mode (Phase 6) |
+| ` ?` | Toggle which-key overlay |
+
+### 5.3 Which-key overlay
+
+A centered popup rendered over the output pane that lists all available chords and
+their descriptions. Rendered only when `leader_active = true` (after a short delay,
+~200 ms, to avoid flicker for fast typists).
+
+### 5.4 Remove `:net on/off` from command parser
+
+Once the leader-key network toggle is in place, remove the text-command duplicates to
+keep the command palette small and focused.
+
+### 5.5 Tests
+
+- Unit: leader key state machine transitions (activate, timeout, chord match, cancel).
+- TUI snapshot: which-key overlay renders with correct chord list.
+
+---
+
+## Phase 6 -- Planning Mode
+
+**Goal:** A dedicated planning harness with a restricted sandbox that writes a single
+plan file, plus a mechanism to pipe the plan into an execute harness.
+
+### 6.1 Plan harness sandbox policy
+
+In planning mode the orchestrator is instantiated with a `SandboxPolicy` that grants:
+
+- `/` -- ReadOnly (same as execute)
+- `/.skate/plan.md` -- ReadWrite (only this file)
+- Network -- off
+
+All other write attempts fail with a sandbox permission error returned to the model.
+
+### 6.2 Survey tool
+
+Add a new tool `ask_user` that allows the model to present structured questions to
+the user during planning:
+
+```rust
+// Input schema
+{
+    "question": "string",
+    "options": ["string"] | null  // null means free-text answer
+}
+```
+
+The orchestrator sends a new `UIEvent::SurveyRequest { question, options }`. The TUI
+renders an inline prompt. The user's answer is sent back as a `UserAction::SurveyResponse`.
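Answer validation against the survey schema can be sketched as follows. The `Survey` type and function name are assumptions mirroring the schema above: a `null`/absent `options` field means any free-text answer is accepted, otherwise the answer must match one of the offered options:

```rust
/// Sketch of survey answer validation (names are assumptions).
struct Survey {
    question: String,
    options: Option<Vec<String>>,
}

fn validate_answer(survey: &Survey, answer: &str) -> Result<String, String> {
    match &survey.options {
        // Free-text survey: accept anything.
        None => Ok(answer.to_string()),
        // Multiple choice: the answer must be one of the offered options.
        Some(opts) if opts.iter().any(|o| o.as_str() == answer) => Ok(answer.to_string()),
        Some(_) => Err(format!("'{answer}' is not one of the offered options")),
    }
}
```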
+ +### 6.3 TUI activity mode + +`AppState` gets: + +```rust +pub activity: Activity, + +pub enum Activity { Plan, Execute } +``` + +Switching activity (via ` p`) instantiates a new orchestrator on a fresh +channel pair. The old orchestrator is shut down cleanly. The status bar ACTIVITY +section updates. + +### 6.4 Plan -> Execute handoff + +When the user is satisfied with the plan (` x` or `:exec`): + +1. TUI reads `.skate/plan.md`. +2. Constructs a new system prompt: `\n\n## Plan\n`. +3. Instantiates an Execute orchestrator with the full sandbox policy and the + augmented system prompt. +4. Transitions `activity` to `Execute`. + +The old Plan orchestrator is dropped. + +### 6.5 Edit plan in $EDITOR + +Hotkey ` e` (or `:edit-plan`) suspends the TUI (restores terminal), opens +`$EDITOR` on `.skate/plan.md`, then resumes the TUI after the editor exits. + +### 6.6 Tests + +- Integration: plan harness rejects write to a file other than plan.md. +- Integration: survey tool round-trip through channel boundary. +- Unit: plan -> execute handoff produces correct augmented system prompt. + +--- + +## Phase 7 -- Sub-Agents + +**Goal:** The model can spawn independent sub-agents with their own context windows. +Results are summarised and returned to the parent. + +### 7.1 `spawn_agent` tool + +Add a new tool with input schema: + +```rust +{ + "task": "string", // instruction for the sub-agent + "sandbox": { // optional policy overrides + "network": bool, + "extra_write_paths": ["string"] + } +} +``` + +### 7.2 Sub-agent lifecycle + +When `spawn_agent` executes: + +1. Create a new `Orchestrator` with an independent conversation history. +2. The sub-agent's system prompt is the parent's system prompt plus the task + description. +3. The sub-agent runs autonomously (no user interaction) until it emits a + `UserAction::Quit` equivalent or hits `MAX_TOOL_ITERATIONS`. +4. The final assistant message is returned as the tool result (the "summary"). +5. 
The sub-agent's session is logged to a child JSONL file linked to the parent + session by a `parent_session_id` field. + +### 7.3 TUI sub-agent view + +The agent tree is accessible via ` a`. A side panel shows: + +``` +Parent + +-- sub-agent 1 [running] + +-- sub-agent 2 [done] +``` + +Pressing Enter on a sub-agent opens a read-only replay of its conversation (scroll +only, no input). This is a stretch goal within this phase -- the core spawning +mechanism is the priority. + +### 7.4 Tests + +- Integration: spawn_agent with a mock provider runs to completion and returns a + summary string. +- Unit: sub-agent session file has correct parent_session_id link. +- Unit: MAX_TOOL_ITERATIONS limit is respected within sub-agents. + +In this phase `spawn_agent` gains a natural implementation: it calls +`executor::spawn_local` with a new `ExecutorServer` configured for the child policy, +constructs a new `Orchestrator` with that client, and runs it to completion. The +tarpc boundary from Phase 1 makes this straightforward. + +--- + +## Phase 8 -- Prompt Caching + +**Goal:** Use Anthropic's prompt caching to reduce cost and latency on long +conversations. DESIGN.md notes this as a desired property of message construction. + +### 8.1 Cache breakpoints + +The Anthropic API supports `"cache_control": {"type": "ephemeral"}` on message +content blocks. The optimal strategy is to mark the last user message of the longest +stable prefix as a cache write point. + +In `provider/claude.rs`, when serializing the messages array: + +- Mark the system prompt content block with `cache_control` (it never changes). +- Mark the penultimate user message with `cache_control` (the conversation history + that is stable for the current turn). + +### 8.2 Cache token tracking + +The `TokenUsage` struct in `session/` already reserves `cache_read` and +`cache_write` fields. 
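For reference, a minimal sketch of that struct as this plan assumes it (the `cache_read`/`cache_write` names come from the text above; the `input`/`output` fields, types, and derives are guesses, not the actual `session/` code):

```rust
// Assumed shape of session::TokenUsage. Field names cache_read and
// cache_write are from this plan; everything else is illustrative.
#[derive(Debug, Default, Clone, Copy)]
pub struct TokenUsage {
    pub input: u64,
    pub output: u64,
    pub cache_read: u64,
    pub cache_write: u64,
}

impl TokenUsage {
    // Accumulate the usage reported for one turn into a running total.
    pub fn add(&mut self, other: TokenUsage) {
        self.input += other.input;
        self.output += other.output;
        self.cache_read += other.cache_read;
        self.cache_write += other.cache_write;
    }
}
```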
`StreamEvent` must be extended: + +```rust +StreamEvent::CacheReadTokens(u32), +StreamEvent::CacheWriteTokens(u32), +``` + +The Anthropic `message_start` event contains `usage.cache_read_input_tokens` and +`usage.cache_creation_input_tokens`. Parse these and emit the new variants. + +### 8.3 Status bar update + +Add cache tokens to the status bar display: `i:1234(c:800) o:567`. + +### 8.4 Tests + +- Provider unit test: replay a fixture that contains cache token fields; assert the + new StreamEvent variants are emitted. +- Snapshot test: status bar renders cache token counts correctly. + +--- + +## Dependency Graph + +``` +Phase 1 (tarpc executor) + | + +-- Phase 2 (session logging) -- orchestrator refactor is complete + | | + | +-- Phase 3 (token tracking) -- requires session TokenUsage struct + | | + | +-- Phase 7 (sub-agents) -- requires session parent_session_id + | + +-- Phase 7 (sub-agents) -- spawn_local reuse is natural after Phase 1 + +Phase 4 (expand/collapse) -- independent, can be done alongside Phase 3 + +Phase 5 (leader key) -- independent, prerequisite for Phase 6 + +Phase 6 (planning mode) -- requires Phase 5 (leader key chord p) + -- benefits from Phase 1 (separate executor per activity) + +Phase 8 (prompt caching) -- requires Phase 3 (cache token display) +``` + +Recommended order: 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 8, with 7 after 2 and 6. 
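As a small addendum to Phase 8.3 above, the status bar token segment is a pure formatting concern and can be pinned down now. The `i:<in>(c:<cache>) o:<out>` format is the one given in 8.3; the helper name, signature, and the decision to hide the cache figure when it is zero are assumptions:

```rust
// Sketch of the Phase 8.3 status bar token segment. Format string is
// from the plan; omitting "(c:...)" when no cache reads occurred is an
// assumed nicety, not specified behaviour.
pub fn token_segment(input: u64, cache_read: u64, output: u64) -> String {
    if cache_read > 0 {
        format!("i:{}(c:{}) o:{}", input, cache_read, output)
    } else {
        format!("i:{} o:{}", input, output)
    }
}
```

The snapshot test in 8.4 would assert the rendered status bar contains this segment.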
+ +--- + +## Files Touched Per Phase + +| Phase | New Files | Modified Files | +|-------|-----------|----------------| +| 1 | `src/executor/mod.rs` | `src/core/orchestrator.rs`, `src/core/types.rs`, `src/app/mod.rs`, `Cargo.toml` | +| 2 | `src/session/mod.rs` | `src/core/orchestrator.rs`, `src/app/mod.rs` | +| 3 | -- | `src/core/types.rs`, `src/core/orchestrator.rs`, `src/tui/events.rs`, `src/tui/render.rs` | +| 4 | -- | `src/tui/mod.rs`, `src/tui/render.rs`, `src/tui/events.rs`, `src/tui/input.rs` | +| 5 | -- | `src/tui/input.rs`, `src/tui/render.rs`, `src/tui/mod.rs` | +| 6 | `src/tools/ask_user.rs` | `src/core/types.rs`, `src/core/orchestrator.rs`, `src/tui/mod.rs`, `src/tui/input.rs`, `src/tui/render.rs`, `src/app/mod.rs` | +| 7 | -- | `src/executor/mod.rs`, `src/core/orchestrator.rs`, `src/tui/render.rs`, `src/tui/input.rs` | +| 8 | -- | `src/provider/claude.rs`, `src/core/types.rs`, `src/session/mod.rs`, `src/tui/render.rs` | + +--- + +## New Dependencies + +| Crate | Phase | Reason | +|-------|-------|--------| +| `tarpc` | 1 | RPC service trait + in-process transport | +| `uuid` | 2 | LogEvent ids | +| `chrono` | 2 | Event timestamps (check if already transitive) | + +No other new dependencies are needed. All other required functionality +(`serde_json`, `tokio`, `ratatui`, `tracing`) is already present.