Update the design and PLAN.md (#11)

Reviewed-on: #11
Co-authored-by: Drew Galbraith <drew@tiramisu.one>
Co-committed-by: Drew Galbraith <drew@tiramisu.one>
This commit is contained in:
Drew 2026-03-14 21:52:38 +00:00 committed by Drew
parent 669e05b716
commit 7420755800
2 changed files with 707 additions and 170 deletions

183
DESIGN.md

@@ -1,19 +1,96 @@
# Design Decisions
# Skate Design
This is a TUI coding agent harness built for one user. The unique design goals compared
to other coding agents are:
1) Allow autonomous execution without permission prompts, without fully sacrificing security.
The user configures the coding agent's permissions before execution, and these
are enforced using kernel-level sandboxing.
2) The UI supports introspection to better understand how the harness is performing.
Information may start collapsed, but things like tool uses and thinking chains can be
introspected. Additionally, token usage is surfaced prominently to show where the harness
is operating inefficiently.
3) The UI is modal and supports neovim-like hotkeys for navigation and configuration
(e.g. using the space bar as a leader key). We prefer hotkeys over adding custom
slash commands (/model) to the text chat interface. The text chat should be reserved for
things that go straight to the underlying model.
## Stack
- **Language:** Rust
- **TUI Framework:** Ratatui + Crossterm
- **Async Runtime:** Tokio
## Architecture
- Channel boundary between TUI and core (fully decoupled)
- Module decomposition: `app`, `tui`, `core`, `provider`, `tools`, `sandbox`, `session`
- Headless mode: core without TUI, driven by script (enables benchmarking and CI)
## Model Integration
- Claude-first, multi-model via `ModelProvider` trait
- Common `StreamEvent` internal representation across providers
- Prompt caching-aware message construction
The coding agent is broken into three main components: the TUI, the harness, and the tool executor.
The harness communicates with the tool executor via a tarpc interface.
The TUI and harness communicate over a channel boundary and are fully decoupled,
so the harness can run without the TUI (i.e. in scripting mode).
## Harness Design
The harness follows a fairly straightforward design loop.
1. Send message to underlying model.
2. If model requests a tool use, execute it (via a call to the executor) and return to 1.
3. Else, wait for further user input.
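The loop above can be sketched in a few lines. This is an illustrative, synchronous sketch; `Model` and `Executor` are stand-in traits for the example, not Skate's actual types.

```rust
// Stand-in types for the sketch; not the harness's real interfaces.
enum ModelReply {
    ToolUse { name: String, input: String },
    Text(String),
}

trait Model {
    fn send(&mut self, msg: &str) -> ModelReply;
}

trait Executor {
    fn call_tool(&self, name: &str, input: &str) -> Result<String, String>;
}

/// One user turn: keep executing requested tools (step 2) until the model
/// answers with plain text (step 3).
fn run_turn(model: &mut dyn Model, exec: &dyn Executor, user_msg: &str) -> String {
    let mut next = user_msg.to_string(); // step 1: send message to the model
    loop {
        match model.send(&next) {
            ModelReply::Text(text) => return text,
            ModelReply::ToolUse { name, input } => {
                // Tool errors are fed back so the model can self-correct.
                next = match exec.call_tool(&name, &input) {
                    Ok(out) => out,
                    Err(err) => format!("tool error: {err}"),
                };
            }
        }
    }
}
```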
### Harness Instantiation
The harness is instantiated with a system prompt and a tarpc client to the tool executor.
(In the first iteration we use an in process channel for the tarpc client).
### Model Integration
The harness uses a trait system to make it agnostic to the underlying coding agent used.
This trait unifies a variety of APIs using a `StreamEvent` interface for streaming responses
from the API.
Currently, only Anthropic's Claude API is supported.
Messages are constructed in such a way to support prompt caching when available.
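An illustrative shape of such a unified streaming interface is below. The variant names are assumptions for the sketch (the plan mentions `InputTokens` and `OutputTokens` explicitly; the rest approximate Anthropic's SSE event kinds):

```rust
// Illustrative StreamEvent shape; variant names are assumptions, not the
// project's actual definition.
#[derive(Debug)]
enum StreamEvent {
    TextDelta(String),
    ToolUseStart { id: String, name: String },
    ToolInputDelta(String),
    InputTokens(u32),
    OutputTokens(u32),
    MessageComplete,
}

/// Example consumer: accumulate streamed text deltas into the final reply.
fn collect_text(events: &[StreamEvent]) -> String {
    events
        .iter()
        .filter_map(|e| match e {
            StreamEvent::TextDelta(t) => Some(t.as_str()),
            _ => None,
        })
        .collect()
}
```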
### Session Logging
- JSONL format, one event per line
- Events: user message, assistant message, tool call, tool result.
- Tree-addressable via parent IDs (enables conversation branching later)
- Token usage stored per event
- Linear UX for now, branching deferred
## Executor Design
The key aspect of the executor design is that it is configured with sandbox permissions
that allow tool use without any user prompting. Either the tool use succeeds within the
sandbox and its result is returned to the model, or it fails and a permission error is
returned to the model. The sandboxing allows running arbitrary shell commands without prompting.
### Executor Interface
The executor interface exposed to the harness has the following methods.
- list_available_tools: takes no arguments and returns tool names, descriptions, and argument schema.
- call_tool: takes a tool name and its arguments and returns either a result or an error.
### Sandboxing
Sandboxing is done using the Linux kernel feature Landlock.
This allows restricting filesystem access (read-only, read/write, or no access)
as well as network access (on/off).
## TUI Design
The bulk of the complexity of this coding agent is pushed to the TUI in this design.
The driving goals of the TUI are:
- Support (neo)vim style keyboard navigation and modal editing.
- Full progressive disclosure of information: high-level information is grokable at a glance,
but full tool use and thinking traces can be expanded.
- Support for instantiating multiple different instances of the core harness (i.e. different
instantiations for code review vs planning vs implementation).
## UI
- **Agent view:** Tree-based hierarchy (not flat tabs) for sub-agent inspection
@@ -24,12 +101,17 @@
- **Status bar:** Mode indicator, current activity (Plan/Execute), token totals, network policy state
## Planning Mode
- Distinct activity from execution — planner agent produces a plan file, does not execute
- Plan file is structured markdown: steps with descriptions, files involved, acceptance criteria
- Plan is reviewable and editable before execution (`:edit-plan` opens `$EDITOR`)
- User explicitly approves plan before execution begins
- Executor agent receives the plan file + project context, not the planning conversation
- Plan-step progress tracked during execution (complete/in-progress/failed)
In planning mode the TUI instantiates a harness with read access to the project directory
and write access to a single plan markdown file.
The TUI then provides a glue mechanism that can pipe that plan into a new instantiation of the
harness in execute mode.
Additionally we specify a schema for "surveys" that allow the model to ask the user questions about
the plan.
We also provide a hotkey (Ctrl+G or :edit-plan) that opens the plan in the user's `$EDITOR`.
## Sub-Agents
- Independent context windows with summary passed back to parent
@@ -38,68 +120,3 @@
- Plan executor is a specialized sub-agent where the plan replaces the summary
- Direct user interaction with sub-agents deferred
## Tool System
- Built-in tool system with `Tool` trait
- Core tools: `read_file`, `write_file`, `edit_file`, `shell_exec`, `list_directory`, `search_files`
- Approval gates by risk level: auto-approve (reads), confirm (writes/shell), deny
- MCP not implemented but interface designed to allow future adapter
## Sandboxing
- **Landlock** (Linux kernel-level):
- Read-only: system-wide (`/`)
- Read-write: project directory, temp directory
- Network: blocked by default, toggleable via `:net on/off`
- Graceful degradation on older kernels
- All tool execution goes through `Sandbox` — tools never touch filesystem directly
## Session Logging
- JSONL format, one event per line
- Events: user message, assistant message, tool call, tool result, sub-agent spawn/result, plan created, plan step status
- Tree-addressable via parent IDs (enables conversation branching later)
- Token usage stored per event
- Linear UX for now, branching deferred
## Testing Strategy
### Unit Tests
- **`provider`:** SSE stream parsing from byte fixtures, message/tool serialization, `StreamEvent` variant correctness
- **`tools`:** Path canonicalization, traversal prevention, risk level classification, registry dispatch
- **`sandbox`:** Landlock policy construction, path validation logic (without applying kernel rules)
- **`core`:** Conversation tree operations (insert, query by parent, turn computation, token totals), orchestrator state machine transitions against mock `StreamEvent` sequences
- **`session`:** JSONL serialization roundtrips, parent ID chain reconstruction
- **`tui`:** Widget rendering via Ratatui `TestBackend`
### Integration Tests — Component Boundaries
- **Core ↔ Provider:** Mock `ModelProvider` replaying recorded API sessions (full SSE streams with tool use). Tests the complete orchestration loop deterministically without network.
- **Core ↔ TUI (channel boundary):** Orchestrator with mock provider connected to channels. Assert correct `UIEvent` sequence, inject `UserAction` messages, verify approval/denial flow.
- **Tools ↔ Sandbox:** Real file operations and shell commands in temp directories. Verify write confinement, path traversal rejection, network blocking. Skip Landlock-specific tests on older kernels in CI.
### Integration Tests — End to End
- **Recorded session replay:** Capture real Claude API HTTP request/response pairs, replay deterministically. Exercises full stack (core + channel + mock TUI) without cost or network dependency. Primary E2E test strategy.
- **Live API tests:** Small suite behind feature flag / env var. Verifies real API integration. Run manually before releases, not in CI.
### Benchmarking — SWE-bench
- **Target:** SWE-bench Verified (500 curated problems) as primary benchmark
- **Secondary:** SWE-bench Pro for testing planning mode on longer-horizon tasks
- **Approach:** Headless mode (core without TUI) produces unified diff patches, evaluated via SWE-bench Docker harness
- **Baseline:** mini-swe-agent (~100 lines Python, >74% on Verified) as calibration — if we score significantly below with same model, the issue is scaffolding
- **Cadence:** Milestone checks, not continuous CI (too expensive/slow)
- **Requirements:** x86_64, 120GB+ storage, 16GB RAM, 8 CPU cores
### Test Sequencing
- Phase 1: Unit tests for SSE parser, event types, message serialization
- Phase 4: Recorded session replay infrastructure (core loop complex enough to warrant it)
- Phase 6-7: Headless mode + first SWE-bench Verified run
## Configuration (Deferred)
- Single-user, hardcoded defaults for now
- Designed for later: global config, per-project `.agent.toml`, configurable keybindings
## Deferred Features
- Conversation branching (tree structure in log, linear UX for now)
- Direct sub-agent interaction
- MCP adapter
- Full markdown/syntax-highlighted rendering
- Session log viewer
- Per-project configuration
- Structured plan editor in TUI (use `$EDITOR` for now)

694
PLAN.md

@@ -1,96 +1,616 @@
# Implementation Plan
# Skate Implementation Plan
## Phase 4: Sandboxing
This plan closes the gaps between the current codebase and the goals stated in DESIGN.md.
The phases are ordered by dependency -- each phase builds on the previous.
### Step 4.1: Create sandbox module with policy types and tracing foundation
- `SandboxPolicy` struct: read-only paths, read-write paths, network allowed bool
- `Sandbox` struct holding policy + working dir
- Add `tracing` spans and events throughout from the start:
- `#[instrument]` on all public `Sandbox` methods
- `debug!` on policy construction with path lists
- `info!` on sandbox creation with full policy summary
- No enforcement yet, just the type skeleton and module wiring
- **Files:** new `src/sandbox/mod.rs`, `src/sandbox/policy.rs`
- **Done when:** compiles, unit tests for policy construction, `RUST_LOG=debug cargo test` shows sandbox trace output
## Current State Summary
### Step 4.2: Landlock policy builder with startup gate and tracing
- Translate `SandboxPolicy` into Landlock ruleset using `landlock` crate
- Kernel requirements:
- **ABI v4 (kernel 6.7+):** minimum required -- provides both filesystem and network sandboxing
- ABI 1-3 have filesystem only, no network restriction -- tools could exfiltrate data freely
- Startup behavior -- on launch, check Landlock ABI version:
- ABI >= 4: proceed normally (full filesystem + network sandboxing)
- ABI < 4 (including unsupported): **refuse to start** with clear error: "Landlock ABI v4+ required (kernel 6.7+). Use --yolo to run without sandboxing."
- `--yolo` flag: skip all Landlock enforcement, log `warn!` at startup, show "UNSANDBOXED" in status bar permanently
- Landlock applied per-child-process via `pre_exec`, NOT to the main process
- Main process needs unrestricted network (Claude API) and filesystem (provider)
- Each `exec_command` child gets the current policy at spawn time
- `:net on/off` takes effect on the next spawned command
- Tracing:
- `info!` on kernel ABI version detected
- `debug!` for each rule added to ruleset (path, access flags)
- `warn!` on `--yolo` mode ("running without kernel sandboxing")
- `error!` if ruleset creation fails unexpectedly
- **Files:** `src/sandbox/landlock.rs`, add `landlock` dep to `Cargo.toml`, update CLI args in `src/app/`
- **Done when:** unit test constructs ruleset without panic; `--yolo` flag works on unsupported kernel; startup refuses without flag on unsupported kernel
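The startup gate above can be sketched as a pure decision function (a hypothetical helper; the real check would query the kernel's detected Landlock ABI version via the `landlock` crate before constructing any ruleset):

```rust
/// Sketch of the startup gate: `abi` is the detected Landlock ABI version
/// (0 when Landlock is unsupported). Returns Ok(true) for sandboxed
/// operation, Ok(false) for --yolo mode, or Err with the refusal message.
fn startup_gate(abi: u32, yolo: bool) -> Result<bool, String> {
    if abi >= 4 {
        Ok(true) // full filesystem + network sandboxing
    } else if yolo {
        Ok(false) // proceed unsandboxed; status bar shows UNSANDBOXED
    } else {
        Err("Landlock ABI v4+ required (kernel 6.7+). \
             Use --yolo to run without sandboxing."
            .to_string())
    }
}
```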
Phase 0 (core loop) is functionally complete: the TUI renders conversations, the
orchestrator drives the Claude API, tools execute inside a Landlock sandbox, and the
channel boundary between TUI and core is properly maintained.
### Step 4.3: Sandbox file I/O API with operation tracing
- `Sandbox::read_file`, `Sandbox::write_file`, `Sandbox::list_directory`
- Move `validate_path` from `src/tools/mod.rs` into sandbox
- Tracing:
- `debug!` on every file operation: requested path, canonical path, allowed/denied
- `trace!` for path validation steps (join, canonicalize, starts_with check)
- `warn!` on path escape attempts (log the attempted path for debugging)
- `debug!` on successful operations with bytes read/written
- **Files:** `src/sandbox/mod.rs`
- **Done when:** unit tests in tempdir pass; path traversal rejected; `RUST_LOG=trace` shows full path resolution chain
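The join/canonicalize/starts_with chain can be sketched with the standard library alone. This is an illustrative version, not the existing `validate_path`; it assumes `root` is already canonical, and falls back to canonicalizing the parent for files that do not exist yet (writes):

```rust
use std::path::{Path, PathBuf};

/// Resolve `requested` under `root` and reject anything that escapes it.
/// Assumes `root` is already canonical.
fn validate_path(root: &Path, requested: &str) -> Result<PathBuf, String> {
    let joined = root.join(requested);
    // Canonicalize resolves `..` and symlinks. If the file does not exist
    // yet (e.g. a pending write), canonicalize its parent instead.
    let canonical = match joined.canonicalize() {
        Ok(p) => p,
        Err(_) => {
            let parent = joined
                .parent()
                .ok_or_else(|| "path has no parent".to_string())?
                .canonicalize()
                .map_err(|e| e.to_string())?;
            let name = joined
                .file_name()
                .ok_or_else(|| "path has no file name".to_string())?;
            parent.join(name)
        }
    };
    if canonical.starts_with(root) {
        Ok(canonical)
    } else {
        Err(format!("path escapes sandbox: {}", canonical.display()))
    }
}
```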
The major gaps are:
### Step 4.4: Sandbox command execution with process tracing
- `Sandbox::exec_command(cmd, args, working_dir)` spawns child process with Landlock applied
- Captures stdout/stderr, enforces timeout
- Tracing:
- `info!` on command spawn: command, args, working_dir, timeout
- `debug!` on command completion: exit code, stdout/stderr byte lengths, duration
- `warn!` on non-zero exit codes
- `error!` on timeout or spawn failure with full context
- `trace!` for Landlock application to child process thread
- **Files:** `src/sandbox/mod.rs` or `src/sandbox/exec.rs`
- **Done when:** unit test runs `echo hello` in tempdir; write outside sandbox fails (on supported kernels)
1. Tool executor tarpc interface -- the orchestrator calls tools directly rather than
via a tarpc client/server split as DESIGN.md specifies. This is the biggest
structural gap and a prerequisite for sub-agents (each agent gets its own client).
2. Session logging (JSONL, tree-addressable) -- no `session/` module exists yet.
3. Token tracking -- counts are debug-logged but not surfaced to the user.
4. TUI introspection -- tool blocks and thinking traces cannot be expanded/collapsed.
5. Status bar is sparse -- no token totals, no activity mode, no network state badge.
6. Planning Mode -- no dedicated harness instantiation with restricted sandbox.
7. Sub-agents -- no spawning mechanism, no independent context windows.
8. Space-bar leader key and which-key help overlay are absent.
### Step 4.5: Wire tools through Sandbox
- Change `Tool::execute` signature to accept `&Sandbox` instead of (or in addition to) `&Path`
- Update all 4 built-in tools to call `Sandbox` methods instead of `std::fs`/`std::process::Command`
- Remove direct `std::fs` usage from tool implementations
- Update `ToolRegistry` and orchestrator to pass `Sandbox`
- Tracing: tools now inherit sandbox spans automatically via `#[instrument]`
- **Files:** `src/tools/*.rs`, `src/tools/mod.rs`, `src/core/orchestrator.rs`
- **Done when:** all existing tool tests pass through Sandbox; no direct `std::fs` in tool files; `RUST_LOG=debug cargo run` shows sandbox operations during tool execution
---
### Step 4.6: Network toggle
- `network_allowed: bool` in `SandboxPolicy`
- `:net on/off` TUI command parsed in input handler, sent as `UserAction::SetNetworkPolicy(bool)`
- Orchestrator updates `Sandbox` policy. Status bar shows network state.
- Only available when Landlock ABI >= 4 (kernel 6.7+); command hidden otherwise
- Status bar shows: network state when available, "UNSANDBOXED" in `--yolo` mode
- Tracing: `info!` on network policy change
- **Files:** `src/tui/input.rs`, `src/tui/render.rs`, `src/core/types.rs`, `src/core/orchestrator.rs`, `src/sandbox/mod.rs`
- **Done when:** toggling `:net` updates status bar; Landlock network restriction applied on ABI >= 4
## Phase 1 -- Tool Executor tarpc Interface
### Step 4.7: Integration tests
- Tools + Sandbox in tempdir: write confinement, path traversal rejection, shell command confinement
- Skip Landlock-specific assertions on ABI < 4
- Test `--yolo` mode: sandbox constructed but no kernel enforcement
- Test startup gate: verify error on ABI < 4 without `--yolo`
- Tests should assert tracing output where relevant (use `tracing-test` crate or `tracing_subscriber::fmt::TestWriter`)
- **Files:** `tests/sandbox.rs`
- **Done when:** `cargo test --test sandbox` passes
**Goal:** Introduce the harness/executor split described in DESIGN.md. The executor
owns the `ToolRegistry` and `Sandbox`; the orchestrator (harness) communicates with
it exclusively through a tarpc client. In this phase the transport is in-process
(tarpc's unbounded channel pair), laying the groundwork for out-of-process execution
in a later phase.
### Phase 4 verification (end-to-end)
1. `cargo test` -- all tests pass
2. `cargo clippy -- -D warnings` -- zero warnings
3. `RUST_LOG=debug cargo run -- --project-dir .` -- ask Claude to read a file, observe sandbox trace logs showing path validation and Landlock policy
4. Ask Claude to write a file outside project dir -- sandbox denies with `warn!` log
5. Ask Claude to run a shell command -- observe command spawn/completion trace
6. `:net off` then ask for network access -- verify blocked
7. Without `--yolo` on ABI < 4: verify startup refuses with clear error
8. With `--yolo`: verify startup succeeds, "UNSANDBOXED" in status bar, `warn!` in logs
This is the largest structural change in the plan. Every subsequent phase benefits
from the cleaner boundary: sub-agents each get their own executor client (Phase 7),
and the sandbox policy becomes a constructor argument to the executor rather than
something threaded through the orchestrator.
### 1.1 Define the tarpc service
Create `src/executor/mod.rs`:
```rust
#[tarpc::service]
pub trait Executor {
/// Return the full list of tools this executor exposes, including their
/// JSON Schema input descriptors. The harness calls this once at startup
/// and caches the result for the lifetime of the conversation.
async fn list_tools() -> Vec<ToolDefinition>;
/// Invoke a single tool by name with a JSON-encoded argument object.
/// Returns the text content to feed back to the model, or an error string
/// that is also fed back (so the model can self-correct).
async fn call_tool(name: String, input: serde_json::Value) -> Result<String, String>;
}
```
`ToolDefinition` is already defined in `core/types.rs` and is provider-agnostic --
no new types are needed on the wire.
### 1.2 Implement `ExecutorServer`
Still in `src/executor/mod.rs`, add:
```rust
pub struct ExecutorServer {
registry: ToolRegistry,
sandbox: Arc<Sandbox>,
}
impl ExecutorServer {
pub fn new(registry: ToolRegistry, sandbox: Sandbox) -> Self { ... }
}
impl Executor for ExecutorServer {
async fn list_tools(self, _: Context) -> Vec<ToolDefinition> {
self.registry.definitions()
}
async fn call_tool(self, _: Context, name: String, input: Value) -> Result<String, String> {
match self.registry.get(&name) {
None => Err(format!("unknown tool: {name}")),
Some(tool) => tool
.execute(input, &self.sandbox)
.await
.map_err(|e| e.to_string()),
}
}
}
```
The `Arc<Sandbox>` is required because tarpc clones the server struct per request.
### 1.3 In-process transport helper
Add a function to `src/executor/mod.rs` (and re-export from `src/app/mod.rs`) that
wires an `ExecutorServer` to a client over tarpc's in-memory channel:
```rust
/// Spawn an ExecutorServer on the current tokio runtime and return a client
/// connected to it via an in-process channel. The server task runs until
/// the client is dropped.
pub fn spawn_local(server: ExecutorServer) -> ExecutorClient {
let (client_transport, server_transport) = tarpc::transport::channel::unbounded();
let channel = tarpc::server::BaseChannel::with_defaults(server_transport);
tokio::spawn(channel.execute(server.serve()));
ExecutorClient::new(tarpc::client::Config::default(), client_transport).spawn()
}
```
### 1.4 Refactor `Orchestrator` to use the client
Currently `Orchestrator<P>` holds `ToolRegistry` and `Sandbox` directly and calls
`tool.execute(input, &sandbox)` in `run_turn`. Replace these fields with:
```rust
executor: ExecutorClient,
tool_definitions: Vec<ToolDefinition>, // fetched once at construction
```
`run_turn` changes from direct tool dispatch to:
```rust
let result = self.executor
.call_tool(context::current(), name, input)
.await;
```
The `tool_definitions` vec is passed to `provider.stream()` instead of being built
from the registry on each call.
### 1.5 Update `app/mod.rs`
Replace the inline construction of `ToolRegistry + Sandbox` in `app::run` with:
```rust
let registry = build_tool_registry();
let sandbox = Sandbox::new(policy, project_dir, enforcement)?;
let executor = executor::spawn_local(ExecutorServer::new(registry, sandbox));
let orchestrator = Orchestrator::new(provider, executor, system_prompt);
```
### 1.6 Tests
- Unit: `ExecutorServer::call_tool` with a mock `ToolRegistry` returns correct
output and maps errors to `Err(String)`.
- Integration: `spawn_local` -> `client.call_tool` round-trip through the in-process
channel executes a real `read_file` against a temp dir.
- Integration: existing orchestrator integration tests continue to pass after the
refactor (the mock provider path is unchanged; only tool dispatch changes).
### 1.7 Files touched
| Action | File |
|--------|------|
| New | `src/executor/mod.rs` |
| Modified | `src/core/orchestrator.rs` -- remove registry/sandbox, add executor client |
| Modified | `src/app/mod.rs` -- construct executor, pass client to orchestrator |
| Modified | `Cargo.toml` -- add `tarpc` with `tokio1` feature |
New dependency: `tarpc` (with `tokio1` and `serde-transport` features).
---
## Phase 2 -- Session Logging
**Goal:** Persist every event to a JSONL file. This is the foundation for token
accounting, session resume, and future conversation branching.
### 2.1 Add `src/session/` module
Create `src/session/mod.rs` with the following public surface:
```rust
pub struct SessionWriter { ... }
impl SessionWriter {
/// Open (or create) a JSONL log at the given path in append mode.
pub async fn open(path: &Path) -> Result<Self, SessionError>;
/// Append one event. Never rewrites history.
pub async fn append(&self, event: &LogEvent) -> Result<(), SessionError>;
}
pub struct SessionReader { ... }
impl SessionReader {
pub async fn load(path: &Path) -> Result<Vec<LogEvent>, SessionError>;
}
```
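The append-only contract can be illustrated with a synchronous std-only sketch (the real `SessionWriter` is async on tokio and serializes `LogEvent` with serde; this hypothetical helper takes an already-serialized line):

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Append one pre-serialized JSON event as a line. Opens in append mode so
/// history is never rewritten.
fn append_jsonl(path: &Path, json_line: &str) -> std::io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    // One event per line is what makes the log greppable and resumable.
    writeln!(file, "{json_line}")
}
```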
### 2.2 Define `LogEvent`
```rust
pub struct LogEvent {
pub id: Uuid,
pub parent_id: Option<Uuid>,
pub timestamp: DateTime<Utc>,
pub payload: LogPayload,
pub token_usage: Option<TokenUsage>,
}
pub enum LogPayload {
UserMessage { content: String },
AssistantMessage { content: Vec<ContentBlock> },
ToolCall { tool_name: String, input: serde_json::Value },
ToolResult { tool_use_id: String, content: String, is_error: bool },
}
pub struct TokenUsage {
pub input: u32,
pub output: u32,
pub cache_read: Option<u32>,
pub cache_write: Option<u32>,
}
```
`id` and `parent_id` form a tree that enables future branching. For now the
conversation is linear so `parent_id` is always the id of the previous event.
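Parent-chain reconstruction after loading is a short walk over the events. A minimal std-only sketch, with `u64` ids standing in for `Uuid`:

```rust
use std::collections::HashMap;

// Minimal stand-in for LogEvent: only the tree fields matter here.
struct Event {
    id: u64,
    parent_id: Option<u64>,
}

/// Rebuild the linear chain, oldest first, by walking parent links back
/// from the leaf (the one event nothing else points at).
fn chain(events: &[Event]) -> Vec<u64> {
    let by_id: HashMap<u64, &Event> = events.iter().map(|e| (e.id, e)).collect();
    let leaf = events
        .iter()
        .find(|e| !events.iter().any(|o| o.parent_id == Some(e.id)))
        .expect("non-empty log has a leaf");
    let mut out = vec![leaf.id];
    let mut cur = leaf;
    while let Some(pid) = cur.parent_id {
        cur = by_id[&pid];
        out.push(cur.id);
    }
    out.reverse();
    out
}
```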
### 2.3 Wire into Orchestrator
- `Orchestrator` holds an `Option<SessionWriter>`.
- Every time the orchestrator pushes to `ConversationHistory` it also appends a
`LogEvent`. Token counts from `StreamEvent::InputTokens` / `OutputTokens` are
stored on the final assistant event of each turn.
- Session file lives at `.skate/sessions/<timestamp>.jsonl`.
### 2.4 Tests
- Unit: `SessionWriter::append` then `SessionReader::load` round-trips all payload
variants.
- Unit: parent_id chain is correct across a simulated multi-turn exchange.
- Integration: run the orchestrator with a mock provider against a temp dir; assert
the JSONL file is written.
---
## Phase 3 -- Token Tracking & Status Bar
**Goal:** Surface token usage in the TUI per-turn and cumulatively.
### 3.1 Per-turn token counts in UIEvent
Add a variant to `UIEvent`:
```rust
UIEvent::TurnComplete { input_tokens: u32, output_tokens: u32 }
```
The orchestrator already receives `StreamEvent::InputTokens` and `OutputTokens`;
it should accumulate them during a turn and emit them in `TurnComplete`.
### 3.2 AppState token counters
Add to `AppState`:
```rust
pub turn_input_tokens: u32,
pub turn_output_tokens: u32,
pub total_input_tokens: u64,
pub total_output_tokens: u64,
```
`events.rs` updates these on `TurnComplete`.
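The update rule is small: per-turn counts are replaced, session totals accumulate. A sketch of the counters in isolation (field names follow the `AppState` additions above):

```rust
#[derive(Default)]
struct TokenCounters {
    turn_input_tokens: u32,
    turn_output_tokens: u32,
    total_input_tokens: u64,
    total_output_tokens: u64,
}

impl TokenCounters {
    /// Handle a TurnComplete event: replace the per-turn counts and add
    /// them into the session-cumulative totals.
    fn on_turn_complete(&mut self, input: u32, output: u32) {
        self.turn_input_tokens = input;
        self.turn_output_tokens = output;
        self.total_input_tokens += u64::from(input);
        self.total_output_tokens += u64::from(output);
    }
}
```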
### 3.3 Status bar redesign
The status bar currently shows only the mode indicator. Expand it to four sections:
```
[ MODE ] [ ACTIVITY ] [ i:1234 o:567 | total i:9999 o:2345 ] [ NET: off ]
```
- **MODE** -- Normal / Insert / Command
- **ACTIVITY** -- Plan / Execute (Phase 6 adds Plan; for now always "Execute")
- **Tokens** -- per-turn input/output, then session cumulative
- **NET** -- `on` (green) or `off` (red) reflecting `network_allowed`
Update `render.rs` to implement this layout using Ratatui `Layout::horizontal`.
### 3.4 Tests
- Unit: `AppState` accumulates totals correctly across multiple `TurnComplete` events.
- TUI snapshot test (TestBackend): status bar renders all four sections with correct
content after a synthetic `TurnComplete`.
---
## Phase 4 -- TUI Introspection (Expand/Collapse)
**Goal:** Support progressive disclosure -- tool calls and thinking traces start
collapsed; the user can expand them.
### 4.1 Block model
Replace the flat `Vec<DisplayMessage>` in `AppState` with a `Vec<DisplayBlock>`:
```rust
pub enum DisplayBlock {
UserMessage { content: String },
AssistantText { content: String },
ToolCall {
display: ToolDisplay,
result: Option<String>,
expanded: bool,
},
Error { message: String },
}
```
### 4.2 Navigation in Normal mode
Add block-level cursor to `AppState`:
```rust
pub focused_block: Option<usize>,
```
Keybindings (Normal mode):
| Key | Action |
|-----|--------|
| `[` | Move focus to previous block |
| `]` | Move focus to next block |
| `Enter` or `Space` | Toggle `expanded` on focused ToolCall block |
| `j` / `k` | Line scroll (unchanged) |
The focused block is highlighted with a distinct border color.
### 4.3 Render changes
`render.rs` must calculate the height of each `DisplayBlock` depending on whether
it is collapsed (1-2 summary lines) or expanded (full content). The scroll offset
operates on rendered terminal rows, not message indices.
Collapsed tool call shows: `> tool_name(arg_summary) -- result_summary`
Expanded tool call shows: full input and output as formatted by `tool_display.rs`.
### 4.4 Tests
- Unit: toggling `expanded` on a `ToolCall` block changes height calculation.
- TUI snapshot: collapsed vs expanded render output for `WriteFile` and `ShellExec`.
---
## Phase 5 -- Space-bar Leader Key & Which-Key Overlay
**Goal:** Support vim-style `<Space>` leader chords for configuration actions. This
replaces the `:net on` / `:net off` text commands with discoverable hotkeys.
### 5.1 Leader key state machine
Extend `AppState` with:
```rust
pub leader_active: bool,
pub leader_timeout: Option<Instant>,
```
In Normal mode, pressing `Space` sets `leader_active = true` and starts a 1-second
timeout. The next key is dispatched through the chord table. If the timeout fires
or an unbound key is pressed, leader mode is cancelled with a brief status message.
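The state machine has few enough transitions to sketch directly. This is an illustrative, event-driven version (names are assumptions; timeout is checked lazily on the next keypress rather than by a timer, and the chord table is inlined):

```rust
use std::time::{Duration, Instant};

const LEADER_TIMEOUT: Duration = Duration::from_secs(1);

enum LeaderResult {
    Entered,                 // Space pressed, waiting for a chord
    Chord(&'static str),     // bound chord matched
    Cancelled,               // timeout or unbound key
    Passthrough,             // not in leader mode, handle normally
}

#[derive(Default)]
struct LeaderState {
    active_since: Option<Instant>,
}

impl LeaderState {
    fn on_key(&mut self, key: char, now: Instant) -> LeaderResult {
        match self.active_since {
            None if key == ' ' => {
                self.active_since = Some(now);
                LeaderResult::Entered
            }
            None => LeaderResult::Passthrough,
            Some(t0) => {
                self.active_since = None; // leader mode ends either way
                if now.duration_since(t0) > LEADER_TIMEOUT {
                    return LeaderResult::Cancelled; // timeout fired first
                }
                match key {
                    'n' => LeaderResult::Chord("toggle-network"),
                    'c' => LeaderResult::Chord("clear-history"),
                    'p' => LeaderResult::Chord("plan-mode"),
                    '?' => LeaderResult::Chord("which-key"),
                    _ => LeaderResult::Cancelled, // unbound key cancels
                }
            }
        }
    }
}
```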
### 5.2 Initial chord table
| Chord | Action |
|-------|--------|
| `<Space> n` | Toggle network policy |
| `<Space> c` | Clear history (`:clear`) |
| `<Space> p` | Switch to Plan mode (Phase 6) |
| `<Space> ?` | Toggle which-key overlay |
### 5.3 Which-key overlay
A centered popup rendered over the output pane that lists all available chords and
their descriptions. Rendered only when `leader_active = true` (after a short delay,
~200 ms, to avoid flicker during fast typing).
### 5.4 Remove `:net on/off` from command parser
Once leader-key network toggle is in place, remove the text-command duplicates to
keep the command palette small and focused.
### 5.5 Tests
- Unit: leader key state machine transitions (activate, timeout, chord match, cancel).
- TUI snapshot: which-key overlay renders with correct chord list.
---
## Phase 6 -- Planning Mode
**Goal:** A dedicated planning harness with restricted sandbox that writes a single
plan file, plus a mechanism to pipe the plan into an execute harness.
### 6.1 Plan harness sandbox policy
In planning mode the orchestrator is instantiated with a `SandboxPolicy` that grants:
- `/` -- ReadOnly (same as execute)
- `<project_dir>/.skate/plan.md` -- ReadWrite (only this file)
- Network -- off
All other write attempts fail with a sandbox permission error returned to the model.
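The planning-mode policy reduces to a small constructor. A sketch using a stand-in policy type with the fields described in this plan (the struct and function names here are assumptions):

```rust
use std::path::{Path, PathBuf};

// Stand-in for the SandboxPolicy described in the plan.
struct SandboxPolicy {
    read_only: Vec<PathBuf>,
    read_write: Vec<PathBuf>,
    network_allowed: bool,
}

/// Planning-mode policy: read everything, write only the plan file,
/// no network.
fn plan_policy(project_dir: &Path) -> SandboxPolicy {
    SandboxPolicy {
        read_only: vec![PathBuf::from("/")],
        read_write: vec![project_dir.join(".skate/plan.md")],
        network_allowed: false,
    }
}
```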
### 6.2 Survey tool
Add a new tool `ask_user` that allows the model to present structured questions to
the user during planning:
```rust
// Input schema
{
"question": "string",
"options": ["string"] | null // null means free-text answer
}
```
The orchestrator sends a new `UIEvent::SurveyRequest { question, options }`. The TUI
renders an inline prompt. The user's answer is sent back as a `UserAction::SurveyResponse`.
### 6.3 TUI activity mode
`AppState` gets:
```rust
pub activity: Activity,
pub enum Activity { Plan, Execute }
```
Switching activity (via `<Space> p`) instantiates a new orchestrator on a fresh
channel pair. The old orchestrator is shut down cleanly. The status bar ACTIVITY
section updates.
### 6.4 Plan -> Execute handoff
When the user is satisfied with the plan (`<Space> x` or `:exec`):
1. TUI reads `.skate/plan.md`.
2. Constructs a new system prompt: `<original system prompt>\n\n## Plan\n<plan content>`.
3. Instantiates an Execute orchestrator with the full sandbox policy and the
augmented system prompt.
4. Transitions `activity` to `Execute`.
The old Plan orchestrator is dropped.
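Step 2 of the handoff is simple enough to pin down as a pure function (illustrative; the function name is an assumption), which is also what the unit test for the augmented system prompt would exercise:

```rust
/// Build the execute-mode system prompt from the original prompt plus the
/// reviewed plan file contents.
fn augment_system_prompt(original: &str, plan: &str) -> String {
    format!("{original}\n\n## Plan\n{plan}")
}
```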
### 6.5 Edit plan in $EDITOR
Hotkey `<Space> e` (or `:edit-plan`) suspends the TUI (restores terminal), opens
`$EDITOR` on `.skate/plan.md`, then resumes the TUI after the editor exits.
### 6.6 Tests
- Integration: plan harness rejects write to a file other than plan.md.
- Integration: survey tool round-trip through channel boundary.
- Unit: plan -> execute handoff produces correct augmented system prompt.
---
## Phase 7 -- Sub-Agents
**Goal:** The model can spawn independent sub-agents with their own context windows.
Results are summarised and returned to the parent.
### 7.1 `spawn_agent` tool
Add a new tool with input schema:
```rust
{
"task": "string", // instruction for the sub-agent
"sandbox": { // optional policy overrides
"network": bool,
"extra_write_paths": ["string"]
}
}
```
### 7.2 Sub-agent lifecycle
When `spawn_agent` executes:
1. Create a new `Orchestrator` with an independent conversation history.
2. The sub-agent's system prompt is the parent's system prompt plus the task
description.
3. The sub-agent runs autonomously (no user interaction) until it emits a
`UserAction::Quit` equivalent or hits `MAX_TOOL_ITERATIONS`.
4. The final assistant message is returned as the tool result (the "summary").
5. The sub-agent's session is logged to a child JSONL file linked to the parent
session by a `parent_session_id` field.
### 7.3 TUI sub-agent view
The agent tree is accessible via `<Space> a`. A side panel shows:
```
Parent
+-- sub-agent 1 [running]
+-- sub-agent 2 [done]
```
Pressing Enter on a sub-agent opens a read-only replay of its conversation (scroll
only, no input). This is a stretch goal within this phase -- the core spawning
mechanism is the priority.
### 7.4 Tests
- Integration: spawn_agent with a mock provider runs to completion and returns a
summary string.
- Unit: sub-agent session file has correct parent_session_id link.
- Unit: MAX_TOOL_ITERATIONS limit is respected within sub-agents.
In this phase `spawn_agent` gains a natural implementation: it calls
`executor::spawn_local` with a new `ExecutorServer` configured for the child policy,
constructs a new `Orchestrator` with that client, and runs it to completion. The
tarpc boundary from Phase 1 makes this straightforward.
---
## Phase 8 -- Prompt Caching
**Goal:** Use Anthropic's prompt caching to reduce cost and latency on long
conversations. DESIGN.md notes this as a desired property of message construction.
### 8.1 Cache breakpoints
The Anthropic API supports `"cache_control": {"type": "ephemeral"}` on message
content blocks. The optimal strategy is to mark the last user message of the longest
stable prefix as a cache write point.
In `provider/claude.rs`, when serializing the messages array:
- Mark the system prompt content block with `cache_control` (it never changes).
- Mark the penultimate user message with `cache_control` (the conversation history
that is stable for the current turn).
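Choosing the breakpoint is an index computation over the messages array. A simplified sketch with a stand-in message type (the real provider works on Anthropic message structs; the system prompt is handled separately):

```rust
#[derive(Clone, Copy, PartialEq)]
enum Role {
    User,
    Assistant,
}

struct Msg {
    role: Role,
}

/// Index of the penultimate user message: the end of the stable prefix to
/// mark with cache_control for the current turn. None if the conversation
/// is too short to have a stable cached prefix.
fn cache_breakpoint(messages: &[Msg]) -> Option<usize> {
    let user_indices: Vec<usize> = messages
        .iter()
        .enumerate()
        .filter(|(_, m)| m.role == Role::User)
        .map(|(i, _)| i)
        .collect();
    let n = user_indices.len();
    if n >= 2 { Some(user_indices[n - 2]) } else { None }
}
```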
### 8.2 Cache token tracking
The `TokenUsage` struct in `session/` already reserves `cache_read` and
`cache_write` fields. `StreamEvent` must be extended:
```rust
StreamEvent::CacheReadTokens(u32),
StreamEvent::CacheWriteTokens(u32),
```
The Anthropic `message_start` event contains `usage.cache_read_input_tokens` and
`usage.cache_creation_input_tokens`. Parse these and emit the new variants.
### 8.3 Status bar update
Add cache tokens to the status bar display: `i:1234(c:800) o:567`.
### 8.4 Tests
- Provider unit test: replay a fixture that contains cache token fields; assert the
new StreamEvent variants are emitted.
- Snapshot test: status bar renders cache token counts correctly.
---
## Dependency Graph
```
Phase 1 (tarpc executor)
|
+-- Phase 2 (session logging) -- orchestrator refactor is complete
| |
| +-- Phase 3 (token tracking) -- requires session TokenUsage struct
| |
| +-- Phase 7 (sub-agents) -- requires session parent_session_id
|
+-- Phase 7 (sub-agents) -- spawn_local reuse is natural after Phase 1
Phase 4 (expand/collapse) -- independent, can be done alongside Phase 3
Phase 5 (leader key) -- independent, prerequisite for Phase 6
Phase 6 (planning mode) -- requires Phase 5 (leader key chord <Space> p)
-- benefits from Phase 1 (separate executor per activity)
Phase 8 (prompt caching) -- requires Phase 3 (cache token display)
```
Recommended order: 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 8, with 7 after 2 and 6.
---
## Files Touched Per Phase
| Phase | New Files | Modified Files |
|-------|-----------|----------------|
| 1 | `src/executor/mod.rs` | `src/core/orchestrator.rs`, `src/core/types.rs`, `src/app/mod.rs`, `Cargo.toml` |
| 2 | `src/session/mod.rs` | `src/core/orchestrator.rs`, `src/app/mod.rs` |
| 3 | -- | `src/core/types.rs`, `src/core/orchestrator.rs`, `src/tui/events.rs`, `src/tui/render.rs` |
| 4 | -- | `src/tui/mod.rs`, `src/tui/render.rs`, `src/tui/events.rs`, `src/tui/input.rs` |
| 5 | -- | `src/tui/input.rs`, `src/tui/render.rs`, `src/tui/mod.rs` |
| 6 | `src/tools/ask_user.rs` | `src/core/types.rs`, `src/core/orchestrator.rs`, `src/tui/mod.rs`, `src/tui/input.rs`, `src/tui/render.rs`, `src/app/mod.rs` |
| 7 | -- | `src/executor/mod.rs`, `src/core/orchestrator.rs`, `src/tui/render.rs`, `src/tui/input.rs` |
| 8 | -- | `src/provider/claude.rs`, `src/core/types.rs`, `src/session/mod.rs`, `src/tui/render.rs` |
---
## New Dependencies
| Crate | Phase | Reason |
|-------|-------|--------|
| `tarpc` | 1 | RPC service trait + in-process transport |
| `uuid` | 2 | LogEvent ids |
| `chrono` | 2 | Event timestamps (check if already transitive) |
No other new dependencies are needed. All other required functionality
(`serde_json`, `tokio`, `ratatui`, `tracing`) is already present.