Update the design and PLAN.md (#11)
Reviewed-on: #11 Co-authored-by: Drew Galbraith <drew@tiramisu.one> Co-committed-by: Drew Galbraith <drew@tiramisu.one>
parent 669e05b716
commit 7420755800
2 changed files with 707 additions and 170 deletions
183
DESIGN.md
@@ -1,19 +1,96 @@
# Design Decisions
|
||||
# Skate Design
|
||||
|
||||
This is a TUI coding agent harness built for one user. Compared to other coding agents,
its distinguishing design goals are:

1) Allow autonomous execution without permission prompts, without fully sacrificing security.
The user configures the coding agent's permissions before execution, and these
are enforced using kernel-level sandboxing.

2) The UI supports introspection to better understand how the harness is performing.
Information may start collapsed, but tool uses and thinking chains can be expanded
and inspected. Additionally, token usage is surfaced prominently to show where the harness
is performing inefficiently.

3) The UI is modal and supports Neovim-like hotkeys for navigation and configuration
(e.g. using the space bar as a leader key). We prefer hotkeys over adding custom
slash commands (/model) to the text chat interface. The text chat should be reserved for
things that go straight to the underlying model.
|
||||
|
||||
## Stack
|
||||
- **Language:** Rust
|
||||
- **TUI Framework:** Ratatui + Crossterm
|
||||
- **Async Runtime:** Tokio
|
||||
|
||||
## Architecture
|
||||
- Channel boundary between TUI and core (fully decoupled)
|
||||
- Module decomposition: `app`, `tui`, `core`, `provider`, `tools`, `sandbox`, `session`
|
||||
- Headless mode: core without TUI, driven by script (enables benchmarking and CI)
|
||||
|
||||
## Model Integration
|
||||
- Claude-first, multi-model via `ModelProvider` trait
|
||||
- Common `StreamEvent` internal representation across providers
|
||||
- Prompt caching-aware message construction
|
||||
The coding agent is broken into three main components: the TUI, the harness, and the tool executor.
|
||||
|
||||
The harness communicates with the tool executor via a tarpc interface.
|
||||
|
||||
The TUI and harness communicate over a channel boundary and are fully decoupled,
so the harness can run without the TUI (i.e. in scripting mode).
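
A minimal sketch of such a channel boundary, assuming `std::sync::mpsc` and a plain thread instead of the Tokio channels the project presumably uses. The variants of `UserAction` and `UiEvent`, and the functions `run_harness` and `run_script`, are illustrative assumptions rather than the project's actual types.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Messages flowing from the front end (TUI or script) to the harness.
#[derive(Debug, PartialEq)]
enum UserAction {
    SendMessage(String),
    Quit,
}

/// Messages flowing from the harness back to the front end.
#[derive(Debug, PartialEq)]
enum UiEvent {
    AssistantText(String),
    TurnComplete,
}

/// Stand-in harness: echoes each message back, then reports the turn as done.
fn run_harness(actions: Receiver<UserAction>, events: Sender<UiEvent>) {
    for action in actions {
        match action {
            UserAction::SendMessage(text) => {
                // A real harness would call the model here.
                let _ = events.send(UiEvent::AssistantText(format!("echo: {text}")));
                let _ = events.send(UiEvent::TurnComplete);
            }
            UserAction::Quit => break,
        }
    }
}

/// Scripted driver: no TUI involved, just a list of messages to send.
fn run_script(messages: &[&str]) -> Vec<UiEvent> {
    let (action_tx, action_rx) = channel();
    let (event_tx, event_rx) = channel();
    let harness = thread::spawn(move || run_harness(action_rx, event_tx));

    for msg in messages {
        action_tx.send(UserAction::SendMessage(msg.to_string())).unwrap();
    }
    action_tx.send(UserAction::Quit).unwrap();
    harness.join().unwrap();

    // The harness has exited, so everything it sent is already buffered.
    event_rx.try_iter().collect()
}
```

Because the harness only ever sees the two channel ends, the same `run_harness` could be driven by a TUI event loop or by a script like `run_script`.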
|
||||
|
||||
## Harness Design
|
||||
|
||||
The harness follows a fairly straightforward design loop.
|
||||
|
||||
1. Send message to underlying model.
|
||||
2. If model requests a tool use, execute it (via a call to the executor) and return to 1.
|
||||
3. Else, wait for further user input.
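
The loop above can be sketched as follows. This is a simplified, synchronous illustration: the `Model` and `Executor` traits, `ModelReply`, and the string-based transcript are hypothetical stand-ins, not the project's real interfaces.

```rust
/// What the model asked for in a single response.
enum ModelReply {
    /// The model wants a tool run before it continues.
    ToolUse { name: String, input: String },
    /// The model is done; control returns to the user.
    Text(String),
}

/// Hypothetical stand-in for the model provider.
trait Model {
    fn send(&mut self, transcript: &[String]) -> ModelReply;
}

/// Hypothetical stand-in for the tool executor client.
trait Executor {
    fn call_tool(&mut self, name: &str, input: &str) -> Result<String, String>;
}

/// Runs one user turn: loops until the model stops requesting tools.
fn run_turn(model: &mut dyn Model, exec: &mut dyn Executor, user_msg: &str) -> Vec<String> {
    let mut transcript = vec![format!("user: {user_msg}")];
    loop {
        // Step 1: send the conversation so far to the model.
        match model.send(&transcript) {
            // Step 2: execute the tool, record the outcome, go back to step 1.
            // Errors are recorded too, so the model can react to them.
            ModelReply::ToolUse { name, input } => {
                let line = match exec.call_tool(&name, &input) {
                    Ok(out) => format!("tool {name}: {out}"),
                    Err(err) => format!("tool {name} failed: {err}"),
                };
                transcript.push(line);
            }
            // Step 3: no tool requested, so hand control back to the user.
            ModelReply::Text(text) => {
                transcript.push(format!("assistant: {text}"));
                return transcript;
            }
        }
    }
}

/// Scripted model for demonstration: one tool request, then a final answer.
struct ScriptedModel {
    calls: usize,
}

impl Model for ScriptedModel {
    fn send(&mut self, _transcript: &[String]) -> ModelReply {
        self.calls += 1;
        if self.calls == 1 {
            ModelReply::ToolUse { name: "read_file".into(), input: "README.md".into() }
        } else {
            ModelReply::Text("done".into())
        }
    }
}

/// Executor that pretends every tool call succeeds.
struct FakeExecutor;

impl Executor for FakeExecutor {
    fn call_tool(&mut self, _name: &str, input: &str) -> Result<String, String> {
        Ok(format!("contents of {input}"))
    }
}
```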
|
||||
|
||||
### Harness Instantiation
|
||||
|
||||
The harness is instantiated with a system prompt and a tarpc client to the tool executor.
(In the first iteration we use an in-process channel for the tarpc client.)
|
||||
|
||||
### Model Integration
|
||||
|
||||
The harness uses a trait to stay agnostic to the underlying model provider.

This trait unifies a variety of APIs behind a `StreamEvent` interface for streaming responses
from the API.

Currently, only Anthropic's Claude API is supported.

Messages are constructed so as to support prompt caching when available.
|
||||
|
||||
### Session Logging
|
||||
- JSONL format, one event per line
|
||||
- Events: user message, assistant message, tool call, tool result.
|
||||
- Tree-addressable via parent IDs (enables conversation branching later)
|
||||
- Token usage stored per event
|
||||
- Linear UX for now, branching deferred
|
||||
|
||||
## Executor Design
|
||||
|
||||
The key aspect of the executor design is that it is configured with sandbox permissions
that allow tool use without any user prompting. Either the tool use succeeds within the
sandbox and its result is returned to the model, or it fails and a permission error is
returned to the model.

The sandboxing allows running arbitrary shell commands without prompting.
|
||||
|
||||
### Executor Interface
|
||||
|
||||
The executor interface exposed to the harness has the following methods.
|
||||
|
||||
- `list_available_tools`: takes no arguments and returns tool names, descriptions, and argument schemas.
- `call_tool`: takes a tool name and its arguments and returns either a result or an error.
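
As a rough illustration of this two-method surface, here is a plain synchronous trait with a toy implementation. It deliberately leaves out tarpc and JSON: `ToolInfo`, `ToolArgs`, `ToolExecutor`, and `EchoExecutor` are hypothetical names, and arguments are simplified to a string map.

```rust
use std::collections::HashMap;

/// Description of one tool, as returned by `list_available_tools`.
#[derive(Debug, Clone, PartialEq)]
struct ToolInfo {
    name: String,
    description: String,
    /// Argument schema, kept as an opaque string here (JSON Schema in practice).
    argument_schema: String,
}

/// Simplified arguments; a real implementation would likely carry JSON.
type ToolArgs = HashMap<String, String>;

/// The two-method surface the harness sees.
trait ToolExecutor {
    fn list_available_tools(&self) -> Vec<ToolInfo>;
    fn call_tool(&self, name: &str, args: &ToolArgs) -> Result<String, String>;
}

/// Toy executor exposing a single `echo` tool.
struct EchoExecutor;

impl ToolExecutor for EchoExecutor {
    fn list_available_tools(&self) -> Vec<ToolInfo> {
        vec![ToolInfo {
            name: "echo".to_string(),
            description: "Returns its `text` argument unchanged.".to_string(),
            argument_schema: r#"{"type":"object","required":["text"]}"#.to_string(),
        }]
    }

    fn call_tool(&self, name: &str, args: &ToolArgs) -> Result<String, String> {
        match name {
            "echo" => args
                .get("text")
                .cloned()
                .ok_or_else(|| "missing argument: text".to_string()),
            // Unknown tools are reported as errors, not panics, so the
            // harness can pass the message back to the model.
            other => Err(format!("unknown tool: {other}")),
        }
    }
}
```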
|
||||
|
||||
### Sandboxing
|
||||
|
||||
Sandboxing is done using the Linux kernel feature Landlock.

This allows restricting file system access (read-only, read/write, or no access)
as well as network access (on or off).
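
The policy itself can be modelled in a few lines. The sketch below only describes which access a path would be granted; it does not call Landlock, which enforces the real rules in the kernel. The field names and `access_for` are illustrative assumptions.

```rust
use std::path::{Path, PathBuf};

/// File system access levels described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FsAccess {
    None,
    ReadOnly,
    ReadWrite,
}

/// Hypothetical policy type; the field names are illustrative.
struct SandboxPolicy {
    read_only: Vec<PathBuf>,
    read_write: Vec<PathBuf>,
    network_allowed: bool,
}

impl SandboxPolicy {
    /// Strongest access granted to `path` by any rule whose root contains it.
    /// Assumes `path` is already absolute and normalized.
    fn access_for(&self, path: &Path) -> FsAccess {
        // `Path::starts_with` compares whole components, so a rule for
        // `/a/project` does not match `/a/project-other`.
        if self.read_write.iter().any(|root| path.starts_with(root)) {
            FsAccess::ReadWrite
        } else if self.read_only.iter().any(|root| path.starts_with(root)) {
            FsAccess::ReadOnly
        } else {
            FsAccess::None
        }
    }
}
```

A model like this is useful for unit-testing policy construction without applying any kernel rules.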
|
||||
|
||||
## TUI Design
|
||||
|
||||
The bulk of the complexity of this coding agent is pushed to the TUI in this design.
|
||||
|
||||
The driving goals of the TUI are:
|
||||
|
||||
- Support (neo)vim-style keyboard navigation and modal editing.
- Full progressive disclosure of information: high-level information is understandable at a glance,
  but full tool use and thinking traces can be expanded.
- Support for instantiating multiple different instances of the core harness (e.g. different
  instantiations for code review vs planning vs implementation).
|
||||
|
||||
## UI
|
||||
- **Agent view:** Tree-based hierarchy (not flat tabs) for sub-agent inspection
|
||||
@@ -24,12 +101,17 @@
- **Status bar:** Mode indicator, current activity (Plan/Execute), token totals, network policy state
|
||||
|
||||
## Planning Mode
|
||||
- Distinct activity from execution — planner agent produces a plan file, does not execute
|
||||
- Plan file is structured markdown: steps with descriptions, files involved, acceptance criteria
|
||||
- Plan is reviewable and editable before execution (`:edit-plan` opens `$EDITOR`)
|
||||
- User explicitly approves plan before execution begins
|
||||
- Executor agent receives the plan file + project context, not the planning conversation
|
||||
- Plan-step progress tracked during execution (complete/in-progress/failed)
|
||||
|
||||
In planning mode the TUI instantiates a harness with read access to the project directory
and write access to a single plan markdown file.

The TUI then provides a glue mechanism that pipes that plan into a new instantiation of the
harness in execute mode.

Additionally, we specify a schema for "surveys" that allows the model to ask the user questions about
the plan.

We also provide a shortcut (Ctrl+G or `:edit-plan`) that opens the plan in the user's `$EDITOR`.
|
||||
|
||||
## Sub-Agents
|
||||
- Independent context windows with summary passed back to parent
|
||||
@@ -38,68 +120,3 @@
- Plan executor is a specialized sub-agent where the plan replaces the summary
|
||||
- Direct user interaction with sub-agents deferred
|
||||
|
||||
## Tool System
|
||||
- Built-in tool system with `Tool` trait
|
||||
- Core tools: `read_file`, `write_file`, `edit_file`, `shell_exec`, `list_directory`, `search_files`
|
||||
- Approval gates by risk level: auto-approve (reads), confirm (writes/shell), deny
|
||||
- MCP not implemented but interface designed to allow future adapter
|
||||
|
||||
## Sandboxing
|
||||
- **Landlock** (Linux kernel-level):
|
||||
- Read-only: system-wide (`/`)
|
||||
- Read-write: project directory, temp directory
|
||||
- Network: blocked by default, toggleable via `:net on/off`
|
||||
- Graceful degradation on older kernels
|
||||
- All tool execution goes through `Sandbox` — tools never touch filesystem directly
|
||||
|
||||
## Session Logging
|
||||
- JSONL format, one event per line
|
||||
- Events: user message, assistant message, tool call, tool result, sub-agent spawn/result, plan created, plan step status
|
||||
- Tree-addressable via parent IDs (enables conversation branching later)
|
||||
- Token usage stored per event
|
||||
- Linear UX for now, branching deferred
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
- **`provider`:** SSE stream parsing from byte fixtures, message/tool serialization, `StreamEvent` variant correctness
|
||||
- **`tools`:** Path canonicalization, traversal prevention, risk level classification, registry dispatch
|
||||
- **`sandbox`:** Landlock policy construction, path validation logic (without applying kernel rules)
|
||||
- **`core`:** Conversation tree operations (insert, query by parent, turn computation, token totals), orchestrator state machine transitions against mock `StreamEvent` sequences
|
||||
- **`session`:** JSONL serialization roundtrips, parent ID chain reconstruction
|
||||
- **`tui`:** Widget rendering via Ratatui `TestBackend`
|
||||
|
||||
### Integration Tests — Component Boundaries
|
||||
- **Core ↔ Provider:** Mock `ModelProvider` replaying recorded API sessions (full SSE streams with tool use). Tests the complete orchestration loop deterministically without network.
|
||||
- **Core ↔ TUI (channel boundary):** Orchestrator with mock provider connected to channels. Assert correct `UIEvent` sequence, inject `UserAction` messages, verify approval/denial flow.
|
||||
- **Tools ↔ Sandbox:** Real file operations and shell commands in temp directories. Verify write confinement, path traversal rejection, network blocking. Skip Landlock-specific tests on older kernels in CI.
|
||||
|
||||
### Integration Tests — End to End
|
||||
- **Recorded session replay:** Capture real Claude API HTTP request/response pairs, replay deterministically. Exercises full stack (core + channel + mock TUI) without cost or network dependency. Primary E2E test strategy.
|
||||
- **Live API tests:** Small suite behind feature flag / env var. Verifies real API integration. Run manually before releases, not in CI.
|
||||
|
||||
### Benchmarking — SWE-bench
|
||||
- **Target:** SWE-bench Verified (500 curated problems) as primary benchmark
|
||||
- **Secondary:** SWE-bench Pro for testing planning mode on longer-horizon tasks
|
||||
- **Approach:** Headless mode (core without TUI) produces unified diff patches, evaluated via SWE-bench Docker harness
|
||||
- **Baseline:** mini-swe-agent (~100 lines Python, >74% on Verified) as calibration — if we score significantly below with same model, the issue is scaffolding
|
||||
- **Cadence:** Milestone checks, not continuous CI (too expensive/slow)
|
||||
- **Requirements:** x86_64, 120GB+ storage, 16GB RAM, 8 CPU cores
|
||||
|
||||
### Test Sequencing
|
||||
- Phase 1: Unit tests for SSE parser, event types, message serialization
|
||||
- Phase 4: Recorded session replay infrastructure (core loop complex enough to warrant it)
|
||||
- Phase 6-7: Headless mode + first SWE-bench Verified run
|
||||
|
||||
## Configuration (Deferred)
|
||||
- Single-user, hardcoded defaults for now
|
||||
- Designed for later: global config, per-project `.agent.toml`, configurable keybindings
|
||||
|
||||
## Deferred Features
|
||||
- Conversation branching (tree structure in log, linear UX for now)
|
||||
- Direct sub-agent interaction
|
||||
- MCP adapter
|
||||
- Full markdown/syntax-highlighted rendering
|
||||
- Session log viewer
|
||||
- Per-project configuration
|
||||
- Structured plan editor in TUI (use `$EDITOR` for now)
|
||||
|
|
|
|||
694
PLAN.md
@@ -1,96 +1,616 @@
# Implementation Plan
|
||||
# Skate Implementation Plan
|
||||
|
||||
## Phase 4: Sandboxing
|
||||
This plan closes the gaps between the current codebase and the goals stated in DESIGN.md.
|
||||
The phases are ordered by dependency -- each phase builds on the previous.
|
||||
|
||||
### Step 4.1: Create sandbox module with policy types and tracing foundation
|
||||
- `SandboxPolicy` struct: read-only paths, read-write paths, network allowed bool
|
||||
- `Sandbox` struct holding policy + working dir
|
||||
- Add `tracing` spans and events throughout from the start:
|
||||
- `#[instrument]` on all public `Sandbox` methods
|
||||
- `debug!` on policy construction with path lists
|
||||
- `info!` on sandbox creation with full policy summary
|
||||
- No enforcement yet, just the type skeleton and module wiring
|
||||
- **Files:** new `src/sandbox/mod.rs`, `src/sandbox/policy.rs`
|
||||
- **Done when:** compiles, unit tests for policy construction, `RUST_LOG=debug cargo test` shows sandbox trace output
|
||||
## Current State Summary
|
||||
|
||||
### Step 4.2: Landlock policy builder with startup gate and tracing
|
||||
- Translate `SandboxPolicy` into Landlock ruleset using `landlock` crate
|
||||
- Kernel requirements:
|
||||
- **ABI v4 (kernel 6.7+):** minimum required -- provides both filesystem and network sandboxing
|
||||
- ABI 1-3 have filesystem only, no network restriction -- tools could exfiltrate data freely
|
||||
- Startup behavior -- on launch, check Landlock ABI version:
|
||||
- ABI >= 4: proceed normally (full filesystem + network sandboxing)
|
||||
- ABI < 4 (including unsupported): **refuse to start** with clear error: "Landlock ABI v4+ required (kernel 6.7+). Use --yolo to run without sandboxing."
|
||||
- `--yolo` flag: skip all Landlock enforcement, log `warn!` at startup, show "UNSANDBOXED" in status bar permanently
|
||||
- Landlock applied per-child-process via `pre_exec`, NOT to the main process
|
||||
- Main process needs unrestricted network (Claude API) and filesystem (provider)
|
||||
- Each `exec_command` child gets the current policy at spawn time
|
||||
- `:net on/off` takes effect on the next spawned command
|
||||
- Tracing:
|
||||
- `info!` on kernel ABI version detected
|
||||
- `debug!` for each rule added to ruleset (path, access flags)
|
||||
- `warn!` on `--yolo` mode ("running without kernel sandboxing")
|
||||
- `error!` if ruleset creation fails unexpectedly
|
||||
- **Files:** `src/sandbox/landlock.rs`, add `landlock` dep to `Cargo.toml`, update CLI args in `src/app/`
|
||||
- **Done when:** unit test constructs ruleset without panic; `--yolo` flag works on unsupported kernel; startup refuses without flag on unsupported kernel
|
||||
Phase 0 (core loop) is functionally complete: the TUI renders conversations, the
|
||||
orchestrator drives the Claude API, tools execute inside a Landlock sandbox, and the
|
||||
channel boundary between TUI and core is properly maintained.
|
||||
|
||||
### Step 4.3: Sandbox file I/O API with operation tracing
|
||||
- `Sandbox::read_file`, `Sandbox::write_file`, `Sandbox::list_directory`
|
||||
- Move `validate_path` from `src/tools/mod.rs` into sandbox
|
||||
- Tracing:
|
||||
- `debug!` on every file operation: requested path, canonical path, allowed/denied
|
||||
- `trace!` for path validation steps (join, canonicalize, starts_with check)
|
||||
- `warn!` on path escape attempts (log the attempted path for debugging)
|
||||
- `debug!` on successful operations with bytes read/written
|
||||
- **Files:** `src/sandbox/mod.rs`
|
||||
- **Done when:** unit tests in tempdir pass; path traversal rejected; `RUST_LOG=trace` shows full path resolution chain
|
||||
The major gaps are:
|
||||
|
||||
### Step 4.4: Sandbox command execution with process tracing
|
||||
- `Sandbox::exec_command(cmd, args, working_dir)` spawns child process with Landlock applied
|
||||
- Captures stdout/stderr, enforces timeout
|
||||
- Tracing:
|
||||
- `info!` on command spawn: command, args, working_dir, timeout
|
||||
- `debug!` on command completion: exit code, stdout/stderr byte lengths, duration
|
||||
- `warn!` on non-zero exit codes
|
||||
- `error!` on timeout or spawn failure with full context
|
||||
- `trace!` for Landlock application to child process thread
|
||||
- **Files:** `src/sandbox/mod.rs` or `src/sandbox/exec.rs`
|
||||
- **Done when:** unit test runs `echo hello` in tempdir; write outside sandbox fails (on supported kernels)
|
||||
1. Tool executor tarpc interface -- the orchestrator calls tools directly rather than
   via a tarpc client/server split as DESIGN.md specifies. This is the biggest
   structural gap and a prerequisite for sub-agents (each agent gets its own client).
2. Session logging (JSONL, tree-addressable) -- no `session/` module exists yet.
3. Token tracking -- counts are debug-logged but not surfaced to the user.
4. TUI introspection -- tool blocks and thinking traces cannot be expanded/collapsed.
5. Status bar is sparse -- no token totals, no activity mode, no network state badge.
6. Planning Mode -- no dedicated harness instantiation with restricted sandbox.
7. Sub-agents -- no spawning mechanism, no independent context windows.
8. Space-bar leader key and which-key help overlay are absent.
|
||||
|
||||
### Step 4.5: Wire tools through Sandbox
|
||||
- Change `Tool::execute` signature to accept `&Sandbox` instead of (or in addition to) `&Path`
|
||||
- Update all 4 built-in tools to call `Sandbox` methods instead of `std::fs`/`std::process::Command`
|
||||
- Remove direct `std::fs` usage from tool implementations
|
||||
- Update `ToolRegistry` and orchestrator to pass `Sandbox`
|
||||
- Tracing: tools now inherit sandbox spans automatically via `#[instrument]`
|
||||
- **Files:** `src/tools/*.rs`, `src/tools/mod.rs`, `src/core/orchestrator.rs`
|
||||
- **Done when:** all existing tool tests pass through Sandbox; no direct `std::fs` in tool files; `RUST_LOG=debug cargo run` shows sandbox operations during tool execution
|
||||
---
|
||||
|
||||
### Step 4.6: Network toggle
|
||||
- `network_allowed: bool` in `SandboxPolicy`
|
||||
- `:net on/off` TUI command parsed in input handler, sent as `UserAction::SetNetworkPolicy(bool)`
|
||||
- Orchestrator updates `Sandbox` policy. Status bar shows network state.
|
||||
- Only available when Landlock ABI >= 4 (kernel 6.7+); command hidden otherwise
|
||||
- Status bar shows: network state when available, "UNSANDBOXED" in `--yolo` mode
|
||||
- Tracing: `info!` on network policy change
|
||||
- **Files:** `src/tui/input.rs`, `src/tui/render.rs`, `src/core/types.rs`, `src/core/orchestrator.rs`, `src/sandbox/mod.rs`
|
||||
- **Done when:** toggling `:net` updates status bar; Landlock network restriction applied on ABI >= 4
|
||||
## Phase 1 -- Tool Executor tarpc Interface
|
||||
|
||||
### Step 4.7: Integration tests
|
||||
- Tools + Sandbox in tempdir: write confinement, path traversal rejection, shell command confinement
|
||||
- Skip Landlock-specific assertions on ABI < 4
|
||||
- Test `--yolo` mode: sandbox constructed but no kernel enforcement
|
||||
- Test startup gate: verify error on ABI < 4 without `--yolo`
|
||||
- Tests should assert tracing output where relevant (use `tracing-test` crate or `tracing_subscriber::fmt::TestWriter`)
|
||||
- **Files:** `tests/sandbox.rs`
|
||||
- **Done when:** `cargo test --test sandbox` passes
|
||||
**Goal:** Introduce the harness/executor split described in DESIGN.md. The executor
owns the `ToolRegistry` and `Sandbox`; the orchestrator (harness) communicates with
it exclusively through a tarpc client. In this phase the transport is in-process
(tarpc's unbounded channel pair), laying the groundwork for out-of-process execution
in a later phase.
|
||||
|
||||
### Phase 4 verification (end-to-end)
|
||||
1. `cargo test` -- all tests pass
|
||||
2. `cargo clippy -- -D warnings` -- zero warnings
|
||||
3. `RUST_LOG=debug cargo run -- --project-dir .` -- ask Claude to read a file, observe sandbox trace logs showing path validation and Landlock policy
|
||||
4. Ask Claude to write a file outside project dir -- sandbox denies with `warn!` log
|
||||
5. Ask Claude to run a shell command -- observe command spawn/completion trace
|
||||
6. `:net off` then ask for network access -- verify blocked
|
||||
7. Without `--yolo` on ABI < 4: verify startup refuses with clear error
|
||||
8. With `--yolo`: verify startup succeeds, "UNSANDBOXED" in status bar, `warn!` in logs
|
||||
This is the largest structural change in the plan. Every subsequent phase benefits
from the cleaner boundary: sub-agents each get their own executor client (Phase 7),
and the sandbox policy becomes a constructor argument to the executor rather than
something threaded through the orchestrator.
|
||||
|
||||
### 1.1 Define the tarpc service
|
||||
|
||||
Create `src/executor/mod.rs`:
|
||||
|
||||
```rust
#[tarpc::service]
pub trait Executor {
    /// Return the full list of tools this executor exposes, including their
    /// JSON Schema input descriptors. The harness calls this once at startup
    /// and caches the result for the lifetime of the conversation.
    async fn list_tools() -> Vec<ToolDefinition>;

    /// Invoke a single tool by name with a JSON-encoded argument object.
    /// Returns the text content to feed back to the model, or an error string
    /// that is also fed back (so the model can self-correct).
    async fn call_tool(name: String, input: serde_json::Value) -> Result<String, String>;
}
```
|
||||
|
||||
`ToolDefinition` is already defined in `core/types.rs` and is provider-agnostic --
|
||||
no new types are needed on the wire.
|
||||
|
||||
### 1.2 Implement `ExecutorServer`
|
||||
|
||||
Still in `src/executor/mod.rs`, add:
|
||||
|
||||
```rust
use std::sync::Arc;

use serde_json::Value;
use tarpc::context::Context;

pub struct ExecutorServer {
    registry: ToolRegistry,
    sandbox: Arc<Sandbox>,
}

impl ExecutorServer {
    pub fn new(registry: ToolRegistry, sandbox: Sandbox) -> Self { ... }
}

impl Executor for ExecutorServer {
    async fn list_tools(self, _: Context) -> Vec<ToolDefinition> {
        self.registry.definitions()
    }

    async fn call_tool(self, _: Context, name: String, input: Value) -> Result<String, String> {
        match self.registry.get(&name) {
            // Unknown tools are reported to the model as an error string.
            None => Err(format!("unknown tool: {name}")),
            Some(tool) => tool
                .execute(input, &self.sandbox)
                .await
                .map_err(|e| e.to_string()),
        }
    }
}
```
|
||||
|
||||
The `Arc<Sandbox>` is required because tarpc clones the server struct per request.
|
||||
|
||||
### 1.3 In-process transport helper
|
||||
|
||||
Add a function to `src/executor/mod.rs` (and re-export from `src/app/mod.rs`) that
|
||||
wires an `ExecutorServer` to a client over tarpc's in-memory channel:
|
||||
|
||||
```rust
use tarpc::server::Channel; // brings `execute` into scope

/// Spawn an ExecutorServer on the current tokio runtime and return a client
/// connected to it via an in-process channel. The server task runs until
/// the client is dropped.
pub fn spawn_local(server: ExecutorServer) -> ExecutorClient {
    let (client_transport, server_transport) = tarpc::transport::channel::unbounded();
    // Named `channel` so it does not shadow the `server` argument.
    let channel = tarpc::server::BaseChannel::with_defaults(server_transport);
    // NOTE: depending on the tarpc version, `execute` may yield a stream of
    // per-request futures that must be driven and spawned individually.
    tokio::spawn(channel.execute(server.serve()));
    ExecutorClient::new(tarpc::client::Config::default(), client_transport).spawn()
}
```
|
||||
|
||||
### 1.4 Refactor `Orchestrator` to use the client
|
||||
|
||||
Currently `Orchestrator<P>` holds `ToolRegistry` and `Sandbox` directly and calls
|
||||
`tool.execute(input, &sandbox)` in `run_turn`. Replace these fields with:
|
||||
|
||||
```rust
executor: ExecutorClient,
tool_definitions: Vec<ToolDefinition>, // fetched once at construction
```
|
||||
|
||||
`run_turn` changes from direct tool dispatch to:
|
||||
|
||||
```rust
let result = self.executor
    .call_tool(context::current(), name, input)
    .await;
```
|
||||
|
||||
The `tool_definitions` vec is passed to `provider.stream()` instead of being built
|
||||
from the registry on each call.
|
||||
|
||||
### 1.5 Update `app/mod.rs`
|
||||
|
||||
Replace the inline construction of `ToolRegistry + Sandbox` in `app::run` with:
|
||||
|
||||
```rust
let registry = build_tool_registry();
let sandbox = Sandbox::new(policy, project_dir, enforcement)?;
let executor = executor::spawn_local(ExecutorServer::new(registry, sandbox));
let orchestrator = Orchestrator::new(provider, executor, system_prompt);
```
|
||||
|
||||
### 1.6 Tests
|
||||
|
||||
- Unit: `ExecutorServer::call_tool` with a mock `ToolRegistry` returns correct
|
||||
output and maps errors to `Err(String)`.
|
||||
- Integration: `spawn_local` -> `client.call_tool` round-trip through the in-process
|
||||
channel executes a real `read_file` against a temp dir.
|
||||
- Integration: existing orchestrator integration tests continue to pass after the
|
||||
refactor (the mock provider path is unchanged; only tool dispatch changes).
|
||||
|
||||
### 1.7 Files touched
|
||||
|
||||
| Action | File |
|--------|------|
| New | `src/executor/mod.rs` |
| Modified | `src/core/orchestrator.rs` -- remove registry/sandbox, add executor client |
| Modified | `src/app/mod.rs` -- construct executor, pass client to orchestrator |
| Modified | `Cargo.toml` -- add `tarpc` with `tokio1` and `serde-transport` features |
|
||||
|
||||
New dependency: `tarpc` (with `tokio1` and `serde-transport` features).
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 -- Session Logging
|
||||
|
||||
**Goal:** Persist every event to a JSONL file. This is the foundation for token
|
||||
accounting, session resume, and future conversation branching.
|
||||
|
||||
### 2.1 Add `src/session/` module
|
||||
|
||||
Create `src/session/mod.rs` with the following public surface:
|
||||
|
||||
```rust
pub struct SessionWriter { ... }

impl SessionWriter {
    /// Open (or create) a JSONL log at the given path in append mode.
    pub async fn open(path: &Path) -> Result<Self, SessionError>;

    /// Append one event. Never rewrites history.
    pub async fn append(&self, event: &LogEvent) -> Result<(), SessionError>;
}

pub struct SessionReader { ... }

impl SessionReader {
    pub async fn load(path: &Path) -> Result<Vec<LogEvent>, SessionError>;
}
```
|
||||
|
||||
### 2.2 Define `LogEvent`
|
||||
|
||||
```rust
pub struct LogEvent {
    pub id: Uuid,
    pub parent_id: Option<Uuid>,
    pub timestamp: DateTime<Utc>,
    pub payload: LogPayload,
    pub token_usage: Option<TokenUsage>,
}

pub enum LogPayload {
    UserMessage { content: String },
    AssistantMessage { content: Vec<ContentBlock> },
    ToolCall { tool_name: String, input: serde_json::Value },
    ToolResult { tool_use_id: String, content: String, is_error: bool },
}

pub struct TokenUsage {
    pub input: u32,
    pub output: u32,
    pub cache_read: Option<u32>,
    pub cache_write: Option<u32>,
}
```
|
||||
|
||||
`id` and `parent_id` form a tree that enables future branching. For now the
|
||||
conversation is linear so `parent_id` is always the id of the previous event.
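
To illustrate how the `id` / `parent_id` pairs address a tree, here is a small sketch that rebuilds the path from the root to a given event. It uses a trimmed-down `Event` with `u64` ids in place of `Uuid`; both that type and `path_to` are hypothetical and only meant to show the traversal.

```rust
use std::collections::HashMap;

/// Trimmed-down event: just the fields needed to walk the tree.
/// `u64` ids stand in for the `Uuid`s used by the real `LogEvent`.
#[derive(Debug, Clone)]
struct Event {
    id: u64,
    parent_id: Option<u64>,
    label: &'static str,
}

/// Returns the labels on the path from the root to `leaf`, in order.
/// Returns `None` if an id on the chain is missing or the chain loops.
fn path_to(events: &[Event], leaf: u64) -> Option<Vec<&'static str>> {
    let by_id: HashMap<u64, &Event> = events.iter().map(|e| (e.id, e)).collect();
    let mut path = Vec::new();
    let mut cursor = Some(leaf);
    while let Some(id) = cursor {
        // A well-formed chain cannot be longer than the log itself.
        if path.len() >= events.len() {
            return None;
        }
        let event = by_id.get(&id)?;
        path.push(event.label);
        cursor = event.parent_id;
    }
    // Collected leaf-to-root; flip to root-to-leaf.
    path.reverse();
    Some(path)
}
```

In a linear conversation this returns every event up to `leaf`. Once branching exists, two leaves sharing an ancestor simply share a prefix of the path.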
|
||||
|
||||
### 2.3 Wire into Orchestrator
|
||||
|
||||
- `Orchestrator` holds an `Option<SessionWriter>`.
|
||||
- Every time the orchestrator pushes to `ConversationHistory` it also appends a
|
||||
`LogEvent`. Token counts from `StreamEvent::InputTokens` / `OutputTokens` are
|
||||
stored on the final assistant event of each turn.
|
||||
- Session file lives at `.skate/sessions/<timestamp>.jsonl`.
|
||||
|
||||
### 2.4 Tests
|
||||
|
||||
- Unit: `SessionWriter::append` then `SessionReader::load` round-trips all payload
|
||||
variants.
|
||||
- Unit: parent_id chain is correct across a simulated multi-turn exchange.
|
||||
- Integration: run the orchestrator with a mock provider against a temp dir; assert
|
||||
the JSONL file is written.
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 -- Token Tracking & Status Bar
|
||||
|
||||
**Goal:** Surface token usage in the TUI per-turn and cumulatively.
|
||||
|
||||
### 3.1 Per-turn token counts in UIEvent
|
||||
|
||||
Add a variant to `UIEvent`:
|
||||
|
||||
```rust
UIEvent::TurnComplete { input_tokens: u32, output_tokens: u32 }
```
|
||||
|
||||
The orchestrator already receives `StreamEvent::InputTokens` and `OutputTokens`;
|
||||
it should accumulate them during a turn and emit them in `TurnComplete`.
|
||||
|
||||
### 3.2 AppState token counters
|
||||
|
||||
Add to `AppState`:
|
||||
|
||||
```rust
pub turn_input_tokens: u32,
pub turn_output_tokens: u32,
pub total_input_tokens: u64,
pub total_output_tokens: u64,
```
|
||||
|
||||
`events.rs` updates these on `TurnComplete`.
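
A minimal sketch of that update, assuming the per-turn fields are overwritten and the totals accumulate. `TokenCounters` and `on_turn_complete` are hypothetical names; in the plan these fields live directly on `AppState`.

```rust
/// Token-related slice of `AppState`, using the four fields listed above.
#[derive(Debug, Default)]
struct TokenCounters {
    turn_input_tokens: u32,
    turn_output_tokens: u32,
    total_input_tokens: u64,
    total_output_tokens: u64,
}

impl TokenCounters {
    /// Applies one `TurnComplete` event: the per-turn fields are replaced,
    /// while the session totals keep accumulating.
    fn on_turn_complete(&mut self, input_tokens: u32, output_tokens: u32) {
        self.turn_input_tokens = input_tokens;
        self.turn_output_tokens = output_tokens;
        // Widen before adding so long sessions cannot overflow a u32.
        self.total_input_tokens += u64::from(input_tokens);
        self.total_output_tokens += u64::from(output_tokens);
    }
}
```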
|
||||
|
||||
### 3.3 Status bar redesign
|
||||
|
||||
The status bar currently shows only the mode indicator. Expand it to four sections:
|
||||
|
||||
```
[ MODE ] [ ACTIVITY ] [ i:1234 o:567 | total i:9999 o:2345 ] [ NET: off ]
```
|
||||
|
||||
- **MODE** -- Normal / Insert / Command
|
||||
- **ACTIVITY** -- Plan / Execute (Phase 6 adds Plan; for now always "Execute")
|
||||
- **Tokens** -- per-turn input/output, then session cumulative
|
||||
- **NET** -- `on` (green) or `off` (red) reflecting `network_allowed`
|
||||
|
||||
Update `render.rs` to implement this layout using Ratatui `Layout::horizontal`.
|
||||
|
||||
### 3.4 Tests
|
||||
|
||||
- Unit: `AppState` accumulates totals correctly across multiple `TurnComplete` events.
|
||||
- TUI snapshot test (TestBackend): status bar renders all four sections with correct
|
||||
content after a synthetic `TurnComplete`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 4 -- TUI Introspection (Expand/Collapse)
|
||||
|
||||
**Goal:** Support progressive disclosure -- tool calls and thinking traces start
|
||||
collapsed; the user can expand them.
|
||||
|
||||
### 4.1 Block model
|
||||
|
||||
Replace the flat `Vec<DisplayMessage>` in `AppState` with a `Vec<DisplayBlock>`:
|
||||
|
||||
```rust
pub enum DisplayBlock {
    UserMessage { content: String },
    AssistantText { content: String },
    ToolCall {
        display: ToolDisplay,
        result: Option<String>,
        expanded: bool,
    },
    Error { message: String },
}
```
|
||||
|
||||
### 4.2 Navigation in Normal mode
|
||||
|
||||
Add block-level cursor to `AppState`:
|
||||
|
||||
```rust
pub focused_block: Option<usize>,
```
|
||||
|
||||
Keybindings (Normal mode):
|
||||
|
||||
| Key | Action |
|-----|--------|
| `[` | Move focus to previous block |
| `]` | Move focus to next block |
| `Enter` or `Space` | Toggle `expanded` on focused ToolCall block |
| `j` / `k` | Line scroll (unchanged) |
|
||||
|
||||
The focused block is highlighted with a distinct border color.
|
||||
|
||||
### 4.3 Render changes
|
||||
|
||||
`render.rs` must calculate the height of each `DisplayBlock` depending on whether
it is collapsed (1-2 summary lines) or expanded (full content). The scroll offset
operates on terminal rows, not message indices.
|
||||
|
||||
Collapsed tool call shows: `> tool_name(arg_summary) -- result_summary`
|
||||
Expanded tool call shows: full input and output as formatted by `tool_display.rs`.
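
A rough sketch of the height calculation and the row-based scroll lookup described above. `Block` and `locate_row` are hypothetical names, and line counts are taken as given rather than computed from wrapped text.

```rust
/// Minimal stand-in for a display block: only what layout needs.
struct Block {
    /// Number of rows when fully expanded.
    full_lines: usize,
    /// Number of summary rows when collapsed.
    summary_lines: usize,
    expanded: bool,
}

impl Block {
    /// Rows this block currently occupies.
    fn height(&self) -> usize {
        if self.expanded {
            self.full_lines
        } else {
            self.summary_lines
        }
    }
}

/// Maps a row-based scroll offset to (block index, row within that block).
/// Returns `None` when the offset is past the end of the content.
fn locate_row(blocks: &[Block], scroll_offset: usize) -> Option<(usize, usize)> {
    let mut remaining = scroll_offset;
    for (index, block) in blocks.iter().enumerate() {
        let height = block.height();
        if remaining < height {
            return Some((index, remaining));
        }
        // Skip this block entirely and keep looking.
        remaining -= height;
    }
    None
}
```

Toggling `expanded` changes the result of `height`, so the same scroll offset can land in a different block after a toggle; the renderer may want to adjust the offset to keep the focused block in view.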
|
||||
|
||||
### 4.4 Tests
|
||||
|
||||
- Unit: toggling `expanded` on a `ToolCall` block changes height calculation.
|
||||
- TUI snapshot: collapsed vs expanded render output for `WriteFile` and `ShellExec`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 5 -- Space-bar Leader Key & Which-Key Overlay
|
||||
|
||||
**Goal:** Support vim-style `<Space>` leader chords for configuration actions. This
|
||||
replaces the `:net on` / `:net off` text commands with discoverable hotkeys.
|
||||
|
||||
### 5.1 Leader key state machine
|
||||
|
||||
Extend `AppState` with:
|
||||
|
||||
```rust
pub leader_active: bool,
pub leader_timeout: Option<Instant>,
```
|
||||
|
||||
In Normal mode, pressing `Space` sets `leader_active = true` and starts a 1-second
timeout. The next key is dispatched through the chord table. If the timeout fires
or an unbound key is pressed, leader mode is cancelled with a brief status message.
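
The transitions described above (activate, chord match, cancel, timeout) might look like the sketch below. The `LeaderState`, `LeaderOutcome`, and `ChordAction` names are hypothetical, keys are simplified to `char`, and the current time is passed in explicitly so the state machine can be unit-tested without sleeping.

```rust
use std::time::{Duration, Instant};

pub const LEADER_TIMEOUT: Duration = Duration::from_secs(1);

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ChordAction {
    ToggleNetwork,
    ClearHistory,
    SwitchToPlan,
    ToggleWhichKey,
}

#[derive(Debug, PartialEq)]
pub enum LeaderOutcome {
    Activated,
    Matched(ChordAction),
    Cancelled,
    Ignored,
}

#[derive(Default)]
pub struct LeaderState {
    pub leader_active: bool,
    pub leader_timeout: Option<Instant>,
}

/// Chord table lookup for the key pressed after `<Space>`.
fn lookup(key: char) -> Option<ChordAction> {
    match key {
        'n' => Some(ChordAction::ToggleNetwork),
        'c' => Some(ChordAction::ClearHistory),
        'p' => Some(ChordAction::SwitchToPlan),
        '?' => Some(ChordAction::ToggleWhichKey),
        _ => None,
    }
}

impl LeaderState {
    /// Handle a Normal-mode key press at time `now`.
    pub fn on_key(&mut self, key: char, now: Instant) -> LeaderOutcome {
        if !self.leader_active {
            if key == ' ' {
                self.leader_active = true;
                self.leader_timeout = Some(now + LEADER_TIMEOUT);
                return LeaderOutcome::Activated;
            }
            return LeaderOutcome::Ignored;
        }
        // Any key after the leader ends leader mode, matched or not.
        let expired = self.leader_timeout.map_or(true, |t| now >= t);
        self.leader_active = false;
        self.leader_timeout = None;
        if expired {
            return LeaderOutcome::Cancelled;
        }
        lookup(key).map_or(LeaderOutcome::Cancelled, LeaderOutcome::Matched)
    }

    /// Called from the render tick; returns true if leader mode just timed out.
    pub fn expire_if_due(&mut self, now: Instant) -> bool {
        let due = self.leader_active && self.leader_timeout.map_or(false, |t| now >= t);
        if due {
            self.leader_active = false;
            self.leader_timeout = None;
        }
        due
    }
}
```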
|
||||
|
||||
### 5.2 Initial chord table

| Chord | Action |
|-------|--------|
| `<Space> n` | Toggle network policy |
| `<Space> c` | Clear history (`:clear`) |
| `<Space> p` | Switch to Plan mode (Phase 6) |
| `<Space> ?` | Toggle which-key overlay |
|
||||
|
||||
### 5.3 Which-key overlay
|
||||
|
||||
A centered popup rendered over the output pane that lists all available chords and
their descriptions. Rendered only when `leader_active = true` (after a short delay,
~200 ms, to avoid flicker for fast typists).
|
||||
|
||||
### 5.4 Remove `:net on/off` from command parser
|
||||
|
||||
Once leader-key network toggle is in place, remove the text-command duplicates to
keep the command palette small and focused.
|
||||
|
||||
### 5.5 Tests
|
||||
|
||||
- Unit: leader key state machine transitions (activate, timeout, chord match, cancel).
- TUI snapshot: which-key overlay renders with correct chord list.
|
||||
|
||||
---
|
||||
|
||||
## Phase 6 -- Planning Mode
|
||||
|
||||
**Goal:** A dedicated planning harness with a restricted sandbox that writes a single
plan file, plus a mechanism to pipe the plan into an execute harness.
|
||||
|
||||
### 6.1 Plan harness sandbox policy
|
||||
|
||||
In planning mode the orchestrator is instantiated with a `SandboxPolicy` that grants:
|
||||
|
||||
- `/` -- ReadOnly (same as execute)
- `<project_dir>/.skate/plan.md` -- ReadWrite (only this file)
- Network -- off
|
||||
|
||||
All other write attempts fail with a sandbox permission error returned to the model.
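
To make the policy concrete, here is a minimal sketch of the planning-mode rules as plain data with a most-specific-prefix lookup. The field layout of `SandboxPolicy`, the `Access` enum, and the method names are assumptions for illustration; the real enforcement happens in the kernel-level sandbox, not in a userspace check like this. Unix-style absolute paths are assumed.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Access {
    ReadOnly,
    ReadWrite,
}

/// Hypothetical shape; the real `SandboxPolicy` may differ.
pub struct SandboxPolicy {
    /// (path prefix, access) rules; the most specific matching prefix wins.
    pub rules: Vec<(PathBuf, Access)>,
    pub network: bool,
}

impl SandboxPolicy {
    /// Policy for planning mode: read everything, write only the plan file.
    pub fn plan_mode(project_dir: &Path) -> Self {
        Self {
            rules: vec![
                (PathBuf::from("/"), Access::ReadOnly),
                (project_dir.join(".skate/plan.md"), Access::ReadWrite),
            ],
            network: false,
        }
    }

    /// Access granted to `path`, or `None` if no rule covers it.
    pub fn access_for(&self, path: &Path) -> Option<Access> {
        self.rules
            .iter()
            .filter(|(prefix, _)| path.starts_with(prefix))
            .max_by_key(|(prefix, _)| prefix.components().count())
            .map(|(_, access)| *access)
    }

    pub fn can_write(&self, path: &Path) -> bool {
        self.access_for(path) == Some(Access::ReadWrite)
    }
}
```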
|
||||
|
||||
### 6.2 Survey tool
|
||||
|
||||
Add a new tool `ask_user` that allows the model to present structured questions to
the user during planning:

```rust
// Input schema
{
  "question": "string",
  "options": ["string"] | null   // null means free-text answer
}
```
|
||||
|
||||
The orchestrator sends a new `UIEvent::SurveyRequest { question, options }`. The TUI
renders an inline prompt. The user's answer is sent back as a `UserAction::SurveyResponse`.
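
The round trip across the channel boundary can be sketched with standard-library channels standing in for the Tokio channels the real harness would use. The payload of `SurveyResponse` (a plain `String`), the `ask_user` function signature, and the fake TUI thread are all assumptions made for this example.

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug, PartialEq)]
pub enum UIEvent {
    SurveyRequest { question: String, options: Option<Vec<String>> },
}

#[derive(Debug, PartialEq)]
pub enum UserAction {
    /// Assumed payload: the chosen option or the free-text answer.
    SurveyResponse(String),
}

/// Orchestrator side: send the question and block until the answer arrives.
/// Returns `None` if either channel has been closed.
pub fn ask_user(
    ui_tx: &mpsc::Sender<UIEvent>,
    action_rx: &mpsc::Receiver<UserAction>,
    question: &str,
    options: Option<Vec<String>>,
) -> Option<String> {
    ui_tx
        .send(UIEvent::SurveyRequest { question: question.to_string(), options })
        .ok()?;
    match action_rx.recv().ok()? {
        UserAction::SurveyResponse(answer) => Some(answer),
    }
}

/// Stand-in for the TUI: picks the first option, or a fixed free-text answer.
pub fn spawn_fake_tui(
    ui_rx: mpsc::Receiver<UIEvent>,
    action_tx: mpsc::Sender<UserAction>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while let Ok(UIEvent::SurveyRequest { options, .. }) = ui_rx.recv() {
            let answer = options
                .and_then(|opts| opts.into_iter().next())
                .unwrap_or_else(|| "free text".to_string());
            if action_tx.send(UserAction::SurveyResponse(answer)).is_err() {
                break;
            }
        }
    })
}
```

A fake TUI like this is also roughly what the survey integration test would need on the other side of the channel pair.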
|
||||
|
||||
### 6.3 TUI activity mode
|
||||
|
||||
`AppState` gets:

```rust
pub activity: Activity,

pub enum Activity { Plan, Execute }
```
|
||||
|
||||
Switching activity (via `<Space> p`) instantiates a new orchestrator on a fresh
channel pair. The old orchestrator is shut down cleanly. The status bar ACTIVITY
section updates.
|
||||
|
||||
### 6.4 Plan -> Execute handoff
|
||||
|
||||
When the user is satisfied with the plan (`<Space> x` or `:exec`):
|
||||
|
||||
1. TUI reads `.skate/plan.md`.
2. Constructs a new system prompt: `<original system prompt>\n\n## Plan\n<plan content>`.
3. Instantiates an Execute orchestrator with the full sandbox policy and the
   augmented system prompt.
4. Transitions `activity` to `Execute`.
|
||||
|
||||
The old Plan orchestrator is dropped.
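
Steps 1 and 2 of the handoff reduce to a small pure function plus a file read. The function names below are hypothetical; only the prompt format comes from the list above.

```rust
use std::{fs, io, path::Path};

/// Build the augmented prompt: `<original system prompt>\n\n## Plan\n<plan content>`.
pub fn augment_system_prompt(original: &str, plan: &str) -> String {
    format!("{original}\n\n## Plan\n{plan}")
}

/// Read `.skate/plan.md` under `project_dir` and build the Execute prompt.
pub fn load_execute_prompt(project_dir: &Path, original: &str) -> io::Result<String> {
    let plan = fs::read_to_string(project_dir.join(".skate/plan.md"))?;
    Ok(augment_system_prompt(original, &plan))
}
```

Keeping the formatting in a pure function makes the unit test for the handoff independent of the filesystem.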
|
||||
|
||||
### 6.5 Edit plan in $EDITOR
|
||||
|
||||
Hotkey `<Space> e` (or `:edit-plan`) suspends the TUI (restores terminal), opens
`$EDITOR` on `.skate/plan.md`, then resumes the TUI after the editor exits.
|
||||
|
||||
### 6.6 Tests
|
||||
|
||||
- Integration: plan harness rejects write to a file other than plan.md.
- Integration: survey tool round-trip through channel boundary.
- Unit: plan -> execute handoff produces correct augmented system prompt.
|
||||
|
||||
---
|
||||
|
||||
## Phase 7 -- Sub-Agents
|
||||
|
||||
**Goal:** The model can spawn independent sub-agents with their own context windows.
Results are summarised and returned to the parent.
|
||||
|
||||
### 7.1 `spawn_agent` tool
|
||||
|
||||
Add a new tool with input schema:

```rust
{
  "task": "string",                // instruction for the sub-agent
  "sandbox": {                     // optional policy overrides
    "network": bool,
    "extra_write_paths": ["string"]
  }
}
```
|
||||
|
||||
### 7.2 Sub-agent lifecycle
|
||||
|
||||
When `spawn_agent` executes:
|
||||
|
||||
1. Create a new `Orchestrator` with an independent conversation history.
2. The sub-agent's system prompt is the parent's system prompt plus the task
   description.
3. The sub-agent runs autonomously (no user interaction) until it emits a
   `UserAction::Quit` equivalent or hits `MAX_TOOL_ITERATIONS`.
4. The final assistant message is returned as the tool result (the "summary").
5. The sub-agent's session is logged to a child JSONL file linked to the parent
   session by a `parent_session_id` field.
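
The core of this lifecycle (steps 1 to 4) is a bounded loop over model turns. The sketch below uses a scripted stand-in for the model; `TurnSource`, `Turn`, `SubAgentResult`, and `Scripted` are hypothetical names, and the value of `MAX_TOOL_ITERATIONS` is a placeholder since the plan does not state it. Session logging (step 5) is omitted.

```rust
/// Placeholder value; the real limit is defined elsewhere in the codebase.
pub const MAX_TOOL_ITERATIONS: usize = 25;

/// One model turn: either a tool call to execute or the final message.
pub enum Turn {
    ToolCall(String),
    Final(String),
}

/// Hypothetical abstraction over the model, so a mock can drive the loop.
pub trait TurnSource {
    fn next_turn(&mut self, history: &[String]) -> Turn;
}

#[derive(Debug, PartialEq)]
pub enum SubAgentResult {
    Summary(String),
    IterationLimit,
}

/// Run a sub-agent with its own history until it finishes or hits the limit.
pub fn run_sub_agent<S: TurnSource>(
    source: &mut S,
    parent_prompt: &str,
    task: &str,
) -> SubAgentResult {
    // Independent history, seeded with parent prompt plus task description.
    let mut history = vec![format!("{parent_prompt}\n\n{task}")];
    for _ in 0..MAX_TOOL_ITERATIONS {
        match source.next_turn(&history) {
            Turn::Final(text) => return SubAgentResult::Summary(text),
            Turn::ToolCall(call) => history.push(format!("tool result for {call}")),
        }
    }
    SubAgentResult::IterationLimit
}

/// Mock that replays scripted turns, then keeps issuing tool calls forever.
pub struct Scripted(pub Vec<Turn>);

impl TurnSource for Scripted {
    fn next_turn(&mut self, _history: &[String]) -> Turn {
        if self.0.is_empty() {
            Turn::ToolCall("noop".to_string())
        } else {
            self.0.remove(0)
        }
    }
}
```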
|
||||
|
||||
### 7.3 TUI sub-agent view
|
||||
|
||||
The agent tree is accessible via `<Space> a`. A side panel shows:

```
Parent
+-- sub-agent 1  [running]
+-- sub-agent 2  [done]
```
|
||||
|
||||
Pressing Enter on a sub-agent opens a read-only replay of its conversation (scroll
only, no input). This is a stretch goal within this phase -- the core spawning
mechanism is the priority.
|
||||
|
||||
### 7.4 Tests
|
||||
|
||||
- Integration: spawn_agent with a mock provider runs to completion and returns a
  summary string.
- Unit: sub-agent session file has correct parent_session_id link.
- Unit: MAX_TOOL_ITERATIONS limit is respected within sub-agents.
|
||||
|
||||
In this phase `spawn_agent` gains a natural implementation: it calls
`executor::spawn_local` with a new `ExecutorServer` configured for the child policy,
constructs a new `Orchestrator` with that client, and runs it to completion. The
tarpc boundary from Phase 1 makes this straightforward.
|
||||
|
||||
---
|
||||
|
||||
## Phase 8 -- Prompt Caching
|
||||
|
||||
**Goal:** Use Anthropic's prompt caching to reduce cost and latency on long
conversations. DESIGN.md notes this as a desired property of message construction.
|
||||
|
||||
### 8.1 Cache breakpoints
|
||||
|
||||
The Anthropic API supports `"cache_control": {"type": "ephemeral"}` on message
content blocks. The optimal strategy is to mark the last user message of the longest
stable prefix as a cache write point.
|
||||
|
||||
In `provider/claude.rs`, when serializing the messages array:
|
||||
|
||||
- Mark the system prompt content block with `cache_control` (it never changes).
- Mark the penultimate user message with `cache_control` (the conversation history
  that is stable for the current turn).
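
Choosing the second breakpoint amounts to finding the penultimate user message in the history. A minimal sketch, assuming a simplified `Role` enum and a hypothetical `cache_breakpoint` helper rather than the real serializer in `provider/claude.rs`:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Role {
    User,
    Assistant,
}

/// Index of the penultimate user message, i.e. the message that should carry
/// `cache_control`. Returns `None` when there are fewer than two user messages.
pub fn cache_breakpoint(roles: &[Role]) -> Option<usize> {
    roles
        .iter()
        .enumerate()
        .rev()
        .filter(|(_, role)| **role == Role::User)
        .map(|(i, _)| i)
        .nth(1)
}
```

With fewer than two user messages only the system prompt would be marked, which matches the first bullet above.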
|
||||
|
||||
### 8.2 Cache token tracking
|
||||
|
||||
The `TokenUsage` struct in `session/` already reserves `cache_read` and
`cache_write` fields. `StreamEvent` must be extended:

```rust
StreamEvent::CacheReadTokens(u32),
StreamEvent::CacheWriteTokens(u32),
```
|
||||
|
||||
The Anthropic `message_start` event contains `usage.cache_read_input_tokens` and
`usage.cache_creation_input_tokens`. Parse these and emit the new variants.
|
||||
|
||||
### 8.3 Status bar update
|
||||
|
||||
Add cache tokens to the status bar display: `i:1234(c:800) o:567`.
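
A possible formatter for that display is sketched below. The `input` and `output` field names on `TokenUsage` are assumptions (the plan only names `cache_read` and `cache_write`), as is the reading that `c:` shows cache-read tokens.

```rust
/// Assumed shape; only `cache_read` and `cache_write` are named in the plan.
pub struct TokenUsage {
    pub input: u32,
    pub output: u32,
    pub cache_read: u32,
    pub cache_write: u32,
}

/// Render the token section of the status bar, e.g. `i:1234(c:800) o:567`.
pub fn status_tokens(usage: &TokenUsage) -> String {
    format!("i:{}(c:{}) o:{}", usage.input, usage.cache_read, usage.output)
}
```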
|
||||
|
||||
### 8.4 Tests
|
||||
|
||||
- Provider unit test: replay a fixture that contains cache token fields; assert the
  new StreamEvent variants are emitted.
- Snapshot test: status bar renders cache token counts correctly.
|
||||
|
||||
---
|
||||
|
||||
## Dependency Graph
|
||||
|
||||
```
Phase 1 (tarpc executor)
    |
    +-- Phase 2 (session logging) -- orchestrator refactor is complete
    |       |
    |       +-- Phase 3 (token tracking) -- requires session TokenUsage struct
    |       |
    |       +-- Phase 7 (sub-agents) -- requires session parent_session_id
    |
    +-- Phase 7 (sub-agents) -- spawn_local reuse is natural after Phase 1

Phase 4 (expand/collapse) -- independent, can be done alongside Phase 3

Phase 5 (leader key) -- independent, prerequisite for Phase 6

Phase 6 (planning mode) -- requires Phase 5 (leader key chord <Space> p)
                        -- benefits from Phase 1 (separate executor per activity)

Phase 8 (prompt caching) -- requires Phase 3 (cache token display)
```
|
||||
|
||||
Recommended order: 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 8, with 7 after 2 and 6.
|
||||
|
||||
---
|
||||
|
||||
## Files Touched Per Phase

| Phase | New Files | Modified Files |
|-------|-----------|----------------|
| 1 | `src/executor/mod.rs` | `src/core/orchestrator.rs`, `src/core/types.rs`, `src/app/mod.rs`, `Cargo.toml` |
| 2 | `src/session/mod.rs` | `src/core/orchestrator.rs`, `src/app/mod.rs` |
| 3 | -- | `src/core/types.rs`, `src/core/orchestrator.rs`, `src/tui/events.rs`, `src/tui/render.rs` |
| 4 | -- | `src/tui/mod.rs`, `src/tui/render.rs`, `src/tui/events.rs`, `src/tui/input.rs` |
| 5 | -- | `src/tui/input.rs`, `src/tui/render.rs`, `src/tui/mod.rs` |
| 6 | `src/tools/ask_user.rs` | `src/core/types.rs`, `src/core/orchestrator.rs`, `src/tui/mod.rs`, `src/tui/input.rs`, `src/tui/render.rs`, `src/app/mod.rs` |
| 7 | -- | `src/executor/mod.rs`, `src/core/orchestrator.rs`, `src/tui/render.rs`, `src/tui/input.rs` |
| 8 | -- | `src/provider/claude.rs`, `src/core/types.rs`, `src/session/mod.rs`, `src/tui/render.rs` |
|
||||
|
||||
---
|
||||
|
||||
## New Dependencies

| Crate | Phase | Reason |
|-------|-------|--------|
| `tarpc` | 1 | RPC service trait + in-process transport |
| `uuid` | 2 | LogEvent ids |
| `chrono` | 2 | Event timestamps (check if already transitive) |

No other new dependencies are needed. All other required functionality
(`serde_json`, `tokio`, `ratatui`, `tracing`) is already present.