Update the design and PLAN.md (#11)

Reviewed-on: #11
Co-authored-by: Drew Galbraith <drew@tiramisu.one>
Co-committed-by: Drew Galbraith <drew@tiramisu.one>
Drew 2026-03-14 21:52:38 +00:00 committed by Drew
parent 669e05b716
commit 7420755800
2 changed files with 707 additions and 170 deletions

DESIGN.md

@ -1,19 +1,96 @@
# Design Decisions
# Skate Design
This is a TUI coding agent harness built for one user. The unique design goals compared
to other coding agents are:
1) Allow autonomous execution with no permission prompts, without fully sacrificing security.
The user can configure what permissions the coding agent has before execution and these
are enforced using kernel-level sandboxing.
2) The UI supports introspection to better understand how the harness is performing.
Information may start collapsed, but it is possible to introspect things like tool uses
and thinking chains. Additionally, token usage is surfaced prominently to show where the harness
is performing inefficiently.
3) The UI is modal and supports Neovim-like hotkeys for navigation and configuration
(e.g. using the space bar as a leader key). We prefer having hotkeys over adding custom
slash commands (/model) to the text chat interface. The text chat should be reserved for
things that go straight to the underlying model.
## Stack
- **Language:** Rust
- **TUI Framework:** Ratatui + Crossterm
- **Async Runtime:** Tokio
## Architecture
- Channel boundary between TUI and core (fully decoupled)
- Module decomposition: `app`, `tui`, `core`, `provider`, `tools`, `sandbox`, `session`
- Headless mode: core without TUI, driven by script (enables benchmarking and CI)
## Model Integration
- Claude-first, multi-model via `ModelProvider` trait
- Common `StreamEvent` internal representation across providers
- Prompt caching-aware message construction
The coding agent is broken into three main components: the TUI, the harness, and the tool executor.
The harness communicates with the tool executor via a tarpc interface.
The TUI and harness communicate over a channel boundary and are fully decoupled
in a way that supports running the harness without the TUI (i.e. in scripting mode).
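As a rough illustration of the channel boundary, the sketch below drives a stand-in harness from a script with no TUI attached. It uses `std::sync::mpsc` and a thread to stay self-contained, whereas the real code presumably uses Tokio channels. The `UserAction` and `UIEvent` variants, `run_harness`, and `run_script` are hypothetical names chosen for this example.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical message types crossing the boundary; the variants are assumptions.
#[derive(Debug, PartialEq)]
enum UserAction {
    Submit(String),
    Quit,
}

#[derive(Debug, PartialEq)]
enum UIEvent {
    AssistantText(String),
    Finished,
}

// The harness only sees the two channel ends, so any front end can drive it.
fn run_harness(actions: mpsc::Receiver<UserAction>, events: mpsc::Sender<UIEvent>) {
    for action in actions {
        match action {
            UserAction::Submit(text) => {
                // A real harness would call the model here; this stand-in echoes.
                let _ = events.send(UIEvent::AssistantText(format!("echo: {text}")));
            }
            UserAction::Quit => break,
        }
    }
    let _ = events.send(UIEvent::Finished);
}

// Scripting-mode driver: feeds inputs and collects every emitted event.
fn run_script(inputs: &[&str]) -> Vec<UIEvent> {
    let (action_tx, action_rx) = mpsc::channel();
    let (event_tx, event_rx) = mpsc::channel();
    let worker = thread::spawn(move || run_harness(action_rx, event_tx));
    for input in inputs {
        action_tx.send(UserAction::Submit(input.to_string())).unwrap();
    }
    action_tx.send(UserAction::Quit).unwrap();
    worker.join().unwrap();
    // The harness dropped its sender on exit, so this iterator terminates.
    event_rx.iter().collect()
}
```

Because `run_harness` never references the terminal, the same function could sit behind a TUI event loop or a benchmark script.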
## Harness Design
The harness follows a fairly straightforward design loop.
1. Send message to underlying model.
2. If model requests a tool use, execute it (via a call to the executor) and return to 1.
3. Else, wait for further user input.
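The three steps above can be sketched as a small synchronous loop. This is a minimal sketch under assumptions, not the actual harness: `ModelReply`, `Model`, `Executor`, `ScriptedModel`, and `EchoExecutor` are hypothetical stand-ins, and the transcript is simplified to plain strings.

```rust
use std::collections::VecDeque;

// Hypothetical reply type: the model either asks for a tool or answers in text.
enum ModelReply {
    ToolUse { name: String, args: String },
    Text(String),
}

trait Model {
    fn send(&mut self, transcript: &[String]) -> ModelReply;
}

trait Executor {
    fn call_tool(&mut self, name: &str, args: &str) -> Result<String, String>;
}

// Runs one user turn: keeps looping while the model requests tools (steps 1 and 2),
// then returns the final text so the caller can wait for more input (step 3).
fn run_turn(model: &mut dyn Model, exec: &mut dyn Executor, transcript: &mut Vec<String>) -> String {
    loop {
        match model.send(transcript.as_slice()) {
            ModelReply::ToolUse { name, args } => {
                // Success and failure are both reported back to the model.
                let outcome = match exec.call_tool(&name, &args) {
                    Ok(out) => format!("tool_result: {out}"),
                    Err(err) => format!("tool_error: {err}"),
                };
                transcript.push(outcome);
            }
            ModelReply::Text(text) => {
                transcript.push(format!("assistant: {text}"));
                return text;
            }
        }
    }
}

// Test doubles: a model that replays canned replies and an executor that echoes.
struct ScriptedModel {
    replies: VecDeque<ModelReply>,
}

impl ScriptedModel {
    fn new(replies: Vec<ModelReply>) -> Self {
        Self { replies: replies.into() }
    }
}

impl Model for ScriptedModel {
    fn send(&mut self, _transcript: &[String]) -> ModelReply {
        self.replies
            .pop_front()
            .unwrap_or_else(|| ModelReply::Text(String::new()))
    }
}

struct EchoExecutor;

impl Executor for EchoExecutor {
    fn call_tool(&mut self, name: &str, args: &str) -> Result<String, String> {
        Ok(format!("{name}({args})"))
    }
}
```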
### Harness Instantiation
The harness is instantiated with a system prompt and a tarpc client to the tool executor.
(In the first iteration we use an in-process channel for the tarpc client.)
### Model Integration
The harness uses a trait system to make it agnostic to the underlying model provider.
This trait unifies a variety of APIs using a `StreamEvent` interface for streaming responses
from the API.
Currently, only Anthropic's Claude API is supported.
Messages are constructed in a way that supports prompt caching when available.
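One way the provider trait and `StreamEvent` could fit together is sketched below. The `ModelProvider` and `StreamEvent` names appear in this document, but the variants, the method signature, `CannedProvider`, and `collect_text` are assumptions made for illustration. A blocking iterator stands in for what would presumably be an async stream.

```rust
// Assumed variants; a real provider would likely carry more detail.
#[derive(Debug, Clone, PartialEq)]
enum StreamEvent {
    TextDelta(String),
    ToolUse { name: String, input_json: String },
    Usage { input_tokens: u32, output_tokens: u32 },
    Done,
}

// Each provider translates its own wire format into the shared event type.
trait ModelProvider {
    fn stream(&mut self, messages: &[String]) -> Box<dyn Iterator<Item = StreamEvent> + '_>;
}

// Stand-in provider that replays fixed events instead of calling an API.
struct CannedProvider {
    events: Vec<StreamEvent>,
}

impl ModelProvider for CannedProvider {
    fn stream(&mut self, _messages: &[String]) -> Box<dyn Iterator<Item = StreamEvent> + '_> {
        Box::new(self.events.iter().cloned())
    }
}

// Provider-agnostic consumer: concatenates text deltas and sums output tokens.
fn collect_text(provider: &mut dyn ModelProvider, messages: &[String]) -> (String, u32) {
    let mut text = String::new();
    let mut output_tokens = 0;
    for event in provider.stream(messages) {
        match event {
            StreamEvent::TextDelta(chunk) => text.push_str(&chunk),
            StreamEvent::Usage { output_tokens: n, .. } => output_tokens += n,
            StreamEvent::ToolUse { .. } => {}
            StreamEvent::Done => break,
        }
    }
    (text, output_tokens)
}
```

Since `collect_text` depends only on the trait, adding a second provider would not require changes to the consumer.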
### Session Logging
- JSONL format, one event per line
- Events: user message, assistant message, tool call, tool result.
- Tree-addressable via parent IDs (enables conversation branching later)
- Token usage stored per event
- Linear UX for now, branching deferred
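To make the log format concrete, here is a minimal sketch of one event per line with parent IDs, plus a walk from a leaf back to the root. The field names, `LogEvent`, `to_jsonl_line`, and `path_to_root` are hypothetical, and the hand-rolled `escape` covers only a few characters; a real implementation would more likely use a JSON library.

```rust
use std::collections::HashMap;

// Hypothetical event shape; field names are assumptions.
struct LogEvent {
    id: u64,
    parent: Option<u64>,
    kind: &'static str,
    tokens: u32,
    body: String,
}

// Minimal escaping for the sketch only (quotes, backslashes, newlines).
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

// Serializes one event as a single JSONL line (no trailing newline).
fn to_jsonl_line(e: &LogEvent) -> String {
    let parent = match e.parent {
        Some(p) => p.to_string(),
        None => "null".to_string(),
    };
    format!(
        "{{\"id\":{},\"parent\":{},\"kind\":\"{}\",\"tokens\":{},\"body\":\"{}\"}}",
        e.id,
        parent,
        e.kind,
        e.tokens,
        escape(&e.body)
    )
}

// Follows parent IDs from a leaf to the root, returning IDs in root-first order.
// Assumes the parent links contain no cycles.
fn path_to_root(events: &[LogEvent], leaf: u64) -> Vec<u64> {
    let parents: HashMap<u64, Option<u64>> = events.iter().map(|e| (e.id, e.parent)).collect();
    let mut path = Vec::new();
    let mut cursor = Some(leaf);
    while let Some(id) = cursor {
        path.push(id);
        cursor = parents.get(&id).copied().flatten();
    }
    path.reverse();
    path
}
```

With parent IDs stored this way, a later branch is just a second event pointing at an existing parent, so the linear UX can read the same file unchanged.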
## Executor Design
The key aspect of the executor design is that it is configured with sandbox permissions
that allow tool use without any user prompting. Either the tool use succeeds within the
sandbox and its result is returned to the model, or it fails and a permission error is
returned to the model. The sandboxing allows running arbitrary shell commands without prompting.
### Executor Interface
The executor interface exposed to the harness has the following methods.
- `list_available_tools`: takes no arguments and returns tool names, descriptions, and argument schemas.
- `call_tool`: takes a tool name and its arguments and returns either a result or an error.
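The two methods might look roughly like the trait below. It is written as a plain Rust trait rather than a tarpc service definition so that it compiles without external crates. `ToolInfo`, `ToolError`, and the `InMemoryExecutor` test double are hypothetical, and arguments are simplified to string key/value pairs.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct ToolInfo {
    name: String,
    description: String,
    argument_schema: String, // assumed to be a JSON schema carried as text
}

#[derive(Debug, PartialEq)]
enum ToolError {
    UnknownTool(String),
    PermissionDenied(String),
    Failed(String),
}

trait ToolExecutor {
    fn list_available_tools(&self) -> Vec<ToolInfo>;
    fn call_tool(&mut self, name: &str, args: &HashMap<String, String>) -> Result<String, ToolError>;
}

// Stand-in executor backed by a map instead of a sandboxed filesystem.
struct InMemoryExecutor {
    files: HashMap<String, String>,
    writable: bool,
}

impl ToolExecutor for InMemoryExecutor {
    fn list_available_tools(&self) -> Vec<ToolInfo> {
        vec![
            ToolInfo {
                name: "read_file".into(),
                description: "Read a file".into(),
                argument_schema: r#"{"path":"string"}"#.into(),
            },
            ToolInfo {
                name: "write_file".into(),
                description: "Write a file".into(),
                argument_schema: r#"{"path":"string","content":"string"}"#.into(),
            },
        ]
    }

    fn call_tool(&mut self, name: &str, args: &HashMap<String, String>) -> Result<String, ToolError> {
        // Looks up a required argument or reports which one is missing.
        let arg = |key: &str| {
            args.get(key)
                .cloned()
                .ok_or_else(|| ToolError::Failed(format!("missing argument: {key}")))
        };
        match name {
            "read_file" => {
                let path = arg("path")?;
                self.files
                    .get(&path)
                    .cloned()
                    .ok_or_else(|| ToolError::Failed(format!("no such file: {path}")))
            }
            "write_file" => {
                // No prompt: the call simply fails and the error goes back to the model.
                if !self.writable {
                    return Err(ToolError::PermissionDenied("write_file".into()));
                }
                let (path, content) = (arg("path")?, arg("content")?);
                self.files.insert(path, content);
                Ok("ok".into())
            }
            other => Err(ToolError::UnknownTool(other.into())),
        }
    }
}
```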
### Sandboxing
Sandboxing is done using the Linux kernel feature "Landlock".
This allows restricting file system access (either read only, read/write, or no access)
as well as network access (either on/off).
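The permission model described above can be pictured as a small policy value that is built before execution. The sketch below only models the policy in user space and makes no Landlock calls; translating it into kernel rules is left out. `FsAccess`, `SandboxPolicy`, and the `planning` preset are hypothetical names. Rules are treated as an allowlist where the most permissive matching grant applies, which is an assumption about how the policy would map onto Landlock.

```rust
use std::path::{Path, PathBuf};

// Ordered from least to most permissive so that `max` picks the widest grant.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum FsAccess {
    NoAccess,
    ReadOnly,
    ReadWrite,
}

struct SandboxPolicy {
    rules: Vec<(PathBuf, FsAccess)>,
    network: bool,
}

impl SandboxPolicy {
    // Hypothetical preset: read the project, write only the plan file, no network.
    fn planning(project_dir: &Path, plan_file: &Path) -> Self {
        Self {
            rules: vec![
                (project_dir.to_path_buf(), FsAccess::ReadOnly),
                (plan_file.to_path_buf(), FsAccess::ReadWrite),
            ],
            network: false,
        }
    }

    // Access is denied by default; every rule whose root contains `path`
    // contributes its grant and the most permissive one applies.
    fn access_for(&self, path: &Path) -> FsAccess {
        self.rules
            .iter()
            .filter(|(root, _)| path.starts_with(root))
            .map(|(_, access)| *access)
            .max()
            .unwrap_or(FsAccess::NoAccess)
    }
}
```

Keeping the policy as plain data makes it easy to unit test the rule construction without applying any kernel restrictions to the test process.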
## TUI Design
The bulk of the complexity of this coding agent is pushed to the TUI in this design.
The driving goals of the TUI are:
- Support (neo)vim style keyboard navigation and modal editing.
- Full progressive disclosure of information: high-level information is grokkable at a glance,
but full tool use and thinking traces can be expanded.
- Support for instantiating multiple different instances of the core harness (e.g. different
instantiations for code review vs planning vs implementation).
## UI
- **Agent view:** Tree-based hierarchy (not flat tabs) for sub-agent inspection
@ -24,12 +101,17 @@
- **Status bar:** Mode indicator, current activity (Plan/Execute), token totals, network policy state
## Planning Mode
- Distinct activity from execution — planner agent produces a plan file, does not execute
- Plan file is structured markdown: steps with descriptions, files involved, acceptance criteria
- Plan is reviewable and editable before execution (`:edit-plan` opens `$EDITOR`)
- User explicitly approves plan before execution begins
- Executor agent receives the plan file + project context, not the planning conversation
- Plan-step progress tracked during execution (complete/in-progress/failed)
In planning mode the TUI instantiates a harness with read access to the project directory
and write access to a single plan markdown file.
The TUI then provides a glue mechanism that pipes that plan into a new instantiation of the
harness in execute mode.
Additionally, we specify a schema for "surveys" that allow the model to ask the user questions about
the plan.
We also provide a hotkey (Ctrl+G or :edit-plan) that opens the plan in the user's `$EDITOR`.
## Sub-Agents
- Independent context windows with summary passed back to parent
@ -38,68 +120,3 @@
- Plan executor is a specialized sub-agent where the plan replaces the summary
- Direct user interaction with sub-agents deferred
## Tool System
- Built-in tool system with `Tool` trait
- Core tools: `read_file`, `write_file`, `edit_file`, `shell_exec`, `list_directory`, `search_files`
- Approval gates by risk level: auto-approve (reads), confirm (writes/shell), deny
- MCP not implemented but interface designed to allow future adapter
## Sandboxing
- **Landlock** (Linux kernel-level):
- Read-only: system-wide (`/`)
- Read-write: project directory, temp directory
- Network: blocked by default, toggleable via `:net on/off`
- Graceful degradation on older kernels
- All tool execution goes through `Sandbox` — tools never touch filesystem directly
## Session Logging
- JSONL format, one event per line
- Events: user message, assistant message, tool call, tool result, sub-agent spawn/result, plan created, plan step status
- Tree-addressable via parent IDs (enables conversation branching later)
- Token usage stored per event
- Linear UX for now, branching deferred
## Testing Strategy
### Unit Tests
- **`provider`:** SSE stream parsing from byte fixtures, message/tool serialization, `StreamEvent` variant correctness
- **`tools`:** Path canonicalization, traversal prevention, risk level classification, registry dispatch
- **`sandbox`:** Landlock policy construction, path validation logic (without applying kernel rules)
- **`core`:** Conversation tree operations (insert, query by parent, turn computation, token totals), orchestrator state machine transitions against mock `StreamEvent` sequences
- **`session`:** JSONL serialization roundtrips, parent ID chain reconstruction
- **`tui`:** Widget rendering via Ratatui `TestBackend`
### Integration Tests — Component Boundaries
- **Core ↔ Provider:** Mock `ModelProvider` replaying recorded API sessions (full SSE streams with tool use). Tests the complete orchestration loop deterministically without network.
- **Core ↔ TUI (channel boundary):** Orchestrator with mock provider connected to channels. Assert correct `UIEvent` sequence, inject `UserAction` messages, verify approval/denial flow.
- **Tools ↔ Sandbox:** Real file operations and shell commands in temp directories. Verify write confinement, path traversal rejection, network blocking. Skip Landlock-specific tests on older kernels in CI.
### Integration Tests — End to End
- **Recorded session replay:** Capture real Claude API HTTP request/response pairs, replay deterministically. Exercises full stack (core + channel + mock TUI) without cost or network dependency. Primary E2E test strategy.
- **Live API tests:** Small suite behind feature flag / env var. Verifies real API integration. Run manually before releases, not in CI.
### Benchmarking — SWE-bench
- **Target:** SWE-bench Verified (500 curated problems) as primary benchmark
- **Secondary:** SWE-bench Pro for testing planning mode on longer-horizon tasks
- **Approach:** Headless mode (core without TUI) produces unified diff patches, evaluated via SWE-bench Docker harness
- **Baseline:** mini-swe-agent (~100 lines Python, >74% on Verified) as calibration — if we score significantly below with same model, the issue is scaffolding
- **Cadence:** Milestone checks, not continuous CI (too expensive/slow)
- **Requirements:** x86_64, 120GB+ storage, 16GB RAM, 8 CPU cores
### Test Sequencing
- Phase 1: Unit tests for SSE parser, event types, message serialization
- Phase 4: Recorded session replay infrastructure (core loop complex enough to warrant it)
- Phase 6-7: Headless mode + first SWE-bench Verified run
## Configuration (Deferred)
- Single-user, hardcoded defaults for now
- Designed for later: global config, per-project `.agent.toml`, configurable keybindings
## Deferred Features
- Conversation branching (tree structure in log, linear UX for now)
- Direct sub-agent interaction
- MCP adapter
- Full markdown/syntax-highlighted rendering
- Session log viewer
- Per-project configuration
- Structured plan editor in TUI (use `$EDITOR` for now)