skate/DESIGN.md

Design Decisions

Stack

  • Language: Rust
  • TUI Framework: Ratatui + Crossterm
  • Async Runtime: Tokio

Architecture

  • Channel boundary between TUI and core (fully decoupled)
  • Module decomposition: app, tui, core, provider, tools, sandbox, session
  • Headless mode: core without TUI, driven by a script (enables benchmarking and CI)
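The channel boundary can be sketched as below. `UIEvent` and `UserAction` are the names used in the testing section later in this document, but their variants here are illustrative, and `std::sync::mpsc` stands in for the tokio channels the stack actually uses:

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative variants; the real event types are richer.
#[derive(Debug, PartialEq)]
pub enum UIEvent {
    AssistantText(String),
    TurnComplete,
}

pub enum UserAction {
    Submit(String),
    Quit,
}

// Core-side handling of one user action, kept pure for testability.
pub fn core_step(action: &UserAction) -> Vec<UIEvent> {
    match action {
        UserAction::Submit(text) => vec![
            UIEvent::AssistantText(format!("echo: {text}")),
            UIEvent::TurnComplete,
        ],
        UserAction::Quit => Vec::new(),
    }
}

fn main() {
    let (ui_tx, ui_rx) = mpsc::channel::<UIEvent>();
    let (action_tx, action_rx) = mpsc::channel::<UserAction>();

    // Core runs on its own task/thread and never touches the TUI directly.
    let core = thread::spawn(move || {
        while let Ok(action) = action_rx.recv() {
            if matches!(action, UserAction::Quit) {
                break;
            }
            for event in core_step(&action) {
                ui_tx.send(event).unwrap();
            }
        }
    });

    // "TUI" side: drives the core purely through the channel boundary.
    action_tx.send(UserAction::Submit("hi".into())).unwrap();
    assert_eq!(ui_rx.recv().unwrap(), UIEvent::AssistantText("echo: hi".into()));
    assert_eq!(ui_rx.recv().unwrap(), UIEvent::TurnComplete);
    action_tx.send(UserAction::Quit).unwrap();
    core.join().unwrap();
}
```

Because the TUI only ever sees channel endpoints, headless mode is just a different consumer of the same two channels.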

Model Integration

  • Claude-first, multi-model via ModelProvider trait
  • Common StreamEvent internal representation across providers
  • Prompt-caching-aware message construction
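A minimal sketch of the `ModelProvider` trait and the shared `StreamEvent` representation; the variant and method names are illustrative, and the real trait would stream events asynchronously rather than return a `Vec`:

```rust
// Every provider is normalized to this one event type, so the core loop
// never sees provider-specific wire formats. Variants are illustrative.
#[derive(Debug, PartialEq)]
pub enum StreamEvent {
    TextDelta(String),
    ToolCall { name: String, input: String },
    Done { input_tokens: u32, output_tokens: u32 },
}

pub trait ModelProvider {
    fn name(&self) -> &str;
    // Simplified to a synchronous Vec for the sketch.
    fn complete(&self, prompt: &str) -> Vec<StreamEvent>;
}

// A mock provider like this is also the backbone of the deterministic
// integration tests described later in this document.
pub struct MockProvider;

impl ModelProvider for MockProvider {
    fn name(&self) -> &str { "mock" }
    fn complete(&self, prompt: &str) -> Vec<StreamEvent> {
        vec![
            StreamEvent::TextDelta(format!("reply to: {prompt}")),
            StreamEvent::Done { input_tokens: 3, output_tokens: 5 },
        ]
    }
}

fn main() {
    let provider = MockProvider;
    let events = provider.complete("hello");
    assert!(matches!(events.last(), Some(StreamEvent::Done { .. })));
    println!("{} produced {} events", provider.name(), events.len());
}
```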

UI

  • Agent view: Tree-based hierarchy (not flat tabs) for sub-agent inspection
  • Modes: Normal, Insert, Command (: prefix from Normal mode)
  • Activity modes: Plan and Execute are visually distinct activities in the TUI
  • Streaming: Barebones styled text initially, full markdown rendering deferred
  • Token usage: Per-turn display (between user inputs), cumulative in status bar
  • Status bar: Mode indicator, current activity (Plan/Execute), token totals, network policy state
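The mode state machine is small enough to sketch. The `:` prefix from Normal mode is from this design; `i` to enter Insert and Esc to return are assumed vim-style bindings, not decided here:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Mode { Normal, Insert, Command }

// Pure transition function: current mode + key -> next mode.
pub fn on_key(mode: Mode, key: char) -> Mode {
    const ESC: char = '\u{1b}';
    match (mode, key) {
        (Mode::Normal, 'i') => Mode::Insert,          // assumed binding
        (Mode::Normal, ':') => Mode::Command,         // from the design
        (Mode::Insert, k) | (Mode::Command, k) if k == ESC => Mode::Normal,
        (m, _) => m,                                  // everything else: stay put
    }
}

fn main() {
    assert_eq!(on_key(Mode::Normal, ':'), Mode::Command);
    assert_eq!(on_key(Mode::Normal, 'i'), Mode::Insert);
    assert_eq!(on_key(Mode::Insert, '\u{1b}'), Mode::Normal);
}
```

Keeping the transition pure makes it trivial to cover in the `tui` unit tests alongside widget rendering.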

Planning Mode

  • Distinct activity from execution — planner agent produces a plan file, does not execute
  • Plan file is structured markdown: steps with descriptions, files involved, acceptance criteria
  • Plan is reviewable and editable before execution (:edit-plan opens $EDITOR)
  • User explicitly approves plan before execution begins
  • Executor agent receives the plan file + project context, not the planning conversation
  • Plan-step progress tracked during execution (complete/in-progress/failed)
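A sketch of the in-memory plan shape, mirroring the structured-markdown fields above. Field names are illustrative; the design names complete/in-progress/failed statuses, and `Pending` is added here for steps not yet started:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StepStatus { Pending, InProgress, Complete, Failed }

// One plan step: description, files involved, acceptance criteria, status.
pub struct PlanStep {
    pub description: String,
    pub files: Vec<String>,
    pub acceptance: String,
    pub status: StepStatus,
}

pub struct Plan { pub steps: Vec<PlanStep> }

impl Plan {
    // (complete, total), as the executor would report progress in the TUI.
    pub fn progress(&self) -> (usize, usize) {
        let done = self.steps.iter()
            .filter(|s| s.status == StepStatus::Complete)
            .count();
        (done, self.steps.len())
    }
}

fn main() {
    let plan = Plan {
        steps: vec![
            PlanStep {
                description: "add failing test".into(),
                files: vec!["src/lib.rs".into()],
                acceptance: "test compiles and fails".into(),
                status: StepStatus::Complete,
            },
            PlanStep {
                description: "implement fix".into(),
                files: vec!["src/lib.rs".into()],
                acceptance: "test passes".into(),
                status: StepStatus::InProgress,
            },
        ],
    };
    assert_eq!(plan.progress(), (1, 2));
}
```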

Sub-Agents

  • Independent context windows with summary passed back to parent
  • Fully autonomous once spawned
  • Hard deny on unpermitted actions
  • Plan executor is a specialized sub-agent where the plan replaces the summary
  • Direct user interaction with sub-agents deferred

Tool System

  • Built-in tool system with Tool trait
  • Core tools: read_file, write_file, edit_file, shell_exec, list_directory, search_files
  • Approval gates by risk level: auto-approve (reads), confirm (writes/shell), deny
  • MCP not implemented but interface designed to allow future adapter
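A minimal sketch of the `Tool` trait and the approval gate; `RiskLevel` and `approved` are illustrative names for the mechanism described above:

```rust
use std::fs;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RiskLevel { AutoApprove, Confirm, Deny }

pub trait Tool {
    fn name(&self) -> &str;
    fn risk(&self) -> RiskLevel;
    fn run(&self, input: &str) -> Result<String, String>;
}

// read_file is a read, so it classifies as auto-approve.
pub struct ReadFile;

impl Tool for ReadFile {
    fn name(&self) -> &str { "read_file" }
    fn risk(&self) -> RiskLevel { RiskLevel::AutoApprove }
    fn run(&self, path: &str) -> Result<String, String> {
        fs::read_to_string(path).map_err(|e| e.to_string())
    }
}

// The gate: reads pass automatically, writes/shell need user confirmation,
// denied tools never run.
pub fn approved(tool: &dyn Tool, user_confirmed: bool) -> bool {
    match tool.risk() {
        RiskLevel::AutoApprove => true,
        RiskLevel::Confirm => user_confirmed,
        RiskLevel::Deny => false,
    }
}

fn main() {
    let read = ReadFile;
    assert!(approved(&read, false));
    println!("{} gate passed", read.name());
}
```

An MCP adapter would slot in as another implementor of `Tool` (or a registry that wraps remote tools in it), which is why the trait boundary is kept this narrow.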

Sandboxing

  • Landlock (Linux kernel-level):
    • Read-only: system-wide (/)
    • Read-write: project directory, temp directory
    • Network: blocked by default, toggleable via :net on/off
  • Graceful degradation on older kernels
  • All tool execution goes through Sandbox — tools never touch filesystem directly
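The path-validation layer (the part unit-testable without applying kernel rules, per the testing section below) might look like this sketch. Note that `canonicalize` fails for files that don't exist yet, so a real write path would canonicalize the parent directory instead; this simplification is deliberate:

```rust
use std::path::{Path, PathBuf};

// Confinement check: is `candidate` inside one of the allowed roots?
// canonicalize() resolves `..` and symlinks, defeating traversal tricks.
// Actual enforcement still happens via Landlock; this is defense in depth.
pub fn is_write_allowed(candidate: &Path, allowed_roots: &[PathBuf]) -> bool {
    match candidate.canonicalize() {
        Ok(real) => allowed_roots.iter().any(|root| real.starts_with(root)),
        Err(_) => false, // unresolvable paths are rejected outright
    }
}

fn main() {
    let project = std::env::temp_dir().canonicalize().unwrap();
    assert!(is_write_allowed(&project, &[project.clone()]));
    assert!(!is_write_allowed(Path::new("/no/such/dir"), &[project]));
}
```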

Session Logging

  • JSONL format, one event per line
  • Events: user message, assistant message, tool call, tool result, sub-agent spawn/result, plan created, plan step status
  • Tree-addressable via parent IDs (enables conversation branching later)
  • Token usage stored per event
  • Linear UX for now, branching deferred
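One event per line might serialize as below. The JSON is hand-rolled here to keep the sketch dependency-free; the real implementation would use serde, and the field set is illustrative:

```rust
pub struct SessionEvent {
    pub id: u64,
    pub parent: Option<u64>, // parent ID makes the log tree-addressable
    pub kind: &'static str,  // e.g. "user_message", "tool_call"
    pub tokens: u32,         // token usage stored per event
}

// One JSONL line for one event.
pub fn to_jsonl(e: &SessionEvent) -> String {
    let parent = e.parent.map_or("null".to_string(), |p| p.to_string());
    format!(
        r#"{{"id":{},"parent":{},"kind":"{}","tokens":{}}}"#,
        e.id, parent, e.kind, e.tokens
    )
}

fn main() {
    let event = SessionEvent { id: 2, parent: Some(1), kind: "tool_call", tokens: 17 };
    assert_eq!(
        to_jsonl(&event),
        r#"{"id":2,"parent":1,"kind":"tool_call","tokens":17}"#
    );
}
```

Branching later is just two events sharing the same `parent`; the linear UX simply never creates that situation yet.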

Testing Strategy

Unit Tests

  • provider: SSE stream parsing from byte fixtures, message/tool serialization, StreamEvent variant correctness
  • tools: Path canonicalization, traversal prevention, risk level classification, registry dispatch
  • sandbox: Landlock policy construction, path validation logic (without applying kernel rules)
  • core: Conversation tree operations (insert, query by parent, turn computation, token totals), orchestrator state machine transitions against mock StreamEvent sequences
  • session: JSONL serialization roundtrips, parent ID chain reconstruction
  • tui: Widget rendering via Ratatui TestBackend
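The conversation-tree operations listed for `core` can be sketched to show the unit-test surface; structure and field names are illustrative:

```rust
use std::collections::HashMap;

pub struct Node {
    pub parent: Option<u64>,
    pub tokens: u32,
}

#[derive(Default)]
pub struct ConversationTree {
    nodes: HashMap<u64, Node>,
}

impl ConversationTree {
    pub fn insert(&mut self, id: u64, parent: Option<u64>, tokens: u32) {
        self.nodes.insert(id, Node { parent, tokens });
    }

    // Query by parent; sorted so test assertions are deterministic.
    pub fn children(&self, parent: u64) -> Vec<u64> {
        let mut ids: Vec<u64> = self.nodes.iter()
            .filter(|(_, n)| n.parent == Some(parent))
            .map(|(&id, _)| id)
            .collect();
        ids.sort_unstable();
        ids
    }

    pub fn token_total(&self) -> u32 {
        self.nodes.values().map(|n| n.tokens).sum()
    }
}

fn main() {
    let mut tree = ConversationTree::default();
    tree.insert(1, None, 10);     // user message
    tree.insert(2, Some(1), 25);  // assistant reply
    tree.insert(3, Some(1), 12);  // sibling branch (deferred in the UX)
    assert_eq!(tree.children(1), vec![2, 3]);
    assert_eq!(tree.token_total(), 47);
}
```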

Integration Tests — Component Boundaries

  • Core ↔ Provider: Mock ModelProvider replaying recorded API sessions (full SSE streams with tool use). Tests the complete orchestration loop deterministically without network.
  • Core ↔ TUI (channel boundary): Orchestrator with mock provider connected to channels. Assert correct UIEvent sequence, inject UserAction messages, verify approval/denial flow.
  • Tools ↔ Sandbox: Real file operations and shell commands in temp directories. Verify write confinement, path traversal rejection, network blocking. Skip Landlock-specific tests on older kernels in CI.

Integration Tests — End to End

  • Recorded session replay: Capture real Claude API HTTP request/response pairs, replay deterministically. Exercises full stack (core + channel + mock TUI) without cost or network dependency. Primary E2E test strategy.
  • Live API tests: Small suite behind feature flag / env var. Verifies real API integration. Run manually before releases, not in CI.

Benchmarking — SWE-bench

  • Target: SWE-bench Verified (500 curated problems) as primary benchmark
  • Secondary: SWE-bench Pro for testing planning mode on longer-horizon tasks
  • Approach: Headless mode (core without TUI) produces unified diff patches, evaluated via SWE-bench Docker harness
  • Baseline: mini-swe-agent (~100 lines Python, >74% on Verified) as calibration — if we score significantly below it with the same model, the issue is our scaffolding
  • Cadence: Milestone checks, not continuous CI (too expensive/slow)
  • Requirements: x86_64, 120GB+ storage, 16GB RAM, 8 CPU cores

Test Sequencing

  • Phase 1: Unit tests for SSE parser, event types, message serialization
  • Phase 4: Recorded session replay infrastructure (core loop complex enough to warrant it)
  • Phases 6-7: Headless mode + first SWE-bench Verified run

Configuration (Deferred)

  • Single-user, hardcoded defaults for now
  • Designed for later: global config, per-project .agent.toml, configurable keybindings

Deferred Features

  • Conversation branching (tree structure in log, linear UX for now)
  • Direct sub-agent interaction
  • MCP adapter
  • Full markdown/syntax-highlighted rendering
  • Session log viewer
  • Per-project configuration
  • Structured plan editor in TUI (use $EDITOR for now)