Architecture¶
A single HTTP server process receives GitHub webhook events, acknowledges within ten seconds, and asynchronously hands each event to a daemon for execution. Every event walks the same path: verify → route → classify → enqueue → daemon claims the job → run the pipeline → finalise the tracking comment.
Request flow¶
flowchart TD
GH["GitHub webhook<br/>POST /api/github/webhooks"]:::entry
VERIFY["Verify HMAC-SHA256"]:::guard
ACK["200 OK within 10 seconds"]:::ack
ROUTE["Router<br/>idempotency + allowlist + concurrency"]:::guard
TR["Haiku triage<br/>binary heavy classifier"]:::decide
QUEUE["Orchestrator job queue<br/>Valkey list"]:::store
SCALE{{"Scale-up decision<br/>heavy OR queue >= threshold<br/>AND no persistent slots<br/>AND cooldown elapsed"}}:::fork
SPAWN["K8s API<br/>create bare Pod<br/>DAEMON_EPHEMERAL=true"]:::decide
FLEET["Daemon fleet<br/>persistent + ephemeral<br/>WebSocket connections"]:::target
PIPE["runPipeline<br/>clone + prompt + Claude Agent SDK"]:::work
FIN["Finalise tracking comment<br/>success, error, or cost summary"]:::done
GH --> VERIFY --> ACK
ACK -. async .-> ROUTE
ROUTE --> TR
TR --> QUEUE
QUEUE --> SCALE
SCALE -->|yes| SPAWN
SPAWN --> FLEET
SCALE -->|no, or cooldown active| FLEET
QUEUE -->|JobOffer| FLEET
FLEET --> PIPE
PIPE --> FIN
classDef entry fill:#0b5cad,stroke:#083e74,color:#ffffff
classDef guard fill:#164a3a,stroke:#0d2c24,color:#ffffff
classDef ack fill:#2a6f2a,stroke:#1a4d1a,color:#ffffff
classDef decide fill:#8a5a00,stroke:#5c3d00,color:#ffffff
classDef fork fill:#6a2080,stroke:#451454,color:#ffffff
classDef target fill:#114a82,stroke:#0a2f56,color:#ffffff
classDef work fill:#4a2e7a,stroke:#311f50,color:#ffffff
classDef store fill:#5c3d00,stroke:#3d2900,color:#ffffff
classDef done fill:#2a6f2a,stroke:#1a4d1a,color:#ffffff
Key concepts¶
- Async processing. The webhook handler responds within ten seconds, so the router fires
processRequestwith fire-and-forget semantics after the 200 OK is queued. Every box downstream ofACKruns after the HTTP response is on the wire. - Two-layer idempotency. The fast path is an in-memory
Mapkeyed byX-GitHub-Delivery. The durable path (isAlreadyProcessedinsrc/core/tracking-comment.ts) scans GitHub issue/PR comments for the hidden delivery marker the bot embeds in the tracking comment, so duplicate deliveries are detected across pod restarts, OOM kills, and crash loops — this works withoutDATABASE_URL.DATABASE_URLis required only to persist execution / dispatch history across restarts. - One request, one clone. Each delivery clones the repo into a unique temp directory under
CLONE_BASE_DIRon the daemon host. Claude operates on local files viacwd. On PR events the checkout supplementally fetchesorigin/<baseBranch>(when it differs from the head ref) so the agent'sgit diff origin/<baseBranch>...HEADandgit rebase origin/<baseBranch>directives resolve first try. A sibling${workDir}-artifactsdirectory is created outside the checkout and exposed to the agent asBOT_ARTIFACT_DIR— workflow summary files (IMPLEMENT.md / REVIEW.md / RESOLVE.md) are written there so they can never be picked up by agit addinside the clone. Both directories are removed in the pipeline'sfinallyblock regardless of outcome. - GitHub credential resolution.
src/core/github-token.ts:resolveGithubToken()is the single source of the GitHub credential the daemon uses. Default is an App installation token minted just-in-time from the cachedAppsingleton insrc/orchestrator/connection-handler.ts. WhenGITHUB_PERSONAL_ACCESS_TOKENis set, the helper short-circuits and returns the PAT instead — API/git authentication runs as the PAT owner. Commit author/committer metadata is not affected;src/core/checkout.tshard-pins gituser.name/user.emailtochrisleekr-bot[bot]so commit objects still carry the bot identity. The git credential helper, executorGH_TOKEN/GITHUB_TOKENenv vars, and MCP server env all consume the resolved string without caring about its source. - The webhook server never runs the pipeline. Only daemons execute
runPipeline. The webhook server is the orchestrator — it enqueues jobs and optionally spawns ephemeral daemons. - Every orchestrator runs a queue worker.
src/orchestrator/queue-worker.tspollsqueue:jobsviaLMOVEinto a per-instance processing list (queue:processing:{instanceId}), offers the job to a locally-connected daemon, and atomically re-queues it to the head when no local daemon can take it. Multi-orchestrator HA:LMOVEgrants exactly-once claim across instances; the offer/accept round-trip stays in-process. Crash recovery is handled by each orchestrator draining its own processing list at startup, plus a cross-instance reaper (src/orchestrator/valkey-cleanup.ts) draining processing lists owned by instances whoseorchestrator:{id}:aliveliveness key has expired. - MCP servers. Tracking-comment updates, inline PR reviews, scoped review-thread resolves, daemon-capability reports, repo-memory, and (optionally) Context7 library docs are exposed as MCP servers the agent can call. Git changes are made via the Bash tool against the cloned repo, not through a dedicated MCP server.
Dispatch flow¶
Dispatch collapsed to a single target — daemon — in migration 004_collapse_dispatch_to_daemon.sql. Every job is claimed by some daemon in the fleet over WebSocket. The router decides only the reason the job lands there and whether to spawn an ephemeral daemon.
Single target, four reasons¶
Canonical source: src/shared/dispatch-types.ts.
DispatchTarget="daemon"(singleton — kept as a field for DB/log stability).DispatchReasonis one of:
| Reason | When the router sets it |
|---|---|
persistent-daemon |
Routed to an existing persistent daemon. The default, hot path. |
ephemeral-daemon-triage |
Triage flagged the job heavy → orchestrator spawned an ephemeral daemon Pod. |
ephemeral-daemon-overflow |
Queue length ≥ EPHEMERAL_DAEMON_SPAWN_QUEUE_THRESHOLD and persistent pool saturated → spawn drains overflow. |
ephemeral-spawn-failed |
Spawn was required but the K8s API call failed. Job rejected with a tracking-comment infra error. |
Scale-up model¶
The fleet is two-tiered — see ../operate/runbooks/daemon-fleet.md for the operational view. The decision rule:
- Triage. Single-turn Haiku call returns
{heavy, confidence, rationale}.heavy=trueis one trigger. - Overflow.
queue_length ≥ EPHEMERAL_DAEMON_SPAWN_QUEUE_THRESHOLDand persistent free slots = 0 is the other trigger. - Cooldown. Spawns are rate-limited by
EPHEMERAL_DAEMON_SPAWN_COOLDOWN_MS. During cooldown, heavy/overflow signals do not spawn — the job falls back topersistent-daemonand waits. - Spawn. When both a trigger fires and cooldown has elapsed, the orchestrator calls the K8s API to create a bare Pod with
DAEMON_EPHEMERAL=true. Only a true K8s API failure yieldsephemeral-spawn-failed.
The newly-spawned ephemeral daemon connects via WebSocket, registers with isEphemeral: true, claims the job, runs it, then drains and exits after EPHEMERAL_DAEMON_IDLE_TIMEOUT_MS.
WebSocket protocol¶
Schema in src/shared/ws-messages.ts (Zod discriminated union). Validation failures close the WebSocket with POLICY_VIOLATION. Every message has an envelope with id (UUID) and timestamp (ms).
Server → Daemon¶
| Type | Purpose |
|---|---|
daemon:registered |
Handshake response after daemon:register; carries heartbeatIntervalMs, offerTimeoutMs, maxRetries. |
heartbeat:ping |
Periodic liveness ping. |
job:offer |
Offer a workflow-run job. |
scoped-job-offer |
Offer a scoped job (scoped-rebase, scoped-fix-thread, scoped-explain-thread, scoped-open-pr). |
job:payload |
Full BotContext plus overrides (maxTurns, allowedTools, trackingCommentId). Sent after accept. |
job:cancel |
Abort a running job. |
daemon:update-required |
Daemon version mismatch; force exit. |
Daemon → Server¶
| Type | Purpose |
|---|---|
daemon:register |
Initial registration with capabilities, resources, isEphemeral, protocolVersion, appVersion. |
heartbeat:pong |
Refresh TTL; carries activeJobs, current resources. |
job:accept |
Claim an offered job. |
job:reject |
Decline with reason (scoped-kind-unsupported, resource-insufficient, …). |
job:status |
Mid-run progress. |
job:result |
Workflow-run completion with ExecutionResult fields. |
scoped-job:completion |
Scoped job result with kind-specific fields. |
daemon:draining |
Graceful shutdown initiated. |
daemon:update-acknowledged |
Ack for daemon:update-required. |
error |
Generic error envelope. |
PR shepherding bridge¶
The bot:ship lifecycle does not own a separate daemon-execution path. It bridges onto the existing workflow_runs pipeline so a single executor surface clones, runs the Agent SDK, and pushes — no parallel implementation to drift between.
flowchart LR
Trigger["bot:ship trigger<br/>literal / NL / label"]:::input
SR["session-runner.ts"]:::core
Intent[("ship_intents row")]:::store
Cont[("ship_continuations row<br/>wake_at")]:::store
Iter["iteration.runIteration"]:::core
WR[("workflow_runs row<br/>state.shipIntentId")]:::store
Q[("queue:jobs<br/>kind=workflow-run")]:::store
Daemon["Daemon process"]:::work
Exec["src/core/pipeline.ts"]:::work
Done["markSucceeded(runId)"]:::work
Cascade["orchestrator.onStepComplete<br/>maybeEarlyWakeShipIntent"]:::core
Tickle[("ship:tickle ZSET<br/>score=0")]:::store
Timer["tickle-scheduler<br/>setInterval"]:::core
Due["ZRANGEBYSCORE 0 now"]:::core
Resume["session-runner.resumeShipIntent"]:::core
Trigger --> SR --> Intent
SR --> Cont
SR --> Iter --> WR
Iter --> Q
Q --> Daemon --> Exec --> Done --> Cascade --> Tickle
Timer --> Due --> Resume
Tickle --> Due
Resume -. next iteration .-> Iter
classDef input fill:#1f6feb,stroke:#0b3d99,color:#ffffff
classDef core fill:#8957e5,stroke:#4c2889,color:#ffffff
classDef work fill:#7a3b1f,stroke:#3d1d0e,color:#ffffff
classDef store fill:#0e8a16,stroke:#063d09,color:#ffffff
The reactor (fanOut) writes wake_at = now() and ZADD ship:tickle 0 <intent_id> so the next cron tick (typically under 30 s) re-enters the runner. This keeps daemon slots free between iterations and gives the bot crash-restart safety: on boot, tickle-scheduler reconciles missed wakes from Postgres into Valkey before the periodic timer's first tick.
Directory layout¶
| Directory | Responsibility |
|---|---|
src/webhook/ |
Event routing (router.ts) and per-event handlers (events/, one file per event type). |
src/core/ |
Pipeline: context → fetch → format → prompt → checkout → execute → finalise. pipeline.ts is the single execution path (daemon-side). |
src/ai/ |
Provider-agnostic LLM client (Anthropic + Bedrock) used by triage and the intent / NL classifiers. |
src/orchestrator/ |
WebSocket server, daemon registry, job queue, dispatcher, triage, ephemeral-daemon scaler. Embedded in the webhook server process. |
src/daemon/ |
Standalone worker process (persistent or ephemeral). WebSocket client that accepts offers and runs pipeline.ts. |
src/k8s/ |
Ephemeral daemon Pod spawner. |
src/mcp/ |
MCP server registry. |
src/workflows/ |
Registry, dispatcher, composite cascade, ship lifecycle (ship/), per-workflow handlers (handlers/). |
src/db/ |
Postgres layer. Migrations, connection singleton, observability queries. Active when DATABASE_URL is set. |
src/shared/ |
Types shared between server and daemon (WebSocket messages, dispatch enums). |
src/utils/ |
Retry, sanitisation, circuit breaker. |
Further reading¶
- Workflows — registry-driven
bot:*commands. Source of truth:src/workflows/registry.ts. bot:shiplifecycle — verdict ladder, status state machine.- Daemon fleet runbook — persistent vs ephemeral, scaling, K8s.
- Configuration — every environment variable.
- Extending — add a workflow or MCP server.