# Observability
Structured JSON logs via pino are the primary signal. Every dispatch decision and every pipeline step carries a `deliveryId` so you can reconstruct a request end-to-end from a single log query. When `DATABASE_URL` is configured, the same information is persisted to `executions` and `triage_results` for aggregate reporting.
## Log redaction
The exported logger in `src/logger.ts` is the canonical chokepoint for secret scrubbing — every child logger inherits its `redact.paths` list and its custom `err` serializer, so individual call sites do not need to remember to scrub. Two layers run on every emitted line:

- **Path-based redaction** — the exported `REDACT_PATHS` constant in `src/logger.ts` lists every field pino should replace with `[Redacted]` before the JSON is serialised. Paths covered: `authorization` and its `*.authorization` / `headers.authorization` / `*.headers.authorization` / `req.headers.authorization` / `request.headers.authorization` variants; the webhook signature header `x-hub-signature-256` (also wildcard-prefixed); `response.data.token`; and the named credential fields `token`, `installationToken`, `privateKey`, `webhookSecret`, `anthropicApiKey`, `claudeCodeOauthToken`, `daemonAuthToken`, `awsSecretAccessKey`, `awsSessionToken`, `awsBearerTokenBedrock`, `*.password`. The list is `Object.freeze`d so an accidental `push` from another module cannot silently weaken the policy.
- **`errSerializer` scrubbing** — the exported `errSerializer` in `src/logger.ts` defers to pino's `stdSerializers.err` and then runs the result's `message`, `stack`, `request.headers`, and `response.data` through `redactGitHubTokens` (`src/utils/sanitize.ts`) plus an inline credential-URL scrubber that mirrors `redactValkeyUrl` (`src/orchestrator/valkey.ts`). The walker recurses through nested objects/arrays and replaces any key matching the sensitive-field-name set wholesale, so `err.response.data.meta.token` and `err.request.headers.forwarded.authorization` are caught at any depth — this is necessary because pino's path-based rules cannot match four-or-more segments deep on `err.*`. It also catches `ghs_…` installation tokens and App JWTs echoed inside `err.message`/`err.stack`. The serializer operates on a copy, so the original Error instance is never mutated.
If you add a new secret-bearing config field to `src/config.ts`, add its property name to `REDACT_PATHS` in the same PR. The standalone helpers `redactGitHubTokens` and `redactValkeyUrl` remain in place for their non-log call sites (prompt sanitisation and the Valkey startup info log, respectively); the logger config is the system-wide default.
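For orientation, a minimal sketch of that chokepoint wiring, assuming pino's standard `redact` and `serializers` options. The path list here is abbreviated and the scrub covers only `message`/`stack`; `src/logger.ts` remains the authority:

```typescript
import pino from "pino";
import { redactGitHubTokens } from "./utils/sanitize";

// Abbreviated for illustration — the real, frozen list lives in src/logger.ts.
const REDACT_PATHS = Object.freeze([
  "authorization",
  "*.authorization",
  "req.headers.authorization",
  "x-hub-signature-256",
  "token",
  "privateKey",
  "*.password",
]);

// Layer two: defer to pino's default error serializer, then scrub the copy
// it returns, so the original Error instance is never mutated.
// (The real serializer also walks request.headers / response.data recursively.)
const errSerializer = (err: Error) => {
  const serialized = pino.stdSerializers.err(err);
  serialized.message = redactGitHubTokens(serialized.message);
  if (serialized.stack) serialized.stack = redactGitHubTokens(serialized.stack);
  return serialized;
};

export const logger = pino({
  redact: { paths: [...REDACT_PATHS], censor: "[Redacted]" },
  serializers: { err: errSerializer },
});
```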
## Common log fields
| Field | Meaning |
|---|---|
| `deliveryId` | `X-GitHub-Delivery` header — stable across every log line for a single webhook. |
| `event` | GitHub event name (`pull_request`, `issue_comment`, …) or canonical event key for ship workflow logs. |
| `repo` | `owner/name` of the triggering repo. |
| `dispatch_target` | Always `daemon` (singleton — kept as a field for DB/log stability). |
| `dispatch_reason` | Why the job landed where it did. See Dispatch reasons below. |
| `isEphemeral` | Present on daemon-originating log lines. `true` if emitted by an ephemeral daemon. |
| `triage_fallback_reason` | Only present on triage fallbacks — see `runbooks/triage.md`. |
| `confidence`, `heavy`, `rationale` | Triage outputs on success. |
| `cost_usd` | Agent-reported total cost from the SDK. |
| `workflowRunId`, `workflowName` | UUID of the `workflow_runs` row + workflow name. Stable per run. |
| `intentWorkflow`, `intentConfidence` | Intent-classifier verdict and confidence for comment triggers. |
| `branch`, `depth` | Initial clone target — emitted by `checkout.ts` on the `Cloning repository` line. |
| `baseBranch`, `headBranch` | PR base + head ref — emitted on `Fetching PR base branch` (info) and the matching warn on fetch fail. |
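Put together, a single dispatch-decision line might look like this (values and the `msg` string are illustrative, not captured from a real deployment):

```json
{
  "level": "info",
  "msg": "dispatch decision",
  "deliveryId": "72d3162e-cc78-11e3-81ab-4c9367dc0958",
  "event": "pull_request",
  "repo": "acme/widgets",
  "dispatch_target": "daemon",
  "dispatch_reason": "persistent-daemon",
  "confidence": 0.92,
  "heavy": false,
  "rationale": "small docs-only diff"
}
```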
## Ship workflow log fields
The shepherding lifecycle emits structured pino lines validated against the canonical Zod schema in `src/workflows/ship/log-fields.ts`. Field names and types are pinned so emitters cannot drift.
| Field | Type | When present |
|---|---|---|
| `event` | string (e.g. `ship.intent.transition`, `ship.probe.run`, `ship.reactor.fanout`) | Always. |
| `intent_id` | UUID | Always. |
| `pr` | `{owner, repo, number, installation_id}` | Always. |
| `iteration_n` | non-negative int | Always (`0` on pre-iteration events). |
| `phase` | `probe` \| `fix` \| `reply` \| `wait` \| `terminal` | Iteration events. |
| `from_status` / `to_status` | session status | Transition events only. |
| `terminal_blocker_category` | blocker category | Terminal `human_took_over` transitions. |
| `non_readiness_reason` | enum | Probe events with non-ready verdict. |
| `trigger_surface` | `literal` \| `nl` \| `label` | Session-start events only. |
| `principal_login` | string | Session-start events only. |
| `spent_usd_cents` | non-negative integer | Always — cumulative session spend in cents (integer to avoid binary floating-point drift in aggregations). |
| `wall_clock_ms` | non-negative integer | Always — cumulative session wall-clock time. |
| `delta_usd_cents` | non-negative integer | Per-event spend (iteration events only). |
| `delta_ms` | non-negative integer | Per-event wall-clock duration. |
The schema is the source of truth. Adding or renaming a field requires updating `src/workflows/ship/log-fields.ts`; the co-located test round-trips a sample through the schema and rejects unknown / mistyped fields.
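A sketch of what such a pinned schema can look like, following the field table above; the exact shape in `src/workflows/ship/log-fields.ts` is the authority:

```typescript
import { z } from "zod";

// Sketch only — the canonical schema lives in src/workflows/ship/log-fields.ts.
export const shipLogFields = z
  .object({
    event: z.string(),
    intent_id: z.string().uuid(),
    pr: z.object({
      owner: z.string(),
      repo: z.string(),
      number: z.number().int(),
      installation_id: z.number().int(),
    }),
    iteration_n: z.number().int().nonnegative(),
    spent_usd_cents: z.number().int().nonnegative(), // integer cents: no binary-fp drift
    wall_clock_ms: z.number().int().nonnegative(),
    phase: z.enum(["probe", "fix", "reply", "wait", "terminal"]).optional(),
    trigger_surface: z.enum(["literal", "nl", "label"]).optional(),
    delta_usd_cents: z.number().int().nonnegative().optional(),
    delta_ms: z.number().int().nonnegative().optional(),
  })
  .strict(); // strict() is what lets the round-trip test reject unknown fields
```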
## Iteration / tickle / scoped event keys
Every shepherding emitter draws its `event` value from the typed `SHIP_LOG_EVENTS` constant in `src/workflows/ship/log-fields.ts`, so operators can grep for these literals deterministically (a sketch of the constant follows the table).
| Event key | Where it fires | What it indicates |
|---|---|---|
| `ship.iteration.enqueued` | `iteration.runIteration` after `enqueueJob` | A non-ready verdict bridged into the daemon `workflow_runs` pipeline. One row per iteration. |
| `ship.iteration.terminal_cap` | `iteration.runIteration` cap check | The intent hit `MAX_SHIP_ITERATIONS`. |
| `ship.iteration.terminal_deadline` | `iteration.runIteration` deadline check | The intent's `deadline_at` elapsed. |
| `ship.tickle.started` | `app.ts` boot, after `tickleScheduler.start()` | The cron tickle scheduler is scanning `ship:tickle`. |
| `ship.tickle.due` | `orchestrator.onStepComplete` early-wake or `session-runner.resumeShipIntent` | An intent is being re-entered. `source` discriminates `workflow_run_completion` vs `scheduler`. |
| `ship.tickle.skip_terminal` | `orchestrator.onStepComplete` early-wake | The hook found a `shipIntentId` but the intent is already terminal; the `ZADD` was skipped. |
| `ship.scoped.<verb>.enqueued` | `dispatch-scoped.ts` after `enqueueJob` | A scoped command (`rebase`, `fix_thread`, `explain_thread`, `open_pr`) was enqueued. |
| `ship.scoped.<verb>.daemon.completed` | `connection-handler.handleScopedJobCompletion` and the executor | Daemon reported `succeeded`. |
| `ship.scoped.<verb>.daemon.failed` | Same | Daemon reported `halted` or `failed`. `reason` carries the structured halt reason. |
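As referenced above, a sketch of the typed-constant pattern behind those keys. The member names here are illustrative; `src/workflows/ship/log-fields.ts` holds the real constant:

```typescript
// Illustrative subset of the SHIP_LOG_EVENTS constant described above.
export const SHIP_LOG_EVENTS = {
  iterationEnqueued: "ship.iteration.enqueued",
  iterationTerminalCap: "ship.iteration.terminal_cap",
  iterationTerminalDeadline: "ship.iteration.terminal_deadline",
  tickleStarted: "ship.tickle.started",
  tickleDue: "ship.tickle.due",
  tickleSkipTerminal: "ship.tickle.skip_terminal",
} as const;

// `as const` pins each value to its string literal, so an emitter that
// typos a key fails to compile instead of producing an ungreppable event.
export type ShipLogEvent = (typeof SHIP_LOG_EVENTS)[keyof typeof SHIP_LOG_EVENTS];
```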
## Querying example (Datadog / Loki)
event:"ship.intent.transition" to_status:"human_took_over" terminal_blocker_category:"flake-cap"
| count by pr.repo
## Dispatch reasons
Canonical source: `src/shared/dispatch-types.ts`. Four values; all land on `dispatch_target=daemon`.
| Reason | When the router sets it |
|---|---|
| `persistent-daemon` | Routed to an existing persistent daemon. The default, hot path. Also used during cooldown when a scale-up was warranted but blocked by the cooldown window. |
| `ephemeral-daemon-triage` | Triage returned `heavy=true` and an ephemeral daemon Pod was spawned. |
| `ephemeral-daemon-overflow` | Queue length ≥ `EPHEMERAL_DAEMON_SPAWN_QUEUE_THRESHOLD` and the persistent pool has zero free slots; a spawn drains the overflow. |
| `ephemeral-spawn-failed` | A spawn was required but the K8s API call failed. The job is rejected with a tracking-comment infra error. |
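Because the set is closed, the canonical module can pin it as a literal union. A sketch, assuming a plain string-union definition in `src/shared/dispatch-types.ts`:

```typescript
// Sketch — the canonical definition lives in src/shared/dispatch-types.ts.
export type DispatchReason =
  | "persistent-daemon"
  | "ephemeral-daemon-triage"
  | "ephemeral-daemon-overflow"
  | "ephemeral-spawn-failed";

// Post-collapse there is exactly one target; the field survives for log/DB stability.
export type DispatchTarget = "daemon";
```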
## Aggregate reporting
When `DATABASE_URL` is set, helpers in `src/db/queries/dispatch-stats.ts` expose the most operator-relevant aggregates:

| Helper | Returns |
|---|---|
| `eventsPerTarget(days)` | Count of executions grouped by `dispatch_target`. Post-collapse this is always a single `daemon` row — query `dispatch_reason` directly for the per-reason split. |
| `triageRate(days)` | Share of events whose `dispatch_reason` is `ephemeral-daemon-triage`. |
| `avgConfidenceAndFallback(days)` | Mean triage confidence plus fallback counts by reason. |
| `triageSpend(days)` | Cumulative `cost_usd` for triage-reached executions. |

Call them from an internal admin endpoint, a scheduled job, or `bun repl`.
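For example, an ad-hoc check from `bun repl` could look like this (helper names from the table above; the import path and return shapes are assumptions):

```typescript
// Hypothetical bun repl session against the dispatch-stats helpers.
import { triageRate, triageSpend } from "./src/db/queries/dispatch-stats";

const days = 7;
console.log(await triageRate(days));  // e.g. 0.12 → 12 % of executions hit triage spawns
console.log(await triageSpend(days)); // cumulative cost_usd over the same window
```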
## Alerts worth having
- **Triage error rate.** `parse-error` + `llm-error` + `timeout` + `circuit-open` above a sustained threshold (e.g. 10 % over 15 minutes) signals provider trouble or a regression.
- **Ephemeral spawn failures.** Any `dispatch_reason=ephemeral-spawn-failed` points at RBAC, quota, or control-plane issues.
- **Heartbeat drift.** Daemons missing heartbeats past `HEARTBEAT_TIMEOUT_MS` get evicted; sustained eviction points at network or resource-floor issues.
- **OOM / crash loops.** Standard infra alerts. Durable idempotency means a restart will not replay a processed event.
- **Ship terminal-blocker rate.** A spike in `ship.intent.transition` events with `to_status:human_took_over` and `terminal_blocker_category:flake-cap` points at PR-flake regressions, not bot misbehaviour.
## Data fetching safety caps
`src/core/fetcher.ts` walks every `pageInfo { hasNextPage endCursor }` connection it receives via `octokit.graphql.paginate(...)`, so PRs/issues with hundreds of comments, reviews, inline review comments, or changed files are no longer silently truncated to the first 100. The four `MAX_FETCHED_*` env vars (see configuration.md) bound the merged result that reaches the agent prompt — they do not bound how much data is fetched and held in memory during pagination. The fetcher walks every page first, then trims the array to the most recent cap items; fetch-time memory is bounded by GitHub API limits (max items per connection), not these caps. Operators tuning for cost/latency should narrow entity selection (e.g. close noisy issues) rather than rely on the cap to cut request volume. A sketch of the pattern follows.
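In outline, the walk-then-trim behaviour looks roughly like this — a sketch assuming `@octokit/plugin-paginate-graphql`; the function, cap name, and query shape are illustrative, not lifted from `src/core/fetcher.ts`:

```typescript
import { Octokit } from "@octokit/core";
import { paginateGraphQL } from "@octokit/plugin-paginate-graphql";

const PagingOctokit = Octokit.plugin(paginateGraphQL);

// Illustrative cap name; the real vars are the four MAX_FETCHED_* values.
const MAX_FETCHED_COMMENTS = Number(process.env.MAX_FETCHED_COMMENTS ?? "500");

async function fetchAllComments(
  octokit: InstanceType<typeof PagingOctokit>,
  owner: string,
  repo: string,
  number: number,
) {
  // The plugin follows every pageInfo { hasNextPage endCursor } cursor first,
  // so walk-time memory is bounded by GitHub's limits, not by the cap.
  const result = await octokit.graphql.paginate<{
    repository: { pullRequest: { comments: { nodes: { body: string }[] } } };
  }>(
    `query ($owner: String!, $repo: String!, $number: Int!, $cursor: String) {
      repository(owner: $owner, name: $repo) {
        pullRequest(number: $number) {
          comments(first: 100, after: $cursor) {
            nodes { body }
            pageInfo { hasNextPage endCursor }
          }
        }
      }
    }`,
    { owner, repo, number },
  );

  const all = result.repository.pullRequest.comments.nodes;
  const truncated = all.length > MAX_FETCHED_COMMENTS; // flags FetchedData.truncated.comments
  const nodes = truncated ? all.slice(-MAX_FETCHED_COMMENTS) : all; // keep the most recent items
  return { nodes, truncated };
}
```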
When a cap fires, the fetcher emits a single structured `warn` line per affected connection and flags the connection on the returned `FetchedData` so downstream code can surface it:

```json
{
  "level": "warn",
  "msg": "Fetched comments exceeded MAX_FETCHED cap; truncating to 500",
  "connection": "comments",
  "fetched": 642,
  "cap": 500
}
```

`connection` is one of `comments`, `reviews`, `reviewComments`, `changedFiles`. The matching boolean lands on `FetchedData.truncated.<connection>` (`src/types.ts`).

`buildPrompt` (`src/core/prompt-builder.ts`) reads `data.truncated` and, when at least one flag is set, prepends a `WARNING: pre-fetched context is incomplete…` line to the agent's instructions naming the affected connections and instructing it to reach for the GitHub CLI / API directly when full context matters. Operators reading agent transcripts can grep for that banner to confirm a cap fired without re-querying logs.

Alert rule: any `level=warn` `msg~"exceeded MAX_FETCHED cap"` occurrence is interesting. A steady stream from the same repo over several deliveries usually means the cap should be raised for that tenant; a one-off on a notoriously huge PR is expected and not actionable.
## Health probes
| Path | Purpose |
|---|---|
| `/healthz` | Liveness — returns `200` once the HTTP server is bound. |
| `/readyz` | Readiness — `200` once config is validated and the data layer is reachable; flips to `503` on `SIGTERM`. |
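A minimal sketch of the readiness flip, assuming a Bun-style `fetch` handler; `dataLayerReachable` is a hypothetical stand-in for the real data-layer check:

```typescript
// Sketch of the probe semantics described above, not the app's actual routing.
let draining = false;
process.on("SIGTERM", () => {
  draining = true; // /readyz flips to 503 so traffic drains before shutdown
});

// Hypothetical probe, e.g. a SELECT 1 against DATABASE_URL.
async function dataLayerReachable(): Promise<boolean> {
  return true;
}

Bun.serve({
  port: 3000,
  async fetch(req) {
    const { pathname } = new URL(req.url);
    if (pathname === "/healthz") return new Response("ok"); // liveness: server is bound
    if (pathname === "/readyz") {
      const ready = !draining && (await dataLayerReachable());
      return new Response(ready ? "ready" : "not ready", { status: ready ? 200 : 503 });
    }
    return new Response("not found", { status: 404 });
  },
});
```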
## Supply-chain attestations
Note: Because the generated SBOM file exceeds 16 MB, SBOM attestations are temporarily disabled.

`docker-build.yml` publishes two attestation flavours per release tag — same image, different storage and verification surface. Operators investigating a CVE alert or auditing what shipped to production reach for these instead of re-running Trivy from scratch.
| Storage | Format | Bound to | How to inspect |
|---|---|---|---|
| Registry — OCI subject descriptor on the per-arch leaf manifest | SLSA v1 provenance + SPDX 2.3 SBOM (per arch) | Each per-arch image manifest (the BuildKit defaults). | `docker buildx imagetools inspect <ref> --format '{{ json .Provenance }}'` / `'{{ json .SBOM }}'` |
| Registry — Sigstore bundle attached to the merged manifest-list digest | SLSA v1 provenance + CycloneDX 1.5 SBOM (amd64 packages only — Syft scans the runner's native arch; arm64 audits must use the per-arch SPDX SBOM above) | The published tag (orchestrator + daemon variants). | `gh attestation verify oci://<ref> --repo chrisleekr/github-app-playground --predicate-type <slsa\|cyclonedx>` |
| GitHub Attestations API | Same Sigstore bundles as above | Same tag digest. | Repo Actions ▸ Attestations tab; surfaces under the GitHub Releases "Verified" badge once a tag is published. |
Docker Hub renders a "Build attestations" badge on the tag page once the Sigstore-signed flavour is detected. The full source/predicate of every signature is replayable via the Sigstore transparency log (Rekor) using the digest from `gh attestation verify`.

The scan job in `.github/workflows/docker-build.yml` calls `gh attestation verify` for both predicate types before running Trivy — a regression gate against silent attestation drops in any future refactor of the build / merge jobs. Consumer-side verification commands live in deployment.md.