## Summary
PR 2 of pipeline-triage-agent. Stacked on top of #109 (PR 1) — review after that lands, or set base = main once #109 is merged.
Adds the dispatcher Lambda that subscribes to both SNS topics, runs the full dedupe ladder, and triggers a workflow_dispatch on a new stub triage-agent.yml. The stub workflow opens a labeled Issue with the raw dispatch inputs — enough to prove the end-to-end SNS → Lambda → GitHub path works without yet invoking the real agent (that's PR 3).
Self-containment: both kill switches default OFF (TRIAGE_DISPATCHER_ENABLED env var on the Lambda; TRIAGE_AGENT_RUN_ENABLED repo variable on the workflow). Deploying this PR produces zero Issues, zero GH calls, zero S3 writes, zero model spend.
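A minimal sketch of a fail-closed kill switch, assuming the dispatcher treats anything other than the literal string true as off. The exact parsing in handler.py may differ, and the handler's return shape here is hypothetical.

```python
import os
from typing import Mapping


def flag_enabled(name: str, environ: Mapping[str, str] = os.environ) -> bool:
    # Fail closed: only an explicit "true" (any case, surrounding whitespace
    # ignored) turns the flag on. Missing, empty, "false", "1", or a typo is OFF.
    return environ.get(name, "").strip().lower() == "true"


def handler(event: dict, context: object = None,
            environ: Mapping[str, str] = os.environ) -> dict:
    # The kill switch is checked before any DDB, S3, Secrets Manager, CW Logs,
    # or GitHub call, so a disabled Lambda has no side effects.
    if not flag_enabled("TRIAGE_DISPATCHER_ENABLED", environ):
        return {"status": "disabled", "records": 0}
    # Hypothetical: the real per-record processing would run here.
    return {"status": "enabled", "records": len(event.get("Records", []))}
```

Failing closed means a missing or mistyped variable in any environment leaves the Lambda inert, which is what makes "zero Issues, zero GH calls" hold on deploy.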
## What this adds
### Lambda (pipelines/cdk/lambdas/triage-agent-dispatcher/)
- src/handler.py — full dispatcher logic (~430 lines). Honors the kill switch, normalizes both event kinds, computes the signature, and runs the DDB dedupe ladder (open → comment, closed → regression, miss → dispatch, entry ≥7d old → treated as a miss per D16). Enforces the hourly rate limit (D10), snapshots the run record plus the last 200 CW log lines to S3, presigns a 10-min GET URL, and POSTs workflow_dispatch. Exceptions are caught per record, so one bad message doesn't poison the batch.
- src/signature.py — Python port of Surtr/src/lib/triage/signature.ts with the same regex set and ordering.
- src/requirements.txt — boto3>=1.37.0 (per the CLAUDE.md bundling rule).
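The dedupe ladder, rate limit, and per-record isolation can be sketched as pure functions. This is an illustration under stated assumptions, not the handler's code: DedupeEntry is a hypothetical row shape, staleness is assumed to override the open/closed state, the hourly cap is assumed to gate only the miss branch, and each branch returns a label instead of doing the real work.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, Optional

STALE_AFTER = timedelta(days=7)  # D16: an entry this old is treated as a miss


@dataclass
class DedupeEntry:
    # Hypothetical shape of a row in the triage DDB table.
    issue_state: str  # "open" or "closed"
    last_seen: datetime


def dedupe_action(entry: Optional[DedupeEntry], now: datetime) -> str:
    """Map a DDB lookup result onto the dedupe ladder."""
    if entry is None or now - entry.last_seen >= STALE_AFTER:
        return "dispatch"    # miss, or stale hit treated as a miss
    if entry.issue_state == "open":
        return "comment"     # record the recurrence on the existing Issue
    return "regression"      # signature reappeared after its Issue was closed


def plan(entry: Optional[DedupeEntry], now: datetime,
         dispatches_this_hour: int, hourly_limit: int) -> str:
    # D10 (assumed scope): the hourly cap only short-circuits new dispatches.
    action = dedupe_action(entry, now)
    if action == "dispatch" and dispatches_this_hour >= hourly_limit:
        return "rate_limited"
    return action


def process_batch(records: Iterable, handle_one: Callable) -> dict:
    """Run handle_one per record; one failure doesn't abort the rest."""
    ok, failed = 0, 0
    for record in records:
        try:
            handle_one(record)
            ok += 1
        except Exception:
            failed += 1  # a real handler would also log the traceback
    return {"ok": ok, "failed": failed}
```

Keeping the ladder as a pure function of (entry, now) is what lets the four dedupe branches be tested without DDB.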
### Tests (pipelines/cdk/lambdas/tests/)
- test_triage_dispatcher_signature.py — 29 tests including cross-language parity checks that lock the sha256 input format against silent drift between the TS and Python implementations.
- test_triage_dispatcher_handler.py — 13 tests covering: kill switch, routing/unknown payloads, all four dedupe branches (miss / open / closed / 7d-stale), rate limit short-circuit, observer-finding severity sort, batch-with-one-bad-record.
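The parity tests lock the exact bytes fed to sha256, not just the resulting hex, so drift in either implementation shows up as a readable diff. The stand-in below borrows the failure_signature name from the test plan, but its normalizers and its newline-joined input format are assumptions for illustration; the real regex set and ordering live in signature.ts.

```python
import hashlib
import re

# Illustrative normalizers only. Order matters: UUIDs and 0x-hex are replaced
# before the bare-number rule can consume their digits.
_NORMALIZERS = [
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I), "<uuid>"),
    (re.compile(r"0x[0-9a-f]+", re.I), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
    (re.compile(r"\s+"), " "),
]


def normalize(message: str) -> str:
    for pattern, repl in _NORMALIZERS:
        message = pattern.sub(repl, message)
    return message.strip()


def signature_input(pipeline_id: str, message: str) -> bytes:
    # Hypothetical input format: "<pipeline_id>\n<normalized message>" in UTF-8.
    return f"{pipeline_id}\n{normalize(message)}".encode("utf-8")


def failure_signature(pipeline_id: str, message: str) -> str:
    return hashlib.sha256(signature_input(pipeline_id, message)).hexdigest()
```

A parity test in this style asserts on signature_input(...) as a literal byte string in both languages, then on the hex digest.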
### CDK (pipelines/cdk/lib/pipeline-shared-stack.ts)
- PythonFunction for the dispatcher (Python 3.11, 60 s, 256 MB).
- New S3 bucket surtr-triage-context-{env}-{account} — block-public, S3-managed encryption, SSL-enforced, 30-day lifecycle, RemovalPolicy.RETAIN.
- IAM: scoped to what's strictly needed — DDB R/W on the triage table only, S3 PutObject + GetObject on the new bucket only, Secrets Manager read on SURTR_PROD_KEYS only, CW Logs FilterLogEvents scoped to /klair/pipelines/{env}/*. No iam:*, no lambda:UpdateFunctionCode, no wildcard secret access.
- SNS subscriptions to both topics with notification_type filter policies so each topic only delivers the events it should.
- All flags default to "off"/"false" so the deployed Lambda is inert.
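To illustrate how the notification_type filter policies split traffic between the two topics, here is a simplified matcher. The topic keys and notification_type values are hypothetical, and the matcher covers only exact string matching; real SNS filter policies also support prefix, numeric, anything-but, and exists operators.

```python
from typing import Mapping

# Hypothetical topic keys and notification_type values.
FILTER_POLICIES = {
    "pipeline_alerts": {"notification_type": ["pipeline_failure"]},
    "surtr_triage": {"notification_type": ["observer_finding"]},
}


def matches(policy: Mapping[str, list], message_attributes: Mapping[str, dict]) -> bool:
    """Exact-string subset of SNS filtering: every policy key must be present
    as a String attribute whose value is one of the allowed strings."""
    for key, allowed in policy.items():
        attr = message_attributes.get(key)
        if attr is None or attr.get("DataType") != "String":
            return False
        if attr.get("StringValue") not in allowed:
            return False
    return True
```

The attribute dicts use the DataType/StringValue shape that SNS Publish accepts, so a publisher that omits notification_type is filtered out of both subscriptions.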
### Workflow (.github/workflows/triage-agent.yml)
- Stub workflow with typed workflow_dispatch inputs matching the dispatcher payload.
- Kill switch via repo variable TRIAGE_AGENT_RUN_ENABLED (default missing/false).
- All inputs flow through env: blocks, never inline ${{ }} inside run: — mitigates the Comment-and-Control prompt-injection class per D21.
- Masks the pre-signed URL with ::add-mask:: since the signed-key suffix would otherwise leak in logs.
- Opens an Issue labeled agent-triage + agent-triage:<pipeline_id> + stub with the raw inputs. No agent, no PR. PR 3 replaces the body with the real diagnose-then-fix flow.
- concurrency: triage-agent-${{ inputs.signature }} so simultaneous dispatches for the same signature don't open duplicates (defense-in-depth on top of DDB dedupe).
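The dispatcher's side of this contract is a single POST to GitHub's workflow_dispatch endpoint. A stdlib-only sketch, assuming a token already fetched from Secrets Manager; the owner, repo, ref, and any input names other than signature and pipeline_id are hypothetical, and every input value is stringified for simplicity.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner: str, repo: str, token: str, ref: str,
                           inputs: dict) -> urllib.request.Request:
    """Build (but don't send) a workflow_dispatch POST for triage-agent.yml."""
    body = {"ref": ref, "inputs": {k: str(v) for k, v in inputs.items()}}
    return urllib.request.Request(
        # The workflow file name is accepted in place of a numeric workflow ID.
        url=f"{GITHUB_API}/repos/{owner}/{repo}/actions/workflows/triage-agent.yml/dispatches",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "X-GitHub-Api-Version": "2022-11-28",
            "Content-Type": "application/json",
        },
    )


def send_dispatch(req: urllib.request.Request, timeout: float = 10.0) -> None:
    # urlopen raises urllib.error.HTTPError on 4xx/5xx; any 2xx counts as success.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        if not 200 <= resp.status < 300:
            raise RuntimeError(f"unexpected status {resp.status}")
```

Separating request construction from sending keeps the payload shape unit-testable without network access.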
## Test plan
- [x] python3 -m pytest tests/test_triage_dispatcher_signature.py tests/test_triage_dispatcher_handler.py — 42/42 pass
- [x] Cross-language parity locked: failure_signature("p", "KeyError: 'tenant_id'") produces the same hex in both Python and TS
- [x] cd pipelines/cdk && npm run build — clean tsc
- [x] YAML parses (python3 -c "yaml.safe_load(...)")
- [ ] CI synth on a docker-enabled runner
- [ ] After merge, in dev: set SURTR_TRIAGE_TOPIC_ARN on the Surtr ECS task + TRIAGE_DISPATCHER_ENABLED=true on the Lambda + TRIAGE_AGENT_RUN_ENABLED=true repo var → trigger a real failure → confirm a stub Issue appears.
## Production-deploy effect on merge
- New empty S3 bucket (no objects)
- New Lambda subscribed to both SNS topics, but TRIAGE_DISPATCHER_ENABLED=false → returns immediately on every event
- New workflow file, kill switch off → never proceeds past the first step
## Rollout (after merge)
| Step | Flag | Effect |
|---|---|---|
| 1 | SURTR_TRIAGE_TOPIC_ARN on Surtr ECS (PR 1) | Observer publishes to SNS |
| 2 | TRIAGE_DISPATCHER_ENABLED=true on dispatcher Lambda | Lambda processes events, snapshots context, calls workflow_dispatch |
| 3 | TRIAGE_AGENT_RUN_ENABLED=true repo var | Stub workflow opens Issues |
To disable, turn the flags off in reverse order (step 3 first).
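The three flags gate successive layers of the SNS → Lambda → GitHub path, so a failure only travels as far as the first flag that is off. A sketch of that layering, with a hypothetical function name and return strings:

```python
from typing import Optional


def furthest_stage(topic_arn: Optional[str], dispatcher_enabled: bool,
                   run_enabled: bool) -> str:
    """How far a pipeline failure travels for a given flag combination."""
    if not topic_arn:
        return "no SNS publish"                     # step 1 off
    if not dispatcher_enabled:
        return "SNS only; Lambda returns immediately"  # step 2 off
    if not run_enabled:
        return "workflow_dispatch sent; workflow stops at kill switch"  # step 3 off
    return "stub Issue opened"
```

Because each layer checks its own flag, turning step 3 off first stops new Issues immediately while leaving the upstream layers observable.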
## Out of scope (covered by later PRs)
- Real agent invocation (Claude Code + Codex), AGENTS.md, fix-stage with path allowlist + test gate — PR 3
- GChat outcome card + 15-min reconciler + optional Surtr UI tile — PR 4
🤖 Generated with [Claude Code](https://claude.com/claude-code)