## Summary
Adds an LLM-based observability layer that rates each pipeline run on data-quality / silent-failure dimensions, beyond what the success/failed status badge can tell you. Verdicts are produced by Claude Sonnet 4.6 reading the run record + CloudWatch logs, scored deterministically server-side from finding severities, and surfaced through a new dashboard + per-pipeline detail UI.
## What's new
Backend (`src/derive/observer/`):
- Sonnet 4.6 evaluator with a cacheable rubric (10 silent-failure categories tagged C/H/M/L)
- DDB storage with auto-create on first use (PK `run_id`, GSI `pipeline_id+observed_at`, on-demand billing)
- Per-pipeline observability flag (default off) for future Lambda auto-eval gating
- Ignore-finding feature: ignored items are passed back to the model so it stops re-flagging them
- Conditional log filter for outlier pipelines with multi-MB log volumes (the filter activates only when the raw log volume exceeds the cap)
- Score + verdict computed deterministically from findings: `C=−25, H=−10, M=−4, L=−1`; bands `≥90 OK, 60–89 WARN, <60 CRITICAL`
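A minimal sketch of that scoring, for reference. The `Finding` shape, the 100-point starting score, and the floor at 0 are assumptions; the severity weights and verdict bands are the ones listed above:

```ts
type Severity = "C" | "H" | "M" | "L";
type Verdict = "OK" | "WARN" | "CRITICAL";

// Penalty per finding severity, as listed above.
const PENALTY: Record<Severity, number> = { C: 25, H: 10, M: 4, L: 1 };

function scoreFindings(findings: { severity: Severity }[]): { score: number; verdict: Verdict } {
  // Assumed: start at 100, subtract penalties, floor at 0.
  const score = Math.max(0, findings.reduce((s, f) => s - PENALTY[f.severity], 100));
  const verdict = score >= 90 ? "OK" : score >= 60 ? "WARN" : "CRITICAL";
  return { score, verdict };
}
```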
tRPC — 9 new procedures: `getRunObservation`, `evaluateRun`, `getRecentObservations`, `getDashboardObservations`, `getPipelineConfig`, `setPipelineObservability`, `listIgnoredFindings`, `ignoreFinding`, `unignoreFinding`.
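For orientation, a hypothetical sketch of the router shape; the input schemas and the `getObservation`/`evaluate` helpers are illustrative, not the PR's actual code:

```ts
import { initTRPC } from "@trpc/server";
import { z } from "zod";

// Stand-ins for the real observer module functions.
declare function getObservation(runId: string): Promise<unknown>;
declare function evaluate(runId: string, force: boolean): Promise<unknown>;

const t = initTRPC.create();

export const observerRouter = t.router({
  getRunObservation: t.procedure
    .input(z.object({ runId: z.string() }))
    .query(({ input }) => getObservation(input.runId)),
  evaluateRun: t.procedure
    .input(z.object({ runId: z.string(), force: z.boolean().default(false) }))
    .mutation(({ input }) => evaluate(input.runId, input.force)),
  // ...plus getRecentObservations, getDashboardObservations, getPipelineConfig,
  // setPipelineObservability, listIgnoredFindings, ignoreFinding, unignoreFinding
});
```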
UI:
- `/pipelines/dashboard` — eagle-eye view (status tiles, at-risk pipelines, recently evaluated)
- `/pipelines/all` — full clean list, every row clickable, status-page sparklines per row
- `/pipelines/[id]` — split-pane master-detail with full-bleed layout (rail on left, run history on right). Clicking a run opens a slide-over sheet with Observations / Output / Logs tabs
- Trust chip with status-page-style sparkline of recent verdicts (outlined empty slots when no data yet)
- Findings cards with severity stripe + structured `Evidence` / `Recommendation` sections + per-finding Ignore action
- Sidebar gets separate Dashboard + All Pipelines nav items
<img width="1310" height="889" alt="Screenshot 2026-05-01 at 7 58 12 PM" src="https://github.com/user-attachments/assets/4a3bc1e6-16b9-456e-85ea-3aa66a885cc5" />
<img width="1009" height="425" alt="Screenshot 2026-05-01 at 7 46 37 PM" src="https://github.com/user-attachments/assets/ed4fd5bf-051c-4767-9860-916479db049a" />
<img width="1308" height="889" alt="Screenshot 2026-05-01 at 7 46 29 PM" src="https://github.com/user-attachments/assets/28eae87c-0c84-49a1-bc44-b2995ff06b3a" />
CLI: `pnpm observer:showcase <run-id>` for ad-hoc evaluation.
Tests: 5 unit tests covering rubric content + Zod schema validation.
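As a flavor of the schema tests, a minimal vitest case; the `ObservationSchema` name and fields here are assumptions, not the PR's exact schema:

```ts
import { expect, it } from "vitest";
import { z } from "zod";

// Hypothetical observation schema in the spirit of the PR's Zod validation.
const ObservationSchema = z.object({
  run_id: z.string(),
  verdict: z.enum(["OK", "WARN", "CRITICAL"]),
  score: z.number().min(0).max(100),
  findings: z.array(z.object({ severity: z.enum(["C", "H", "M", "L"]) })),
});

it("rejects an out-of-band verdict", () => {
  const bad = { run_id: "r1", verdict: "MEH", score: 50, findings: [] };
  expect(ObservationSchema.safeParse(bad).success).toBe(false);
});
```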
## Behavior notes
- Auto-evaluation never fires from the UI. Opening a run with no cached observation shows a clean empty state with an explicit "Evaluate this run" button.
- The per-pipeline `Observe` toggle gates future Lambda-driven post-completion auto-evaluation. Manual UI buttons always work regardless of the toggle.
- Observations are cached forever in DDB by `run_id` (runs are immutable once finished); "Re-evaluate" forces a fresh call. See the sketch after this list.
- A failed run is not itself a finding: clean failures already alarm via the existing pathway. Findings target data integrity (the silent-failure surface).
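A read-through-cache sketch of that behavior; the function names and `Observation` shape are hypothetical:

```ts
type Observation = { run_id: string; score: number; verdict: string };

// Stand-ins for the real storage and evaluator functions.
declare function loadObservation(runId: string): Promise<Observation | null>;
declare function evaluateWithModel(runId: string): Promise<Observation>;
declare function saveObservation(obs: Observation): Promise<void>;

async function getOrEvaluate(runId: string, force = false): Promise<Observation> {
  if (!force) {
    const cached = await loadObservation(runId); // DDB GetItem by run_id
    if (cached) return cached; // runs are immutable, so the cache never expires
  }
  const fresh = await evaluateWithModel(runId); // model call + deterministic scoring
  await saveObservation(fresh); // PutItem; "Re-evaluate" passes force=true
  return fresh;
}
```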
## Setup
- New env vars in `.env.example`:
  - `ANTHROPIC_API_KEY` — required to evaluate; if missing, evaluations return UNAVAILABLE rather than failing the page
  - `SURTR_OBSERVATIONS_TABLE` — defaults to `surtr_pipeline_observations`
- DDB table is auto-created on first use — no manual provisioning. The IAM principal needs `dynamodb:CreateTable`, `DescribeTable`, `GetItem`, `PutItem`, `Query`.
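A sketch of the auto-create path using the AWS SDK v3. The GSI index name and attribute types are assumptions; the key schema and billing mode follow the description above:

```ts
import {
  CreateTableCommand,
  DescribeTableCommand,
  DynamoDBClient,
  ResourceNotFoundException,
  waitUntilTableExists,
} from "@aws-sdk/client-dynamodb";

const TABLE = process.env.SURTR_OBSERVATIONS_TABLE ?? "surtr_pipeline_observations";
const ddb = new DynamoDBClient({});

// Create the table on first use if it doesn't exist yet.
async function ensureTable(): Promise<void> {
  try {
    await ddb.send(new DescribeTableCommand({ TableName: TABLE }));
    return; // table already exists
  } catch (err) {
    if (!(err instanceof ResourceNotFoundException)) throw err;
  }
  await ddb.send(
    new CreateTableCommand({
      TableName: TABLE,
      BillingMode: "PAY_PER_REQUEST", // on-demand billing
      AttributeDefinitions: [
        { AttributeName: "run_id", AttributeType: "S" },
        { AttributeName: "pipeline_id", AttributeType: "S" },
        { AttributeName: "observed_at", AttributeType: "S" }, // type is an assumption
      ],
      KeySchema: [{ AttributeName: "run_id", KeyType: "HASH" }],
      GlobalSecondaryIndexes: [
        {
          IndexName: "pipeline_id-observed_at-index", // index name is an assumption
          KeySchema: [
            { AttributeName: "pipeline_id", KeyType: "HASH" },
            { AttributeName: "observed_at", KeyType: "RANGE" },
          ],
          Projection: { ProjectionType: "ALL" },
        },
      ],
    }),
  );
  // Block until the table is ACTIVE before the first write.
  await waitUntilTableExists({ client: ddb, maxWaitTime: 60 }, { TableName: TABLE });
}
```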
## Cost & performance
- Per evaluation: ~3K cached system tokens + 2K–8K user tokens, ~500–2K output tokens
- Cached call: roughly $0.005–$0.02; first call (cache miss): ~$0.02–$0.05
- Sonnet 4.6 prompt caching verified working (`cache_read_tokens=3342` after the first call in the showcase)
- Wall-clock: 5–15s per evaluation
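For context, a sketch of how a rubric gets prompt-cached with the Anthropic SDK. `RUBRIC` and the model id are placeholders; marking the system block `ephemeral` is what makes it cacheable, so only the per-run user tokens are paid at full price on warm calls:

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

declare const RUBRIC: string; // the ~3K-token system rubric

async function evaluateRunContext(runContext: string) {
  const res = await client.messages.create({
    model: "claude-sonnet-4-6", // placeholder id for the Sonnet 4.6 evaluator
    max_tokens: 2048,
    system: [
      // The cacheable rubric block; reused across evaluations.
      { type: "text", text: RUBRIC, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: runContext }],
  });
  console.log(res.usage); // cache_read_input_tokens > 0 on warm calls
  return res;
}
```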
## Test plan
- [x] Unit tests pass (`pnpm vitest run test/derive/observer.test.ts`)
- [x] Lint clean on `src/derive/observer`
- [x] CLI showcase runs end-to-end against 3 real pipelines (azure-ai-spend, quickbooks-expense-sync, hubspot-sync) and produces expected verdicts
- [x] DDB table auto-creates on first call
- [x] Prompt cache engages after first evaluation
- [ ] Smoke test in dev: open dashboard, navigate to a pipeline detail, click a run, click "Evaluate this run", verify findings render and the trust chip matches the verdict
- [ ] Verify the `Observe` toggle persists across page reloads
- [ ] Verify the Ignore-finding flow: ignore one finding, re-evaluate, confirm the model doesn't re-flag it
## Out of scope (not in this PR)
- Wiring the evaluator as a Lambda + Step Function step after `update-run-success` (next step for true post-completion auto-eval)
- DDB stream → SES/Slack alerts on `verdict=CRITICAL`
- Backfill — explicitly skipped; new invocations only
🤖 Generated with [Claude Code](https://claude.com/claude-code)