## Summary
- Root cause (auth): enrichment.py constructed Anthropic() with no auth wiring — no ANTHROPIC_API_KEY env var, no Secrets Manager fetch — so every Haiku call in the last run failed with *"Could not resolve authentication method. Expected either api_key or auth_token to be set"*. 536/536 records failed.
- Why it reported SUCCESS: Three nested try/except wrappers (_enrich_one → _enrich_all → handler.py) each caught the auth error and continued. The handler's `# base table unaffected` except block swallowed the error without distinguishing "a few transient failures" from a "100% wipeout."
- Fix 1 — wire the key: Add ANTHROPIC_API_KEY_SECRET_ARN env var + IAM secretsmanager:GetSecretValue in pipeline.json, reusing the existing klair/anthropic-api-key-yFl13y secret that netsuite-gl-detail already consumes. New get_anthropic_api_key() helper in aws_secrets.py (env var first for local dev, then Secrets Manager). RCAEnricher.__init__ now passes api_key= explicitly.
- Fix 2 — fail loudly: In handler.py, raise RuntimeError when len(documents) > 0 and enriched == 0. The exception propagates out of main.py, ECS exits non-zero, the run is marked FAILED, and alerting fires. Partial failures stay non-fatal — the base table is still useful and a few transient API errors shouldn't page anyone.
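The lookup order described in Fix 1 (env var first, then Secrets Manager) can be sketched roughly as below. This is a minimal illustration, not the code in aws_secrets.py: the `env` and `secrets_client` parameters are hypothetical injection points added so the sketch runs without AWS access, and it assumes the secret is stored as a plain string rather than JSON.

```python
import os
from typing import Any, Mapping, Optional


def get_anthropic_api_key(
    env: Optional[Mapping[str, str]] = None,
    secrets_client: Any = None,
) -> str:
    """Resolve the Anthropic API key: env var first, then Secrets Manager.

    `env` and `secrets_client` are hypothetical parameters for testability;
    the real helper's signature may differ.
    """
    env = os.environ if env is None else env

    # 1. Local development: a directly exported key takes precedence.
    direct = env.get("ANTHROPIC_API_KEY", "").strip()
    if direct:
        return direct

    # 2. Deployed task: resolve the key through the secret's ARN.
    secret_arn = env.get("ANTHROPIC_API_KEY_SECRET_ARN", "").strip()
    if not secret_arn:
        raise RuntimeError(
            "Set ANTHROPIC_API_KEY or ANTHROPIC_API_KEY_SECRET_ARN"
        )

    if secrets_client is None:
        import boto3  # imported lazily so local runs do not need the AWS SDK

        secrets_client = boto3.client("secretsmanager")

    response = secrets_client.get_secret_value(SecretId=secret_arn)

    # Assumption: the secret is a plain string, not a JSON document.
    secret = (response.get("SecretString") or "").strip()
    if not secret:
        raise RuntimeError(f"Secret {secret_arn} is empty")
    return secret
```

The caller would then pass the result explicitly, e.g. `Anthropic(api_key=get_anthropic_api_key())`, instead of relying on the SDK's implicit environment lookup.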
## Test plan
- [x] pytest tests/ — 70 passed (7 new):
  - test_aws_secrets.py: get_anthropic_api_key env-var path, Secrets Manager path, env precedence, missing config raises, empty secret raises
  - test_enrichment.py: Anthropic constructed with api_key= arg
  - test_handler.py: raises on total failure, raises on enricher-constructor blowup (missing key), partial failure does NOT raise, empty DB does NOT raise
- [x] ruff check + ruff format clean on all modified files
- [x] pipeline.json validates as JSON
- [x] After merge & deploy: trigger the pipeline manually and confirm Haiku enrichment loads rows into mart_other.rca_incidents_enriched — see Demo below
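The handler cases in the test plan (total failure raises, partial failure does not, empty DB does not) follow from the Fix 2 condition `len(documents) > 0 and enriched == 0`. A minimal sketch of that guard is shown here; the function name `assert_not_total_failure` and its signature are hypothetical, and the actual check lives inline in handler.py.

```python
from typing import Sequence


def assert_not_total_failure(
    documents: Sequence[object], enriched: int, failed: int
) -> None:
    """Raise only when there was work to do and none of it succeeded.

    Hypothetical helper illustrating the Fix 2 condition; partial failures
    and empty inputs pass through without raising.
    """
    if len(documents) > 0 and enriched == 0:
        raise RuntimeError(
            f"Enrichment failed for all {len(documents)} documents "
            f"({failed} failed, 0 enriched)"
        )
```

With the numbers from the failing run (536 documents, 0 enriched, 536 failed) this raises, whereas a run with even one enriched document, or with no documents at all, returns normally.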
## Demo — end-to-end verification on prod
Deployed this PR's branch directly to prod via cdk deploy Pipeline-notion-rca-hub-sync-prod --exclusively and ran a test-mode execution (50 pages).
### 1. Deploy landed cleanly
- Stack: Pipeline-notion-rca-hub-sync-prod → UPDATE_COMPLETE (4/6 resources changed)
- Task definition: new revision 7 registered, old revision deregistered
- Image: cdk-hnb659fds-container-assets-...:11c02030... (PR's fix code)
- Env var added: ANTHROPIC_API_KEY_SECRET_ARN=arn:aws:secretsmanager:us-east-1:479395885256:secret:klair/anthropic-api-key-yFl13y
- IAM updated: task role's inline policy now allows secretsmanager:GetSecretValue on both the existing Notion secret and the new Anthropic secret
### 2. Step Function ran green
- Execution: pipeline-notion-rca-hub-sync-prod:5fd4b901-a685-489f-b589-4a51010e242b
- Input: {"run_id":"manual-test-...", "params":{"test_mode":true}}
- Status: SUCCEEDED
- Duration: 1m 40s (ECS task: 34s)
### 3. CloudWatch logs prove the new auth path executed
Key lines from log stream pipeline/pipeline-notion-rca-hub-sync/f5d981e63c39465daea2f09d9d90fa7b:
10:37:58 root TEST MODE active: limited to 50 pages
10:38:07 redshift_loader Loaded 50 rows into mart_other.rca_incidents
10:38:07 aws_secrets Retrieving Anthropic API key from secret: arn:aws:secretsmanager:us-east-1:479395885256:secret:klair/anthropic-api-key-yFl13y
10:38:14 httpx POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
... (× 50 — every Haiku call returned 200)
10:38:26 enrichment Loaded 50 rows into mart_other.rca_incidents_enriched
10:38:26 enrichment Enrichment complete: 50 enriched, 0 failed, 50 loaded
10:38:26 root Pipeline completed: 50 transformed, 50 loaded to Redshift, 50 enriched, 33688ms
Final summary emitted to Step Functions:
{"run_id": "...",
"pages_fetched": 537,
"pages_transformed": 50,
"redshift_rows_loaded": 50,
"enrichment": {"enriched": 50, "failed": 0, "loaded": 50, "errors": []},
"errors": []
}
Contrast with the failing run that triggered this PR:
- Before: "Could not resolve authentication method..." × 536 and {"enriched": 0, "failed": 536, "loaded": 0}, yet the pipeline reported SUCCESS
- After: 50/50 Haiku calls returned 200 OK, full enrichment
### 4. Redshift confirms rows landed
SELECT COUNT(*) AS total_rows,
COUNT(DISTINCT notion_page_id) AS distinct_pages,
SUM(CASE WHEN severity IS NOT NULL THEN 1 ELSE 0 END) AS with_severity,
SUM(CASE WHEN root_cause_category IS NOT NULL THEN 1 ELSE 0 END) AS with_root_cause,
SUM(CASE WHEN incident_start_date IS NOT NULL THEN 1 ELSE 0 END) AS with_start_date,
SUM(CASE WHEN one_line_summary IS NOT NULL THEN 1 ELSE 0 END) AS with_summary
FROM mart_other.rca_incidents_enriched;
| total_rows | distinct_pages | with_severity | with_root_cause | with_start_date | with_summary |
|------------|----------------|---------------|-----------------|-----------------|--------------|
| 50 | 50 | 50 | 50 | 50 | 50 |
Sample rows (Haiku extractions look sensible):
| doc_name | aws_class | severity | root_cause | start_date | summary |
|---|---|---|---|---|---|
| Khoros Service Issue LIA-23857 | Khoros Product | critical | database | 2026-04-27 | CDN cookie caching and unbounded gallery queries saturated Aurora cluster |
| RCA for Khoros LIA-23911 | Khoros Product | high | configuration | 2026-04-28 | Faulty SVN config include order removed theme plugin, breaking page rendering |
| RCA for Khoros AURORA-1211 | Khoros Product | critical | capacity | 2026-04-24 | Insufficient Aurora enduser capacity caused 10-hour outage with 503 errors |
| RCA for Khoros LIA-23520 | Khoros Product | critical | capacity | 2026-04-23 | Sephora login surge saturated shared Aurora writer, causing ServiceNow 503 |
| RCA for Kandy KANDY-787 | Kandy Product | critical | infrastructure | 2026-04-22 | DRBD/NFS storage gaps during failover caused 2h17m call-recording playback outage |
The scheduled cron (cron(0 6 * * ? *)) will pick up the full 537-page set on its next run.
🤖 Generated with [Claude Code](https://claude.com/claude-code)