## Demo
What this proves: the single-mover CE drill works end-to-end against live systems, all from real captured output. P1 hits live Redshift; P2 hits live STS + Cost Explorer + Redshift; P3/P4 are pure-local. Specifically:
- The account's master payer resolves from the live warehouse.
- The host → bridge → ESW-CO-ReadOnly-P2 assume-role chain mints a real cross-account CE client.
- A real service-scoped explain() (5 paginated GetCostAndUsage calls, 9.9s) root-causes a −$19.4k QoQ mover down to the exact usage type, with CE totals reconciling to the cent against the warehouse.
- The shape classifier is correct per branch.
- The at-risk dependents (refactored docker/k8s cost service + the four endpoints migrated to require_all_bu_access) still pass.
### Backend — live master-payer lookup (Redshift, read-only)
Constructs the real RedshiftHandler + MoverExplainService (CE provider stubbed — _lookup_master_account never touches it) and calls the real _lookup_master_account directly, bypassing the router/auth. The master_account_id column read here exists on the live *_adjusted view but is absent from the repo DDL — this run is the proof it resolves.
$ uv run python /tmp/demo-klair2861-p1-master-lookup.py
--- 820054669588 (known account -> Umbrella (Khoros) 764203154397) ---
resolved master = 764203154397 consolidated_payer=False
--- 646253092271 (top VDI-payer account -> expect 572481847476) ---
resolved master = 572481847476 consolidated_payer=False
--- 540235812892 (consolidated payer linked acct -> expect 540235812892 (EY Master 1)) ---
resolved master = 540235812892 consolidated_payer=True
--- 000000000000 (unknown account -> expect raise) ---
RAISED HTTPException(status=404): No net-amortized spend found for account 000000000000
Three real accounts resolve to the right payer — including the consolidated-payer path (540235812892 → consolidated_payer=True, which drives the LINKED_ACCOUNT-omitted CE filter) — and an unknown account propagates HTTPException(404) rather than silently returning empty.
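
The consolidated-payer branch can be pictured as a small filter builder. This is a minimal sketch, not the code in MoverExplainService: `build_ce_filter` and `CONSOLIDATED_PAYERS` are hypothetical names, and the payer IDs are the two listed under Implementation. The expression shape (`And` / `Dimensions` with `Key` and `Values`) follows the Cost Explorer GetCostAndUsage filter syntax.

```python
from typing import Any

# Assumption: the two consolidated payers named in this PR's Implementation notes.
CONSOLIDATED_PAYERS = {"540235812892", "637422716207"}


def build_ce_filter(account_id: str, master_id: str, service: str) -> dict[str, Any]:
    """Build a GetCostAndUsage Filter expression for one account and one service."""
    service_clause = {"Dimensions": {"Key": "SERVICE", "Values": [service]}}
    if master_id in CONSOLIDATED_PAYERS:
        # Consolidated payer: the LINKED_ACCOUNT clause is omitted entirely.
        return service_clause
    account_clause = {"Dimensions": {"Key": "LINKED_ACCOUNT", "Values": [account_id]}}
    return {"And": [account_clause, service_clause]}
```

For 646253092271 under 572481847476 this yields an `And` of both clauses; for 540235812892 it yields the SERVICE clause alone.
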
### Backend — live cross-account CE drill (STS + Cost Explorer + Redshift, all read-only)
The real CostExplorerClientProvider() + real RedshiftHandler + MoverExplainService.explain() — service-scoped drill for account 646253092271 (under VDI payer 572481847476, the only payer the B0 bridge policy currently admits) on its largest QoQ mover. The target service was picked by a read-only warehouse query (largest |QoQ Δ| with >$1k spend in both quarters): Amazon Virtual Private Cloud, Q4 $75,329 → Q1 $55,962. The only instrumentation is a thin pass-through proxy counting GetCostAndUsage calls; every hop is the real code path.
$ uv run python /tmp/demo-klair2861-p2-live-drill.py
INFO - Refreshing Cost Explorer client via assume-role chain
arn:aws:iam::479395885256:role/klair-api-cost-explorer-role
-> arn:aws:iam::572481847476:role/ESW-CO-ReadOnly-P2
resolved master = 572481847476
account / service = 646253092271 / Amazon Virtual Private Cloud
CE GetCostAndUsage = 5 calls
wall-clock = 9.9s
Top usage_type drivers (Qa=2025-Q4, Qb=2026-Q1, sorted by |diff|):
USE1-TransitGateway-Bytes|us-east-1 qA= 33,458.04 qB= 15,280.62 diff= -18,177.42
USE1-TransitGateway-Hours|us-east-1 qA= 13,739.07 qB= 13,437.87 diff= -301.21
VPN-Usage-Hours:ipsec.1|us-east-1 qA= 13,195.01 qB= 12,908.16 diff= -286.85
DataTransfer-Regional-Bytes|us-east-1 qA= 233.41 qB= 0.20 diff= -233.22
Region split:
us-east-1 qA= 67,157.43 qB= 48,041.33 diff= -19,116.10
eu-west-1 qA= 8,171.91 qB= 7,921.05 diff= -250.86
Purchase mix:
On Demand Instances qA= 75,329.33 qB= 55,962.37 diff= -19,366.96
Daily series = 182 days (2025-10-01 .. 2026-03-31)
Detected shape = kind=steady_ramp direction=down step_date=None burst_days=None
What this shows, beyond "it ran":
- Root cause found: one usage type — USE1-TransitGateway-Bytes (−$18,177) — explains ~94% of the −$19,367 QoQ mover; shape steady_ramp down says it was a gradual decline across the window, not a one-day event.
- CE reconciles with the warehouse to the cent: the drill's purchase-mix totals (qA $75,329.33 / qB $55,962.37) exactly match the warehouse pre-check query on aws_spend_net_amortized_costs_adjusted — the NetAmortizedCost metric pin is doing its job.
- Pagination is real: 5 GetCostAndUsage calls for 3 drills means paginate_cost_and_usage followed a live NextPageToken (the DAILY USAGE_TYPE+REGION drill spans 182 days × many groups).
- The assume-role chain is the deployed one: the logged hop is host → bridge klair-api-cost-explorer-role (479395885256) → ESW-CO-ReadOnly-P2 in the resolved master 572481847476.
_Output trimmed only of connection-pool log lines, two sub-$230 usage rows (EU-TransitGateway-Bytes −$221.26, USE1-PublicIPv4:InUseAddress −$126.69), and an all-zero NoRegion row._
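
For readers unfamiliar with how a call count above the drill count implies pagination, here is a rough sketch of a NextPageToken loop plus a counting pass-through. The helper name matches the one in this PR, but the body is an assumption about its shape; `CountingProxy` and `FakeCE` are hypothetical stand-ins so the example runs without AWS access.

```python
from typing import Any


def paginate_cost_and_usage(client: Any, **params: Any) -> list[dict]:
    """Call get_cost_and_usage until no NextPageToken comes back; concatenate ResultsByTime."""
    results: list[dict] = []
    token = None
    while True:
        kwargs = dict(params)
        if token:
            kwargs["NextPageToken"] = token
        page = client.get_cost_and_usage(**kwargs)
        results.extend(page.get("ResultsByTime", []))
        token = page.get("NextPageToken")
        if not token:
            return results


class CountingProxy:
    """Pass-through wrapper that only counts calls (hypothetical instrumentation)."""

    def __init__(self, inner: Any) -> None:
        self._inner = inner
        self.calls = 0

    def get_cost_and_usage(self, **kwargs: Any) -> dict:
        self.calls += 1
        return self._inner.get_cost_and_usage(**kwargs)


class FakeCE:
    """Hypothetical client that serves exactly two pages."""

    def get_cost_and_usage(self, **kwargs: Any) -> dict:
        if kwargs.get("NextPageToken") is None:
            return {"ResultsByTime": [{"day": 1}], "NextPageToken": "t1"}
        return {"ResultsByTime": [{"day": 2}]}
```

One logical drill against `FakeCE` costs two calls through the proxy, which is the same effect that turns 3 drills into 5 calls in the live run.
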
### Backend — shape detection (pure-local, no I/O)
Imports the real detect_shape and runs it on four synthetic daily series.
$ uv run python /tmp/demo-klair2861-p3-shape.py
--- FLAT ---
kind = flat direction = None step_date = None burst_days = None
--- STEADY RAMP ---
kind = steady_ramp direction = up step_date = None burst_days = None
--- STEP CHANGE @ idx6=2026-01-07 ---
kind = step_change direction = None step_date = 2026-01-07 burst_days = None
--- BURST 2-day @ idx5-6 ---
kind = burst direction = None step_date = None burst_days = 2
Each series lands on its intended class with correct parameters: ramp direction up, step boundary on the known shift day 2026-01-07, and a 2-day burst counted as burst_days = 2.
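
To make the four classes concrete, here is a toy classifier that reproduces the same labels on similar synthetic series. It is not the real detect_shape: `classify_shape`, the `Shape` dataclass, and every threshold (`tol`, `burst_factor`, `max_burst`) are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import mean, median
from typing import Optional, Sequence


@dataclass
class Shape:
    kind: str
    direction: Optional[str] = None
    step_date: Optional[date] = None
    burst_days: Optional[int] = None


def _is_level(values: Sequence[float], tol: float) -> bool:
    """True when every value is within a relative tolerance of the segment mean."""
    m = mean(values)
    return all(abs(v - m) <= tol * max(abs(m), 1.0) for v in values)


def classify_shape(days, values, tol=0.05, burst_factor=3.0, max_burst=3) -> Shape:
    if _is_level(values, tol):
        return Shape("flat")
    # Burst: a few days far above the median while the rest stay level.
    threshold = burst_factor * max(median(values), 1.0)
    spikes = [v for v in values if v > threshold]
    calm = [v for v in values if v <= threshold]
    if 0 < len(spikes) <= max_burst and _is_level(calm, tol):
        return Shape("burst", burst_days=len(spikes))
    # Step change: some split leaves two segments that are each level.
    for i in range(1, len(values)):
        if _is_level(values[:i], tol) and _is_level(values[i:], tol):
            return Shape("step_change", step_date=days[i])
    # Otherwise treat it as a gradual ramp and report its direction.
    direction = "up" if values[-1] > values[0] else "down"
    return Shape("steady_ramp", direction=direction)


days = [date(2026, 1, 1) + timedelta(days=i) for i in range(12)]
step = classify_shape(days, [100.0] * 6 + [200.0] * 6)
```

With a 12-day series starting 2026-01-01 and a level shift at index 6, `step.step_date` is 2026-01-07, the same boundary the demo series uses.
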
### Most at risk from this change
1. The docker/k8s cost endpoints whose auth/client seam was ripped out and replaced by the shared provider (test_cost_explorer_service.py + saas-budgeting router tests).
2. The four endpoints migrated to require_all_bu_access — the 403 gate must still hold (router tests).
3. Master lookup against a live warehouse column not in the repo DDL — P1 above live-proves it.
$ cd klair-api && uv run pytest \
tests/services/test_cost_explorer_client.py \
tests/services/test_mover_explain_service.py \
tests/routers/test_cost_movement_explain_router.py \
tests/services/test_cost_explorer_service.py \
tests/routers/test_saas_budgeting_router.py -q
130 passed in 1.45s
## Overview
B1 of the Cost Movement (QoQ) "Explain this mover" phase: a cross-account AWS Cost Explorer drill that root-causes a single QoQ mover. The Phase-A surface tells the user *that* a BU/account moved; this tells them *why* — the warehouse has no usage_type / region / purchase-type granularity, so the answer can only come from CE. This PR extracts a reusable multi-account CE auth seam, adds the MoverExplainService drill on top of it, and exposes it via GET /api/aws-spend/cost-movement/explain.
Linear ticket: [KLAIR-2861 — QoQ B1 — Backend: cross-account Cost Explorer drill service + /cost-movement/explain](https://linear.app/builder-team/issue/KLAIR-2861)
## Specs
- [Spec 05 — backend-cost-explorer-client-provider](features/aws-spend/cost-movement-qoq/specs/05-backend-cost-explorer-client-provider/spec.md) — Multi-account auth-seam refactor. Extracts the cross-account CE auth plumbing out of cost_explorer_service.py into a new services/cost_explorer_client.py so the new drill and the existing SaaS-Budgeting CE endpoints share one seam.
- [Spec 06 — backend-mover-explain-service](features/aws-spend/cost-movement-qoq/specs/06-backend-mover-explain-service/spec.md) — The MoverExplainService drill + GET /cost-movement/explain endpoint that consume the spec-05 seam to root-cause one mover.
## Implementation
Spec 05 — CE client provider:
- New services/cost_explorer_client.py with CostExplorerClientProvider: get_client(account_id, *, session_name) mints + caches per-account CE clients via host → bridge (klair-api-cost-explorer-role, 479395885256) → ESW-CO-ReadOnly-P2. Bridge creds cached once and reused for every target client; per-account client cache keyed by account_id; TTL derived from STS Credentials.Expiration minus a 5-min safety margin (replacing the old fixed 50-min timer); region_name pinned us-east-1; retries={"mode": "standard"}; thread-safe via a double-checked threading.Lock; per-caller session_name threaded to the target AssumeRole for CloudTrail attribution. STS/CE errors propagate (no silent empties).
- Shared module-level paginate_cost_and_usage(client, **params) helper (the NextPageToken loop, extracted from the inline copy).
- cost_explorer_service.py thinned to a provider caller — the module-global _ce_client / _get_ce_client / _build_ce_client and inline pagination loop are deleted; get_docker_cost_by_week / get_kubernetes_cost_by_week behavior unchanged; VDI account sourced from MASTER_PAYERS.
- New require_all_bu_access FastAPI dependency in saas_budgeting_router.py; /docker-cost, /kubernetes-cost, and both /adjustments (POST + DELETE) migrated onto it (inline _user_has_all_dashboard_bus 403 blocks removed).
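
The caching behavior described for the provider (expiry-derived TTL, 5-minute margin, double-checked lock) can be sketched without boto3. This is an illustration under assumptions, not the provider itself: `ClientCache`, the injected `mint` callable, and the injectable `now` clock are hypothetical; the real class performs the STS hops where `mint` is called here.

```python
import threading
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Dict, Optional, Tuple

SAFETY_MARGIN = timedelta(minutes=5)

# Hypothetical minting hook: returns (client, credentials expiration).
Mint = Callable[[str, str], Tuple[Any, datetime]]


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


class ClientCache:
    """Per-account client cache whose TTL follows the credentials' own expiry."""

    def __init__(self, mint: Mint, now: Callable[[], datetime] = _utc_now) -> None:
        self._mint = mint
        self._now = now
        self._lock = threading.Lock()
        self._entries: Dict[str, Tuple[Any, datetime]] = {}

    def _fresh(self, account_id: str) -> Optional[Any]:
        entry = self._entries.get(account_id)
        if entry is not None and self._now() < entry[1]:
            return entry[0]
        return None

    def get_client(self, account_id: str, *, session_name: str) -> Any:
        client = self._fresh(account_id)  # first check, without the lock
        if client is not None:
            return client
        with self._lock:
            client = self._fresh(account_id)  # second check, under the lock
            if client is not None:
                return client
            client, expiration = self._mint(account_id, session_name)
            # Refresh 5 minutes before the credentials actually expire.
            self._entries[account_id] = (client, expiration - SAFETY_MARGIN)
            return client
```

With one-hour credentials, a second call at minute 10 reuses the cached client, while a call at minute 56 mints a new one because the entry was stamped to expire at minute 55.
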
Spec 06 — mover explain service:
- New services/mover_explain_service.py (MoverExplainService): master-payer lookup from core_finance.aws_spend_net_amortized_costs_adjusted (master_account_id column live-verified 2026-06-10); NetAmortizedCost-pinned drill SERVICE → USAGE_TYPE+REGION → PURCHASE_TYPE → DAILY; LINKED_ACCOUNT filter omitted for the consolidated payers EY (540235812892) and Wine Cellar (637422716207); include_bedrock exclusion via 27 live-derived CE SERVICE names mirroring the is_ai_service UDF; shape detection (flat / steady_ramp / step_change(date) / burst(N)).
- New GET /api/aws-spend/cost-movement/explain endpoint in aws_spend_router.py — carries the CROSS-ACCOUNT COST EXPLORER ROLE CONSUMER stamp, gated by require_all_bu_access, reversed/equal-quarter 400 guard, asyncio.to_thread dispatch; Pydantic v2 response models added to cost_explorer_models.py.
- /cost-movement/explain appended to consumers in cost_explorer_master_payers.json (B0 carry-forward doctrine).
## Test coverage
70+ new unit tests across:
- tests/services/test_cost_explorer_client.py — provider caching / TTL / concurrency / error propagation; bridge-once reuse; region_name pin; session_name threading; paginate_cost_and_usage NextPageToken concatenation.
- tests/services/test_mover_explain_service.py — master-lookup SQL shape; consolidated-payer LINKED_ACCOUNT omission; NetAmortizedCost metric pinning; drill call sequence; shape detection incl. edge cases.
- tests/routers/test_cost_movement_explain_router.py — endpoint 403 (non-all-BU) / 400 (reversed quarter) / 200; param aliases.
All passing; CI green (ruff-check pass; frontend jobs skip — backend-only).
## Self-review findings addressed
1. master_account_id source — live-verified against the read-only Redshift cluster (the column is on the live *_adjusted view but absent from the repo DDL); confirmed the A-phase mapping table has no payer column, so the adjusted view is the correct source.
2. Bedrock CE set — expanded the exact-match BEDROCK_CE_SERVICES set from warehouse ground truth (CE Dimensions filters can't wildcard the is_ai_service LIKE patterns), including per-model "Edition" names; Amazon QuickSight excluded (no false positive).
3. Reversed-quarter guard — added a router-level 400 for a reversed/equal quarter pair before the drill runs, since the CE TimePeriod is derived quarter_a-start → quarter_b-end.
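
The reason the guard in finding 3 matters is easiest to see when the TimePeriod is derived in code. The sketch below assumes quarter labels of the form `2025-Q4` (as printed in the demo output) and raises ValueError where the router returns a 400; `quarter_bounds` and `time_period` are hypothetical names. Cost Explorer treats the TimePeriod End date as exclusive, so the end is the first day after quarter_b.

```python
from datetime import date


def quarter_bounds(label: str) -> tuple[date, date]:
    """Return (start, exclusive_end) for a label such as '2025-Q4'."""
    year_s, q_s = label.split("-Q")
    year, q = int(year_s), int(q_s)
    if q not in (1, 2, 3, 4):
        raise ValueError(f"bad quarter: {label}")
    start = date(year, 3 * (q - 1) + 1, 1)
    end = date(year + 1, 1, 1) if q == 4 else date(year, 3 * q + 1, 1)
    return start, end


def time_period(quarter_a: str, quarter_b: str) -> dict[str, str]:
    """Derive the CE TimePeriod as quarter_a start through quarter_b end."""
    a_start, _ = quarter_bounds(quarter_a)
    b_start, b_end = quarter_bounds(quarter_b)
    if b_start <= a_start:
        # Reversed or equal pair: reject before any CE call is made.
        raise ValueError("quarter_b must be later than quarter_a")
    return {"Start": a_start.isoformat(), "End": b_end.isoformat()}
```

`time_period("2025-Q4", "2026-Q1")` spans 2025-10-01 up to (but not including) 2026-04-01, which is the 182-day window shown in the demo.
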
## Stacked on B0 (#2989)
This PR is stacked on klair-2860-ce-bridge-role-reconciler (B0, #2989). B1's cross-payer reach depends on B0 expanding the bridge inline policy to the 9 non-VDI payers — until that lands, only VDI (572481847476) is reachable; the provider is account-parameterized and ready for the fan-out. Confirm the bridge→payer hop after B0 lands (noted in the ticket). If B0 has already merged to main, rebase this branch onto main.
🤖 Generated with [Claude Code](https://claude.com/claude-code)