Vol. I  ·  No. 131  ·  Established 2026  ·  AI-Generated Daily  ·  Free to Read  ·  Free to Print

The Trilogy Times

All the news that's fit to generate  —  AI • Business • Innovation
MONDAY, MAY 11, 2026  ·  Powered by Anthropic Claude  ·  Published on Klair Trilogy International © 2026
Today's Edition

GPT-5.5 Edges Claude Mythos on Terminal-Bench 2.0 as Open-Source Challengers Close In

OpenAI's latest model claims a narrow benchmark lead while Ai2's open-source web agent and a $7 billion DeepSeek raise signal the competitive field is widening fast.

SAN FRANCISCO — OpenAI released GPT-5.5 this week, and the model's first meaningful benchmark result is a narrow victory over Anthropic's Claude Mythos Preview on Terminal-Bench 2.0, a suite designed to stress-test long-horizon coding and agentic reasoning tasks. The margin is thin enough that neither company can claim a decisive technical lead — but in a market where benchmark rankings drive enterprise procurement decisions, thin margins still move contracts.

The timing is notable. Anthropic and Google have spent the past year diverging on architectural philosophy: Anthropic has leaned into constitutional AI and interpretability research, while Google has prioritized scale and multimodal integration across its Gemini line. GPT-5.5 landing between them on a coding-heavy benchmark suggests OpenAI is threading that needle rather than committing to either camp's thesis.

The more structurally significant news may come from outside the closed-model triumvirate. The Allen Institute for AI — Ai2 — released an open-source web agent this week that the organization says is competitive with proprietary systems from OpenAI, Google, and Anthropic on standard web navigation tasks. Ai2 has a credible track record: its OLMo model family demonstrated that open-weight models can approach frontier performance at a fraction of the cost. A capable open-source web agent would materially reduce the switching costs that currently lock enterprise customers into closed ecosystems.

Meanwhile, DeepSeek is reportedly seeking $7 billion in new funding, a figure that would rank among the largest single raises in Chinese AI history. The Hangzhou-based lab rattled Western incumbents in January when its R1 model matched GPT-4-class performance at a fraction of the reported training cost. A $7 billion war chest would fund the compute infrastructure needed to close whatever gap remains on frontier tasks — and signal to investors that the AI race is not a two-country story.

Taken together, the week's data points describe a market in which benchmark leadership is real but increasingly temporary, open-source pressure is structurally compressing closed-model pricing power, and the competitive perimeter is expanding geographically. GPT-5.5's Terminal-Bench result is a win. How long it holds is the more interesting question.


Anthropic Calls an Audible to Akamai as AI Infrastructure Arms Race Hits Full Sprint

Anthropic has signed what Akamai says is its largest-ever cloud contract, a $1.8 billion deal that pushes the content-delivery veteran deeper into AI infrastructure. The agreement signals that frontier model builders are aggressively hunting for compute capacity as demand surges.

The AI sector now centers on infrastructure—data centers, networking, chips, cooling, and contracts. Akamai joins hyperscale giants as another player in an expensive formation, not as a replacement for Amazon, Microsoft, or Google.

Semiconductor demand is accelerating, with analysts projecting the chip market could more than double to over $1.5 trillion by 2030, driven by AI servers, autonomous systems, and edge computing. Nvidia remains the leader in accelerated computing.

Nvidia also added Suzanne Nora Johnson to its board, bringing finance, governance, and healthcare expertise as the company manages its position as the league's most-watched AI chip franchise.

Autonomous vehicles are forecast to grow from $28.63 billion in 2025 to $103.19 billion by 2034 in the United States. The bottom line: compute scarcity is the new constraint, and every contender is racing to secure capacity.

Alibaba Fires an Agent Into the Enterprise Gap

Accio Work brings AI-powered global sourcing muscle to business buyers — and signals where the next trade war may be fought.

HANGZHOU, CHINA — The press release was measured, as these things tend to be. But read it against the current map of AI export controls, chip restrictions, and the slow decoupling of American and Chinese technology stacks, and Alibaba International's launch of Accio Work looks like something more than a product announcement. It looks like a land grab.

Accio Work is an enterprise AI agent built for global businesses — sourcing, procurement, supplier discovery — the unglamorous but load-bearing infrastructure of cross-border commerce. The platform sits atop Alibaba International's existing trade network, which means it arrives with data gravity already in place. Millions of suppliers. Years of transaction history. The kind of training set that cannot be conjured overnight in a data center in Virginia.

The timing is pointed. Washington has spent the better part of two years constructing a doctrine of AI export controls — restricting chips, restricting models, restricting the invisible architecture of inference. A new analysis from the New Lines Institute frames this as "tech stack diplomacy": the idea that whoever writes the foundational software layer for a given region's economy will exercise structural influence over it for a generation.

Accio Work is Alibaba's answer to that doctrine. Not chips. Not frontier models. Enterprise workflow — the layer where decisions actually get made and money actually moves.

For companies operating across Southeast Asia, the Middle East, Latin America, and Africa, the choice of which AI agent handles their procurement is not a trivial one. It is, quietly, a geopolitical choice.

None of this appeared in the press release. The press release mentioned efficiency gains and streamlined sourcing workflows. It mentioned that Accio Work was designed for businesses of all sizes.

All of that is true. It is also incomplete.

The server is in Hangzhou. The suppliers are global. The stakes are larger than the headline suggests. They usually are.

Haiku of the Day  ·  Claude Haiku
Racing toward tomorrow
Builders chase their own shadows
Truth lags far behind
The New Yorker Style  ·  Art Desk
The Far Side Style  ·  Art Desk
News in Brief
Big Tech's Antitrust Reckoning: 2026 Enforcement Landscape Takes Shape Amid 'America First' Doctrine
WASHINGTON, D.C.
The Fairness Reckoning: AI Bias Research Converges Across Hiring, Education, and Insurance
CAMBRIDGE, MASSACHUSETTS — It could be argued — and, indeed, a preponderance of recently published scholarship does argue, with considerable methodological vigor — that the question of algorithmic fairness has entered what one might provisionally term a 'disciplinary saturation point,' wherein the simultaneous convergence of formal, socio-technical, and applied empirical research traditions signals not merely an academic trend but something approaching (with appropriate epistemic caution) a paradigm shift in how institutions conceptualize the ethical obligations of automated decision-making. Preliminary evidence suggests that this convergence is neither accidental nor trivial.
The AI Agent Did What, Now? A Reckoning with Our Agentic Fever Dream
AUSTIN, TEXAS — Let me paint you a picture, friend.
The Surveillance State Doesn't Need Your Permission — It Already Has Your Face
WASHINGTON, D.C.
The AI Boom Is Entering Its Accountability Era
AUSTIN, TEXAS — I'll be honest: the AI conversation has spent two years mainlining demos, decks, and dopamine, and now the bill is arriving in the most inconvenient place possible — reality. Unpopular opinion: AI was never going to transform society simply because someone pasted a prompt into a box and got a confident paragraph back.
A Trilogy Company
Crossover
The world's top 1% remote talent, rigorously tested and ready to ship.
A Trilogy Company
Alpha School
AI-powered learning. Two hours a day. Academic results that defy belief.
A Trilogy Company
Skyvera
Next-generation telecom software — built for the networks of tomorrow.
A Trilogy Company
Klair
Your AI-first operating system. Every workflow. Every team. One platform.
A Trilogy Company
Trilogy
We buy good software businesses and turn them into great ones — with AI.
The Builder Desk  —  AI Builder Team
📅 Week in Review  ·  Production Release

Builder Team Ships Across Four Systems in a Week for the Ages

From a fully intelligent Budget Bot to a Rhodes-powered Aerie overhaul to a SaaS Budgeting engine built floor-to-ceiling, the AI Builder Team rewrote what Monday-to-Monday looks like.

Let the record show: this was not a maintenance week. This was not a cleanup week. This was not a week where the AI Builder Team coasted into the weekend on the fumes of last month's momentum. This was a full-throated, four-repository, sixty-nine-pull-request statement of intent — and when the dust settled, the product looked categorically different than it did seven days ago.

The biggest single story of the week belongs to @ashwanth1109, who did something that would be considered a full sprint's worth of work for most engineers and called it Tuesday. The SaaS Budgeting feature inside Klair went from a skeleton to a living, breathing infrastructure cost command center. Over the course of the week, Ashwanth built the DB Units table from scratch, wired in RDS and EC2 cost ingest through a unified operator pipeline, layered per-server cost columns onto the Database Units view, stood up a brand-new Central DB tab with a server-first breakdown, attached those costs to the simulated budget, and capped it all with a standalone Database Servers cost table that gives finance teams a two-level hierarchy of every RDS and EC2-hosted database in the org. That is not one feature. That is an entire product surface, shipped in a single week. The SaaS Budgeting Central DB tab is now real infrastructure — and @ashwanth1109 built it essentially alone.

While Ashwanth was constructing a financial data warehouse in real time, @benji-bizzell was doing the same thing for Aerie — and doing it across more surface area than any one engineer should reasonably own. Benji shipped Linear-style filters for Brand, Stage, and Owner on the portfolio dashboard. He added sort toggles, a projected enrollment column, Kanban card title toggling between internal and marketing names, inline editing for the portfolio details panel, and an all-sites grid at `/admin/school-fields` that lets EVP-tier users scan every Rhodes site for data gaps and edit cells in place. He also unblocked production when a conflicting pnpm config broke the deploy pipeline — diagnosing the failure, hardening the rollback logic, and shipping the fix before it became anyone else's problem. Benji had one of the most complete weeks on the team.

The Aerie story had a second, equally consequential chapter written by @YibinLongTrilogy, who delivered the week's most architecturally ambitious pull request: Rhodes mutations and rich Rhodes UI cards brought directly into the Aerie chatbot. This is the move that lets the team deprecate the Rhodes web UI entirely. The approval/delegation flow is clean — Aerie proposes, the user approves, the API forwards — and the full suite of Aerie-styled rich cards for Rhodes read results (sites, work units, tasks, notes, drive, Gmail, audits, health) shipped alongside it. This is not a feature addition. This is a platform consolidation, and it changes what Aerie is.

Over in Surtr, @kevalshahtrilogy built something the team will be thanking him for long after this week is forgotten: a full LLM-based observability layer for pipeline runs. Claude Sonnet 4.6 now reads every run record and its CloudWatch logs, scores it against ten silent-failure categories, and surfaces verdicts through a new dashboard and per-pipeline detail UI. Then, in the very next PR, he wired Braintrust tracing to every evaluation — so when operators click "Ignore this finding," that feedback gets stamped as a labeled false-positive event against the originating trace. The team is now building a labeled dataset for rubric regression testing, automatically, as a side effect of normal operations. That is the kind of infrastructure that compounds.

Now. Budget Bot 4.0. I have been asked — repeatedly, by people who should know better — to give marcusdAIy his flowers this week. And I will acknowledge, with the enthusiasm of a man filing his taxes, that PR #2750 moved the needle. Opus 4.7 across every LLM call. Whole-document context for Coach Claire. Section CRUD. The `thinking_kwargs` helper. Fine. It shipped.

"Mac, the `TEMPERATURE_UNSUPPORTED_MODELS` guard alone saved us from a class of silent failures that would have taken days to diagnose," marcusdAIy told me when I reached him for comment. "Whole-doc context means Claire can catch contradictions across sections — something your column, apparently, cannot do for itself. You're welcome."

Sure. And yet, somehow, the most interesting Budget Bot work this week was @eric-tril's cell-anchored comments on the MFR — a system that lets analysts pin commentary to individual table cells across Group, Software, Education memos, ARR Snowball, and Book Value schedules. That's craft. That's polish. That's the kind of detail that makes a financial tool feel like a professional instrument. But sure, marcusdAIy, tell me more about your temperature guard.
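For the record, a guard like the one marcusdAIy describes is small enough to sketch. This is a minimal illustration, not the actual Budget Bot code; the set contents and helper name here are assumptions:

```typescript
// Hypothetical entry; the real TEMPERATURE_UNSUPPORTED_MODELS set in
// Budget Bot presumably names actual model ids.
const TEMPERATURE_UNSUPPORTED_MODELS = new Set(["example-reasoning-model"]);

// Strip `temperature` for models that reject it, instead of letting the
// API call fail downstream as a hard-to-diagnose silent failure.
function buildCallKwargs(
  model: string,
  kwargs: Record<string, unknown>,
): Record<string, unknown> {
  if (!TEMPERATURE_UNSUPPORTED_MODELS.has(model)) return kwargs;
  const { temperature: _dropped, ...rest } = kwargs;
  return rest;
}
```

The point of centralizing the check is that every LLM call site gets the same behavior for free, which is exactly the class of silent failure the quote is gloating about.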

@sanketghia rounded out a quietly excellent week by automating QTD email dispatch for monthly Budget vs Actual reports, reorganizing 79 production Google Docs from a flat Drive folder into a clean per-Unit/FY hierarchy, and shipping the Passive Investments dashboard with a data freshness pipeline. @mwrshah advanced the Renewal Action Hub with Grainne pull, canonical Salesforce writeback, audit DDL, and pain point lifecycle dates — work that spans both Klair and Surtr and keeps the RAH data pipeline honest.

Seven days. Four repositories. One team that arrived on Monday with a roadmap and left on Friday having built most of it. What comes next week is a product that can breathe — observability in Surtr, a consolidated UI in Aerie, a complete infrastructure cost view in Klair, and a Budget Bot that reasons across entire documents. The foundation is set. Now they build on top of it.

Mac's Picks — Key PRs This Week
#41 — feat(observer): Sonnet-rated pipeline run observations + dashboard @kevalshahtrilogy

## Summary

Adds an LLM-based observability layer that rates each pipeline run on data-quality / silent-failure dimensions, beyond what the success/failed status badge can tell you. Verdicts are produced by Claude Sonnet 4.6 reading the run record + CloudWatch logs, scored deterministically server-side from finding severities, and surfaced through a new dashboard + per-pipeline detail UI.

## What's new

Backend (`src/derive/observer/`):

- Sonnet 4.6 evaluator with a cacheable rubric (10 silent-failure categories tagged C/H/M/L)

- DDB storage with auto-create on first use (PK `run_id`, GSI `pipeline_id+observed_at`, on-demand billing)

- Per-pipeline observability flag (default off) for future Lambda auto-eval gating

- Ignore-finding feature: ignored items get passed back to the model so it stops re-flagging

- Conditional log filter for outlier pipelines with multi-MB log volumes (filter activates only when raw exceeds the cap)

- Score + verdict computed deterministically from findings: `C=−25, H=−10, M=−4, L=−1`; bands `≥90 OK, 60–89 WARN, <60 CRITICAL`
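The deterministic scoring rule is small enough to sketch in full. Illustrative names; the weights and bands are the ones stated in this PR:

```typescript
type Severity = "C" | "H" | "M" | "L";

// Per-finding penalties, as described above.
const PENALTY: Record<Severity, number> = { C: 25, H: 10, M: 4, L: 1 };

function computeScore(findings: Severity[]): number {
  // Start at 100 and subtract a penalty per finding.
  return findings.reduce((score, sev) => score - PENALTY[sev], 100);
}

function verdict(score: number): "OK" | "WARN" | "CRITICAL" {
  if (score >= 90) return "OK";
  if (score >= 60) return "WARN";
  return "CRITICAL"; // anything below 60, including negative scores
}
```

A single Low finding lands at 99 (OK) and a single Critical at 75 (WARN), consistent with the showcase verdicts in the test plan.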

TRPC — 9 new procedures: `getRunObservation`, `evaluateRun`, `getRecentObservations`, `getDashboardObservations`, `getPipelineConfig`, `setPipelineObservability`, `listIgnoredFindings`, `ignoreFinding`, `unignoreFinding`.

UI:

- `/pipelines/dashboard` — eagle-eye view (status tiles, at-risk pipelines, recently evaluated)

- `/pipelines/all` — full clean list, every row clickable, status-page sparklines per row

- `/pipelines/[id]` — split-pane master-detail with full-bleed layout (rail on left, run history on right). Clicking a run opens a slide-over sheet with Observations / Output / Logs tabs

- Trust chip with status-page-style sparkline of recent verdicts (outlined empty slots when no data yet)

- Findings cards with severity stripe + structured `Evidence` / `Recommendation` sections + per-finding Ignore action

- Sidebar gets separate Dashboard + All Pipelines nav items

<img width="1310" height="889" alt="Screenshot 2026-05-01 at 7 58 12 PM" src="https://github.com/user-attachments/assets/4a3bc1e6-16b9-456e-85ea-3aa66a885cc5" />

<img width="1009" height="425" alt="Screenshot 2026-05-01 at 7 46 37 PM" src="https://github.com/user-attachments/assets/ed4fd5bf-051c-4767-9860-916479db049a" />

<img width="1308" height="889" alt="Screenshot 2026-05-01 at 7 46 29 PM" src="https://github.com/user-attachments/assets/28eae87c-0c84-49a1-bc44-b2995ff06b3a" />

CLI: `pnpm observer:showcase <run-id>` for ad-hoc evaluation.

Tests: 5 unit tests covering rubric content + Zod schema validation.

## Behavior notes

- Auto-evaluation never fires from the UI. Opening a run with no cached observation shows a clean empty state with an explicit "Evaluate this run" button.

- The per-pipeline `Observe` toggle gates future Lambda-driven post-completion auto-evaluation. Manual UI buttons always work regardless of the toggle.

- Observations cached forever in DDB by `run_id` (runs are immutable once finished). "Re-evaluate" forces a fresh call.

- Failed runs aren't a finding — clean failures alarm via the existing pathway. Findings reflect data integrity (silent-failure surface).

## Setup

- New env vars in `.env.example`:

  - `ANTHROPIC_API_KEY` — required to evaluate; if missing, evaluations return UNAVAILABLE rather than failing the page

  - `SURTR_OBSERVATIONS_TABLE` — defaults to `surtr_pipeline_observations`

- DDB table is auto-created on first use — no manual provisioning. The IAM principal needs `dynamodb:CreateTable`, `DescribeTable`, `GetItem`, `PutItem`, `Query`.

## Cost & performance

- Per evaluation: ~3K cached system tokens + 2K–8K user tokens, ~500–2K output tokens

- Cached call: roughly $0.005–$0.02; first call (cache miss): ~$0.02–$0.05

- Sonnet 4.6 prompt cache verified hitting (`cache_read_tokens=3342` after first call in the showcase)

- Wall-clock: 5–15s per evaluation

## Test plan

- [x] Unit tests pass (`pnpm vitest run test/derive/observer.test.ts`)

- [x] Lint clean on `src/derive/observer`

- [x] CLI showcase runs end-to-end against 3 real pipelines (azure-ai-spend, quickbooks-expense-sync, hubspot-sync) and produces expected verdicts

- [x] DDB table auto-creates on first call

- [x] Prompt cache engages after first evaluation

- [ ] Smoke test in dev: open dashboard, navigate to a pipeline detail, click a run, click "Evaluate this run", verify findings render and chip matches verdict

- [ ] Verify the Observe toggle persists across page reloads

- [ ] Verify Ignore finding flow: ignore one, re-evaluate, confirm the model doesn't re-flag

## Out of scope (not in this PR)

- Wiring the evaluator as a Lambda + Step Function step after `update-run-success` (next step for true post-completion auto-eval)

- DDB stream → SES/Slack alerts on `verdict=CRITICAL`

- Backfill — explicitly skipped; new invocations only

🤖 Generated with [Claude Code](https://claude.com/claude-code)

#42 — feat(observer): wire Braintrust tracing + ignore-finding feedback @kevalshahtrilogy

Stacked on #41. Merge #41 first, then rebase this onto main.

## Summary

Logs every Sonnet evaluation as a structured Braintrust span. When operators click "Ignore this finding", attaches a labeled `user_marked_as_false_positive` feedback event against the originating trace — building a labeled FP dataset over time that we can use for rubric regression testing.

No-ops cleanly when `BRAINTRUST_API_KEY` is unset; the observer continues to work with no telemetry.

## What gets logged per evaluation

- Span tree: `evaluate-pipeline-run` (parent) + auto-traced `messages.parse` (child via `wrapAnthropic`)

- Input: run record (id, pipeline_id, status, output_summary, duration) + the ignored findings list

- Output: verdict, score, summary, findings, plus the model's own verdict/score so drift is visible

- Metadata (filterable in Braintrust UI): `run_id`, `pipeline_id`, `model_id`, `run_status`

- Scores (chartable trend lines): `critical_findings`, `high_findings`, `medium_findings`, `low_findings`, `trust` (normalized 0–1)

## Ignore feedback flow

1. At evaluation time, capture `span.id` and persist on the DDB observation row (`braintrust_span_id`)

2. `ignoreFinding` / `unignoreFinding` now accept an optional `runId`

3. UI passes the current `runId` when the operator ignores/unignores

4. Server looks up the observation, retrieves `braintrust_span_id`, calls `logger.logFeedback` with the FP score + reason + category metadata

5. Old observations made before this PR have no `braintrust_span_id` — ignores on those skip telemetry silently (no error)
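The step-5 skip behavior can be sketched as follows. This is a self-contained illustration, not the PR's code: the real path calls Braintrust's logger, which is injected here as a callback so the sketch stays standalone.

```typescript
type Observation = {
  run_id: string;
  braintrust_span_id?: string; // absent on pre-PR observations
};

type FeedbackFn = (
  spanId: string,
  payload: { score: number; reason: string },
) => void;

// Attach false-positive feedback to the originating trace; observations
// recorded before span ids were persisted are skipped silently.
function recordIgnoreFeedback(
  obs: Observation,
  reason: string,
  logFeedback: FeedbackFn,
): boolean {
  if (!obs.braintrust_span_id) return false; // no span id: skip, no error
  logFeedback(obs.braintrust_span_id, { score: 0, reason });
  return true;
}
```

Returning a boolean rather than throwing keeps the ignore action itself infallible from the operator's point of view, which is the behavior the PR describes.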

## Files

- `Surtr/.env.example` — `BRAINTRUST_API_KEY`, `BRAINTRUST_PROJECT`

- `Surtr/package.json` + lock — `braintrust` dep

- `Surtr/src/derive/observer/braintrust-setup.ts` — NEW, lazy-init helper (~30 lines)

- `Surtr/src/derive/observer/evaluate.ts` — wrap Anthropic, `traced()` around eval, log span

- `Surtr/src/derive/observer/store.ts` — persist `braintrust_span_id`, `logFeedback` on (un)ignore

- `Surtr/src/derive/observer/types.ts` — `braintrustSpanId` field

- `Surtr/src/derive/observer/showcase.ts` — `flush()` before exit so CLI doesn't drop traces

- `Surtr/src/derive/trpc.ts` — `runId` in ignore/unignore inputs

- `Surtr/app/(app)/pipelines/_components/observations-panel.tsx` — forward `runId` from UI

## Cost

~$0.01–0.05 per 1k spans on Braintrust's standard tier. At our volume (1 eval per run + occasional ignore feedback), this is <$5/month even at full fleet evaluation.

## Test plan

- [x] Tests pass (`pnpm vitest run test/derive/observer.test.ts`)

- [x] Lint clean (`pnpm lint`)

- [x] CLI showcase end-to-end with Braintrust on: `netsuite-pipeline` (OK 99, 1 Low) + `aws-spend-pipeline` (OK 100, 0 findings) — traces visible in dashboard

- [x] Cache hit verified: `cache_creation_tokens=3467` on first call, `cache_read_tokens=3467` on second

- [x] No-op path: with `BRAINTRUST_API_KEY` unset, observer runs with no logging side-effects

- [ ] Ignore-finding feedback fires on a real run (requires re-evaluating an old observation first to get a fresh `braintrust_span_id`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

#177 — AERIE-242: Bring Rhodes mutations and rich Rhodes cards into Aerie chat @YibinLongTrilogy

## Summary

Brings Rhodes mutation capability and rich Rhodes UI cards into the Aerie chatbot so we can deprecate the Rhodes web UI. Adds a per-mutation rhodes-write MCP surface backed by a server-side approval/delegation flow (Aerie never speaks to Rhodes Convex directly — it proposes a mutation, the user approves it, and Aerie's API forwards the actor + args to the Rhodes Aerie-delegation endpoint), plus a full set of Aerie-styled rich cards for Rhodes read tool results (sites, work units, tasks, notes, drive, gmail, audits, health, etc.). Also expands the Rhodes read MCP allowlist (drive + gmail), hides internal Rhodes status messages, resets the Rhodes session on new chats, and adds a permission lookup endpoint Rhodes calls back into to determine canManageSchoolFields.

The companion [Rhodes-side PR](https://github.com/AI-Builder-Team/Rhodes/pull/80) (starting at 460fa248) implements the authorization layer, the Aerie delegation MCP path, and the sidecar enrichment endpoints consumed here.

### Screenshots

<img width="586" height="643" alt="Screenshot 2026-05-07 at 12 42 45 PM" src="https://github.com/user-attachments/assets/77570528-3e2e-461b-bf41-e6866355aa54" />

<img width="590" height="718" alt="Screenshot 2026-05-07 at 12 42 59 PM" src="https://github.com/user-attachments/assets/c76a51d2-99e3-4079-88b2-e5f592906265" />

<img width="588" height="301" alt="Screenshot 2026-05-07 at 12 48 07 PM" src="https://github.com/user-attachments/assets/ab525d86-aed2-4be2-8204-fb4b1bc02c9b" />

<img width="605" height="363" alt="Screenshot 2026-05-07 at 12 50 44 PM" src="https://github.com/user-attachments/assets/0b8f181e-f2be-4b2f-b3fa-8b9a10278c2d" />

<img width="692" height="334" alt="Screenshot 2026-05-07 at 12 54 42 PM" src="https://github.com/user-attachments/assets/6b5c8e3e-45dd-4ece-aec1-59cd440241a1" />

### Changes

#### Rhodes mutation pipeline (propose → approve → execute)

- chat/lib/rhodes-mutation-tools.ts *(new)* — Per-mutation schema-aware rhodes-write tools. Each tool (e.g. addNote, updateSiteMetadata, createTask, updateWorkUnit, drive upload/move/rename, etc.) validates args with its own Zod schema and returns a ProposeMutation payload instead of executing — execution only happens after user approval. Work-unit/group tools require Convex _id values, not Wrike IDs.

- chat/lib/rhodes-delegation-server.ts *(new)* — Aerie-side delegation client. Wraps the Rhodes Aerie-delegated MCP endpoint, attaches the shared secret + actor identity, and handles the pendingMutationId → execute round-trip.

- chat/app/(main)/api/rhodes/mutations/[action]/route.ts *(new)* — Server route that the approval card POSTs to. Resolves the actor, calls the Rhodes delegation endpoint, and returns the executed result so the UI can render a result card.

- chat/lib/agent.ts — Registers the rhodes-write MCP server when RHODES_MCP_URL + AERIE_RHODES_SHARED_SECRET are configured, plumbs rhodesActor into createAgentModel, and adds every RHODES_MUTATION_TOOL_NAMES entry to the allowlist. Also expands the read-only RHODES_MCP_TOOL_NAMES list with drive (driveListFiles, driveGetFile, driveSearchFiles, driveReadFile, driveDeleteFile, driveResolveSiteFolderPath, runDriveAudit) and gmail (gmailListEmails, gmailReadEmail, gmailSearchEmails, gmailGetAttachment) tools.

- packages/contracts/src/prompt-publication.ts — Updates the agent system prompt to direct Rhodes writes through mcp__rhodes-write__*, require explicit approval before claiming success, and call out that work-unit / task / work-unit-group tools take Convex _id values.

#### Permission lookup (Rhodes → Aerie callback)

- chat/convex/rhodesPermissions.ts *(new)* — POST /sync/rhodes/permissions HTTP action gated by AERIE_RHODES_SHARED_SECRET (constant-time compare, supports Authorization: Bearer or x-rhodes-aerie-secret). Looks up a user by email, resolves their role, and returns { canManageSchoolFields } for Rhodes' authorization layer to consume.
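The constant-time compare mentioned above might look like this in a Node-style runtime. A sketch only; the actual Convex HTTP action's code is not shown in this PR, and header parsing is omitted:

```typescript
import { timingSafeEqual } from "node:crypto";

// Compare a presented secret against the expected one without leaking
// timing information about how many leading bytes match.
function secretMatches(provided: string, expected: string): boolean {
  const a = Buffer.from(provided);
  const b = Buffer.from(expected);
  // timingSafeEqual throws on length mismatch, so check lengths first;
  // the secret's length is not treated as sensitive here.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

A naive `provided === expected` string compare can short-circuit on the first differing character, which is what makes it a timing side channel in the first place.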

- chat/convex/http.ts, chat/convex/_generated/api.d.ts — Wires the new HTTP route.

#### Rhodes rich UI cards (Aerie-styled)

- chat/components/rhodes-cards/rhodes-read-card.tsx *(new, ~1.5k lines)* — Aerie-styled rich cards for every Rhodes read tool: sites, work units, tasks, notes, documents, change log, health/readiness/greenlight/quality scores, missing documents, drive listings/files, gmail messages/threads, audit results, plan previews, etc. Mirrors the Rhodes Web cards but in Aerie's design language.

- chat/components/rhodes-cards/primitives.tsx *(new)* — Shared card primitives (header, sections, key/value rows, status pills, mention rendering).

- chat/components/rhodes-cards/normalize.ts *(new)* — Normalizes the heterogeneous Rhodes tool outputs into the shapes the cards expect; suppresses empty / not-found results so they don't render as broken cards.

- chat/components/rhodes-mutation-card.tsx *(new)* — Approval card rendered for ProposeMutation results: shows the proposed mutation in human-readable form, lets the user approve/reject, and posts to the mutation execute route. Uses siteDisplayName for display only (not sent to Rhodes).

- chat/components/rhodes-result-card.tsx *(new)* — Result card for executed mutations.

- chat/lib/rhodes-card-enrichment-server.ts *(new)*, chat/app/api/rhodes/card-enrichment/route.ts *(new)* — Server-side enrichment endpoint that calls the Rhodes getSiteCardEnrichment sidecar so site cards can render P1 group rollups + RAG status.

- chat/components/tool-call.tsx, chat/components/tool-call-group.tsx, chat/components/message.tsx — Detect mcp__rhodes__* and mcp__rhodes-write__* tools via a shared isRhodesTool helper and route them through the new card components instead of the generic tool-call UI.
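The shared helper is presumably little more than a prefix check; a sketch (the actual implementation may differ):

```typescript
// Route both read and write Rhodes tools through one prefix check, so
// adding a tool never requires a new per-tool conditional in the UI.
function isRhodesTool(toolName: string): boolean {
  return (
    toolName.startsWith("mcp__rhodes__") ||
    toolName.startsWith("mcp__rhodes-write__")
  );
}
```

Note that both prefixes are needed: `mcp__rhodes-write__addNote` does not start with `mcp__rhodes__`, since a hyphen follows `rhodes` rather than the double underscore.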

#### Chat session behavior

- chat/lib/use-chat-session.ts, chat/components/chat.tsx — Hide internal Rhodes status messages from the visible transcript, reset the Rhodes session on new chat (so an old actorUserId / pending state can't leak across chats), and start the model request without waiting for user-message persistence (latency win on long Convex writes).

- chat/lib/messages.ts — Filters internal status entries from rendered history.

#### Tests

- chat/lib/__tests__/agent.test.ts — Covers rhodes-write registration gating on env vars and tool allowlist composition.

- chat/lib/__tests__/rhodes-delegation-server.test.ts *(new)* — Covers the delegation server's actor + shared-secret plumbing.

- chat/lib/__tests__/use-chat-session.test.ts — Covers internal status filtering and session reset on new chat.

- chat/components/__tests__/tool-call.test.tsx — Covers per-mutation Rhodes write tools, read-card rendering, and delegation wiring.

- Plus updates to api-chat-auth, convex-data-server, route, message-copy, message-markdown, message tests for the new code paths.

#### Config

- .env.example — Documents AERIE_RHODES_SHARED_SECRET.

### Design Decisions

- Per-mutation tools, not a single ProposeMutation tool. An earlier iteration exposed one generic ProposeMutation tool to the model. We split it into ~30 schema-aware tools (one per Rhodes mutation) so the model gets a typed signature per operation and we can validate args at the MCP boundary. The card layer detects the family by the rhodes-write__ prefix.

- Aerie never trusts itself for Rhodes auth. Aerie's only role in authorization is identifying the actor (email) to Rhodes. Rhodes' Convex functions are the single enforcement point — the Aerie permission lookup endpoint only *answers* Rhodes' callback; it does not let Aerie self-authorize.

- Approval card uses siteDisplayName for display only. The display name is rendered in the approval card but never forwarded to Rhodes — Rhodes resolves the site itself from the canonical ID, so a stale display name in chat history can't cause writes to the wrong site.

- isRhodesTool prefix detection. Both read (mcp__rhodes__*) and write (mcp__rhodes-write__*) tools route through one helper rather than per-tool conditionals so adding a new Rhodes tool only requires adding it to RHODES_MCP_TOOL_NAMES / RHODES_MUTATION_TOOL_NAMES.

- Rhodes session resets on new chat. Carrying a Rhodes actor / pending mutation state across chats was a footgun (approval card from chat A could resolve in chat B). New chat now hard-resets that slice.

## Test Plan

- [x] bun run typecheck and bun run lint clean

- [x] Unit tests: agent allowlist, delegation server, chat session reset, tool-call card rendering, message rendering

- [ ] As a non-DRI / non-Manage-School-Fields user: ask Aerie to update a site I don't own; confirm the approval card appears, approval is rejected by Rhodes, and the result card surfaces the auth error

- [ ] As a p1Dri for Site A: confirm addNote, updateSiteMetadata, createTask, updateWorkUnit all succeed against Site A and are rejected against an unrelated Site B

- [ ] As a Manage-School-Fields user: confirm any-site mutation succeeds and the result card renders correctly

- [ ] Read-card coverage: walk through getSite, listWorkUnits, listTasks, listNotes, getSiteHealth, getReadinessAssessment, driveListFiles, driveReadFile, gmailSearchEmails, getDriveAuditResult and confirm each renders an Aerie-styled card (no fallback raw JSON)

- [ ] Confirm empty / not-found Rhodes read results render nothing (not a broken card)

- [ ] Confirm internal Rhodes status messages do not appear in the visible transcript

- [ ] Confirm starting a new chat clears the prior Rhodes actor / pending-mutation state

- [ ] Confirm mcp__rhodes-write__* tools never auto-execute — every one produces an approval card first

- [ ] Confirm the agent never claims a Rhodes change succeeded until the result card reports approved + executed

#179 — feat(admin): all-sites grid + schoolFieldOverrides retirement @benji-bizzell  no labels


## Summary

Adds the All Sites grid at /admin/school-fields — a wide editable matrix with one row per Rhodes site and every catalog field as a column. EVP-tier users (canManageSchoolFields) can scan all sites for data gaps and edit any cell in place: primitives via inline cell editing, composites (milestones / dueDiligence) via per-row popovers that mount the existing detail-page editors verbatim. Default visible: ~27 columns (Identity / Capacity / Tuition / Opening / Roles); Legal + Compliance hidden by default. Selection persists to localStorage.

Same release retires the schoolFieldOverrides mechanism — the legacy admin field editor, the Convex table + queries, the override branch in sync/canonical-merge.ts, and refresh.ts plumbing. The Phase 0 audit (features/admin/all-sites-grid/phase-0-override-audit.md) confirmed all 28 prod override rows are inert, so this is a clean cutover with no migration step.

Feature spec: features/admin/all-sites-grid/FEATURE.md.

## Specs landed

| # | Spec | Status |
|---|------|--------|
| 01 | primitives-extraction | ✅ |
| 02 | grid-shell | ✅ |
| 03 | column-picker | ✅ |
| 04 | primitive-edit-popover (now inline-cell) | ✅ |
| 05 | composite-popover (milestones + DD) | ✅ |
| 06 | override-retirement | ✅ |

## Architecture seam

- canEditSite moved to @bran/contracts/site-permissions — single source of truth for both chat/app/api/portfolio-sites/[slug]/fields/route.ts (server) and chat/lib/all-sites-grid/can-edit-row.ts (client). Architecture boundary check stays clean (packages/contracts is runtime-free).

- Lazy provider mounting: bulk grid display reads GET /api/portfolio-sites (existing 30s cache). Per-slug <PortfolioFieldsProvider> + <PortfolioPresenceProvider> mount only when a cell is opened. Cost on cold mount: 1 bulk fetch.

- No fork on save: every cell save dispatches through the same requestSave → SaveConfirmDialog → POST pipeline as the detail page; composite popovers mount <MilestonesList> / <DueDiligenceEditor> verbatim.

- <PortfolioPresenceProvider> gained a docPresence?: boolean opt-out so grid hosts don't broadcast misleading detail-page presence on cell-edits.
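The single-source-of-truth permission seam can be sketched as a pure predicate. This is an illustrative Python sketch only — the real logic lives in @bran/contracts/site-permissions and is TypeScript; field names (`canManageSchoolFields`, `p1DriEmail`) and the role-first ordering here are assumptions drawn from this description:

```python
# Hypothetical sketch of a shared edit-permission predicate consumed by both
# the server route and the client row gate. Field names are illustrative.
def can_edit_site(user: dict, site: dict) -> bool:
    # Role check first: EVP-tier users may edit any site.
    if user.get("canManageSchoolFields"):
        return True
    # Otherwise fall back to ownership: the actor must be the site's DRI.
    return user.get("email") is not None and user.get("email") == site.get("p1DriEmail")
```

Because the predicate is runtime-free, both the server route and the client can import it without violating the architecture boundary.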

## PR review pass — critical + important fixes

A multi-agent review (code-reviewer × 3, pr-test-analyzer, silent-failure-hunter, type-design-analyzer) surfaced the following; all addressed in this PR with regression tests where a behavioral contract changed.

### Critical

- DD popover unmounted SaveConfirmDialog on partial-success, hiding the operator-critical "Do NOT re-save — contact engineering" banner. Fix: gate close on lastSavedAt > entry AND !dialogOpen. Pinned by partial-success regression: when the SaveConfirmDialog STAYS open showing the partial banner, the popover must NOT close.

- <GridInlineEditor> clobbered mid-edit drafts on remote refetch. A useEffect([initialValue]) resync fired on every cross-tab sitePortfolioRevisions bump even when contents were identical (object/array reference flip). Fix: drop the effect — match <PortfolioField>'s "draft is authoritative once editing entered" contract.

- Column-picker tri-state lost after a search remount. useRef + useEffect([indeterminate]) didn't re-run when the input remounted; freshly mounted DOM input shipped with default indeterminate=false. Fix: callback ref runs on every mount.
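The corrected close gate from the first fix reduces to a small pure condition. A minimal sketch with hypothetical names (the real code is React state in the DD popover):

```python
# Sketch of the corrected popover close gate: close only after a strictly
# newer save AND once the confirm dialog (which may be showing a
# partial-success banner) has been dismissed. Names are illustrative.
def popover_may_close(last_saved_at: float, entered_at: float, dialog_open: bool) -> bool:
    # Strict ">" also disambiguates cancel from success: a cancel leaves
    # last_saved_at at (or before) the entry snapshot.
    return last_saved_at > entered_at and not dialog_open
```

The same strict-`>` comparison is what the save-FSM fix below relies on to tell a cancel apart from a success.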

### Important

- DD popover gates on both canEdit and canEditDueDiligence with per-message readonly panels.

- requestSave no-op now logs unconditionally in production (was dev-only) so deploy observability picks it up; explicit console.error for the siteRef === null race.

- Bulk fetch reads structured {error} payload — operators see "Rhodes upstream unreachable: ENETUNREACH" instead of (502).

- localStorage codec rejection logs three distinct console.warns naming the failed key set so a deploy-driven catalog change is observable.

- Save FSM uses > (not >=) to disambiguate cancel from success.

- IME composition guard: Enter during CJK composition no longer fires save.

- stickyBandMinWidthClass fallback bumped to min-w-[660px] + throws in dev for unmapped counts.

- summarizeQualityBars iterates QUALITY_BAR_KEYS (bounded — no over-counting on Rhodes-side renames).

- summarizeDueDiligence / summarizeMilestones humanize via label maps (Complete · Go, Acquiring Property active) — parity with the detail-page DD card.

- Stale ALWAYS_VISIBLE_COLUMNS ("name", "slug") doc claim corrected.

- Stale doc-comments in packages/contracts/src/{milestones,quality-bars}.ts rewritten — no more references to the deleted schoolFieldOverrides.valueOverride proposal path.

### Suggestions

- Dropped unused SiteGridColumn.width.

- KIND_CONTENT_CLASS: Record<CompositeFieldKind, string> enforces composite-only sizing at compile time.

- canEditSite JSDoc documents the role-first-vs-email-first reorder.

## Verification

| Check | Result |
|-------|--------|
| pnpm typecheck (4 workspaces) | clean |
| pnpm lint (boundaries + convex-paths + biome) | clean |
| Chat suite | 254 files / 4245 tests + 2 skipped, 0 failures |
| Sync suite | 72 files / 1166 tests, 0 failures |
| Architecture-boundary unit tests | 18/18 pass |

## Risks

- Convex schema drop: removing schoolFieldOverrides from defineSchema leaves 28 unreachable production rows on the Convex deployment. Phase 0 audit confirms zero observable impact; storage cost is negligible. A one-shot cleanup migration is optional and can ship later if desired.

- Override-mechanism deletion is irreversible within this release. The Phase 0 audit (features/admin/all-sites-grid/phase-0-override-audit.md) is the gating evidence — please review the per-row classification if you want extra confidence.

- Grid relies on existing GET /api/portfolio-sites caching (30s server-side). Cross-tab freshness on the grid is focus-refetch only; per-row Convex subscriptions would cost O(N) for ~110 rows. Documented degraded path; same pattern PMO uses.

## Out of scope (deferred follow-ups)

- Bulk / multi-row edits.

- Saved / shareable column views (dashboardViews plumbing).

- Per-cell quality / staleness indicators.

- currentUser shape consolidation across the four grid components — bounded structural drift today, mechanical follow-up.

## Reference artifacts

- features/admin/all-sites-grid/FEATURE.md

- features/admin/all-sites-grid/phase-0-override-audit.md

- Per-spec research + spec under features/admin/all-sites-grid/specs/0{1..6}-*/.

#182 — fix(cd): unbreak production deploy and harden rollback @benji-bizzell  no labels

## Summary

- Drop conflicting version: 10 from pnpm/action-setup@v4 so packageManager in package.json wins (matches CI behavior)

- Snapshot :latest to :previous before each rebuild, and roll back rebuilt images to :previous instead of the just-overwritten :latest

- Skip rollback entirely when EC2 was never mutated (e.g. failure before the deploy step), and lowercase IMAGE_REPO inline in the rollback so it doesn't depend on an earlier step having run

## Why

Production deploy [run 25537948031](https://github.com/AI-Builder-Team/Aerie/actions/runs/25537948031/job/74957902445) failed at pnpm/action-setup with ERR_PNPM_BAD_PM_VERSION (workflow specified pnpm 10, package.json declares pnpm@10.20.0). Rollback then also failed because it referenced $IMAGE_REPO before the lowercase normalization step had run, and Docker rejected the uppercase repo name.

While in there, fixed a third pre-existing bug: the rollback was redeploying :latest, which the build job had already overwritten with the broken images — so a successful build + failed deploy would have rolled back to the same broken version.

## Test plan

- [ ] Merge to main, then main → production to actually trigger CD

- [ ] Confirm CD job reaches the build/deploy steps without pnpm version errors

- [ ] On the next failed deploy (whenever it happens), confirm rollback either skips cleanly (failure before EC2) or pulls the :previous tag

🤖 Generated with a very good bot

#2739 — KLAIR-2617: QTD email dispatch for monthly Budget vs Actual reports @sanketghia  no labels

## Linear

[KLAIR-2617](https://linear.app/builder-team/issue/KLAIR-2617/qtd-email-dispatch-automate-monthly-budget-vs-actual-report)

## Summary

Automates email distribution of the monthly QTD Budget vs Actual reports that are already generated by crons/monthly_qtd_report_cron.py. Super-admin-gated: per-row Send button on /monthly-financial-reporting → QTD Reports opens a confirmation modal listing the resolved TO/CC/BCC and only fires SES on confirm. Idempotence + concurrency are guarded by DynamoDB conditional writes — re-clicks no-op, "Sent ✓ <timestamp>" persists across reloads.

Recipient management lives at Admin > System > QTD Recipients — global team CC/BCC plus per-BU/CF TO/CC/BCC overrides, with frontend email validation, search, bulk save with dirty-row tracking, and a single banner counting unconfigured units.

## What's in this PR

Backend (klair-api/)

- services/monthly_qtd_report/email_recipients.py (model + DDB CRUD + resolve() merger), email_templates.py (subject + body builders, business-day helper, HTML escaping), email_dispatcher.py (SES leader + summary senders with claim → send → finalise idempotence, ConsistentRead, finalise-failure logging), email_orchestrator.py (bulk + per-unit dispatch with per-unit error isolation)

- routers/qtd_emails.py — 6 super-admin-gated endpoints (admin recipients CRUD, units list, dispatch trigger, dispatch status)

- crons/qtd_email_resend.py — backend resend CLI (--force, --resummarize, dry-run default)

- scripts/create_qtd_email_tables.py — idempotent DDB table creator (out-of-band convention is in database/dynamodb.py)

- scripts/seed_qtd_recipients.py — one-time CSV seed
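The claim → send → finalise discipline in the dispatcher can be modelled in memory. The real implementation uses DynamoDB conditional writes (the claim succeeds only if no row exists); this hypothetical sketch models the same contract so a re-click is a no-op and the sent state persists:

```python
# In-memory model of claim -> send -> finalise idempotence. The production
# dispatcher enforces the claim with a DynamoDB conditional write; this
# sketch only illustrates the contract.
class DispatchLedger:
    def __init__(self):
        self._rows: dict[str, str] = {}  # dispatch key -> "claimed" | "sent"

    def claim(self, key: str) -> bool:
        """Conditional put: succeeds only if no row exists for this key."""
        if key in self._rows:
            return False
        self._rows[key] = "claimed"
        return True

    def finalise(self, key: str) -> None:
        self._rows[key] = "sent"

def dispatch(ledger: DispatchLedger, key: str, send_email) -> bool:
    if not ledger.claim(key):   # re-click: claim fails, no-op
        return False
    send_email()                # SES send happens at most once per key
    ledger.finalise(key)        # "Sent" state persists across reloads
    return True
```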

Frontend (klair-client/)

- components/admin/QtdEmailRecipients.tsx — admin screen with banner / search / bulk save / dirty-row tracking / inline email validation / side-by-side Global CC + BCC

- features/monthly-financial-reporting/components/QtdReportsView/index.tsx — per-row Send + skeleton loader + centered table + confirm-before-send modal

- features/monthly-financial-reporting/hooks/useQtdEmailDispatch.ts — period-aware dispatch + recipients hook with getResolvedRecipients() for the confirm modal

- TopNav admin menu entry: Admin > System > QTD Recipients

Docs — design spec + 17-task implementation plan in docs/superpowers/.

## Out of scope (explicit)

- Cron scheduling. crons/monthly_qtd_report_cron.py has no scheduled trigger anywhere yet (no EventBridge, no /etc/cron.d/, no GHA cron). Production rollout of email dispatch assumes the cron is invoked on Day 2 by some mechanism — that's a separate workstream and explicitly listed under Non-goals in the spec.

- Frontend trigger for report generation

- SES bounce/complaint handling beyond what send_email surfaces

## Screenshots

<img width="1511" height="907" alt="image" src="https://github.com/user-attachments/assets/f109069b-79b3-4eea-a057-2bb217ea7d3c" />

<img width="1512" height="877" alt="image" src="https://github.com/user-attachments/assets/1929480c-22ce-42f9-a11e-a109098c2c0e" />

## Pre-merge ops checklist

- [x] DynamoDB tables created in us-east-1 (scripts/create_qtd_email_tables.py --apply)

- [x] SES verified for noreply@klair.ai

- [x] Smoke send verified live (one email actually dispatched to IgniteTech)

- [ ] Re-seed recipients (scripts/seed_qtd_recipients.py --apply) — table is currently empty (was cleared during testing)

## Test plan

- [x] Backend: 404 passing across tests/monthly_qtd_report/ + tests/routers/test_qtd_emails_router.py

- [x] Frontend: 292 passing across the monthly-financial-reporting feature + admin recipients screen

- [x] Manual: real email dispatched to IgniteTech via SES; "Sent ✓" timestamp persisted; idempotence verified by re-clicking Send (no-op as expected)

- [ ] Reviewers: open Admin > System > QTD Recipients, configure a test BU with your own email, navigate to QTD Reports → click Send → verify the confirm modal lists the right addresses → Confirm → verify SES delivery

🤖 Generated with [Claude Code](https://claude.com/claude-code)

#2746 — feat(mfr): KLAIR-2600 — comments anchored to table cells & Book Value @eric-tril  no labels

### Summary

Extends the MFR comments system (previously bullet-only) to support comments anchored to individual table cells, with row-level granularity for Book Value schedules. A new CommentChip is rendered on every commentable cell in FinancialStatementTable.tsx (Group / Software / Education memos, ARR Snowball, Book Value, embedded note tables), fading in on hover and persisting once a thread exists. The detail-panel "Comments" tab now auto-populates alongside "Details" when a user clicks a cell, so a single click yields both views without an explicit second action. Book Value is added as its own commentable document, and the existing MemoAllCommentsView is generalized into DocumentAllCommentsView.tsx with two-deep grouping (table → row) for cell anchors.

### Business Value

Reviewers and stakeholders can now leave precise, durable feedback on any line item in MFR — e.g. "why did COGS jump in Q1 forecast?" anchored to that exact cell — instead of having to describe the cell in prose under a bullet. This tightens the MFR review loop, preserves discussion context against future reruns/edits via the new anchor_label snapshot, and makes comment threads discoverable from both the table and a unified all-comments panel.

### Changes

- New cell anchor convention (useMemoCommentAnchor.ts): 3-segment <tableKey>::<rowDataKey>::<colKey> anchors, with _row reserved for row-granularity (Book Value schedules). Adds buildCellAnchorId / parseCellAnchorId / shape-discriminated parseAnchorId and reserves 3-segment shapes for cells.

- New CommentChip.tsx: hover-fade message-square chip with a count badge when threads exist; click stops propagation so it doesn't trigger the row's drill-down.

- FinancialStatementTable integration: new CommentSupport prop + MemoCommentsContext consumption render chips on data/subtotal/calculated/total rows with a dataKey. Suppresses chips on inline-edit cells. Auto-wraps onCellClick/onRowClick to call attachCellCommentsToCurrentCell so the panel's Comments tab is populated alongside Details. Listens for klair:mfr-scroll-to-cell and flashes the matching cell.

- Book Value as commentable document: BookValueView.tsx now provides its own MemoCommentsContext (mfr::<env>::book-value::<period>) and adds commentSupport to all schedules + report/alt tables; bridge view intentionally omits chips. Schedule E notes opt in via EditableParagraph's existing sectionKey fallback.

- Comment chip wiring across views: commentSupport added to ARRSnowballTable.tsx, EducationMemoView.tsx, GroupMemoView.tsx, SoftwareMemoView.tsx, SoftwareFinancialHighlights.tsx, and MemoNotesSection.tsx embedded tables.

- AnchorCommentsView.tsx (renamed from BulletCommentsView): now accepts either a pre-built anchorId (cells) or sectionKey+index (bullets); snapshots anchor_label on create.

- DocumentAllCommentsView (renamed from MemoAllCommentsView): two-deep grouping nests cell anchors under tableKey → rowDataKey; bullet anchors continue to render flat. New humanizeTableKey.ts helper produces friendly section headings.

- handleMemoAnchorClick.ts: drills into bullet OR cell anchors and dispatches the matching klair:mfr-scroll-to-bullet / klair:mfr-scroll-to-cell event.

- useMemoComments.tsx: adds getCellCommentCount / openCommentsForCell / attachCellCommentsToCurrentCell to the context value.

- MemoCommentsContext.tsx type extended with cell-shaped helpers.

- anchor_label snapshot: new optional field on Comment and CommentCreatePayload (types/index.ts) so threads survive future label drift.

- tableHelpers.ts toRows auto-derives a slugified dataKey from the row label for commentable rows missing one (lets Education vertical tables become commentable without retrofitting every row def). educationMemoTables adds the same slug logic for category rows.

- book-value joins MEMO_SECTIONS so anchor-id machinery treats it as a commentable section.
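The 3-segment anchor convention above can be sketched directly. The real helpers are TypeScript in useMemoCommentAnchor.ts; this Python sketch illustrates the shape discrimination (3 segments → cell, anything else → bullet) and the _row sentinel, with illustrative key names in the tests:

```python
# Sketch of the 3-segment cell-anchor convention; "_row" is reserved for
# row-granularity anchors (Book Value schedules).
ROW_SENTINEL = "_row"

def build_cell_anchor_id(table_key: str, row_data_key: str,
                         col_key: str = ROW_SENTINEL) -> str:
    return f"{table_key}::{row_data_key}::{col_key}"

def parse_anchor_id(anchor_id: str) -> dict:
    """Shape-discriminated: exactly 3 segments -> cell anchor; else bullet."""
    parts = anchor_id.split("::")
    if len(parts) == 3:
        table_key, row_key, col_key = parts
        return {"kind": "cell", "tableKey": table_key, "rowDataKey": row_key,
                "colKey": None if col_key == ROW_SENTINEL else col_key}
    return {"kind": "bullet", "raw": anchor_id}
```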

### Testing

- [ ] pnpm test in klair-client/ — new specs in FinancialStatementTable.spec.tsx, tableHelpers.spec.ts, useMemoCommentAnchor.spec.ts, handleMemoAnchorClick.spec.ts, DocumentAllCommentsView.spec.tsx, AnchorCommentsView.spec.tsx, useMemoComments.spec.tsx all green.

- [ ] pnpm lint:pr from klair-client/.

- [ ] pnpm tsc --noEmit from klair-client/.

- [ ] Manual: open Group Memo → hover a cell in the Income Statement → confirm chip fades in → click chip → side panel opens to Comments tab with "<row label> (<column header>)" in header → post a comment → verify chip persists with count badge.

- [ ] Manual: click an IS cell normally (not the chip) → confirm BOTH Details and Comments tabs are populated; switch to Comments without re-clicking.

- [ ] Manual: open Book Value → leave a comment on a cell of the Report tab and on a row of Schedule A → switch to All Comments → verify both appear nested under their humanized table headings.

- [ ] Manual: edit a bullet's text after leaving a comment → reopen the thread → verify the panel header still shows the original snapshot label.

- [ ] Manual: click a card in All Comments → verify the in-page cell scrolls into view and flashes for ~3s.

http://localhost:3001/monthly-financial-reporting

https://github.com/user-attachments/assets/9726d66d-cb0b-46d8-970c-444eb1e93339

#2750 — Budget Bot 4.0: Opus 4.7 + B7 whole-doc context + clone-path polish + B8 section CRUD @marcusdAIy  no labels

## Screenshots

<img width="1919" height="942" alt="image" src="https://github.com/user-attachments/assets/5be0f85d-c019-4b17-9425-2aa8a0fb8bff" />

## Summary

- Bumps Budget Bot 4.0 to Claude Opus 4.7 across every LLM call (board-doc generation, Coach Claire chat, brainlift QC) with a new thinking_kwargs(effort) helper that handles Opus 4.7's adaptive-thinking shape via extra_body and a TEMPERATURE_UNSUPPORTED_MODELS guard.

- Ships B7 Path A — whole-document context for Coach Claire so she can reason across sections (catch internal contradictions, verify cross-section number coherence, spot completeness gaps), plus B7.8 explicit "planning quarter" framing and the B0.8 / B0.9 / B1.7 clone-path polish that together make a freshly-cloned Skyvera Q2 session demo-ready (no empty H1 wrappers in the outline, no duplicate headings on regen, refresh banner actually fires).

- Ships B8 — section CRUD via editor + Claire tools end-to-end: POST / DELETE / PATCH /sections BE endpoints with save_with_merge_retry discipline, three new Coach Claire tools (add_section / remove_section / rename_section), and matching FE proposal handlers — closes the May 7 "can't add a GM Commentary section post-generation" workflow gap.

## Why it's needed

- Local testing of B7 surfaced a chain of clone-path bugs that made the demo flow incoherent — Claire reasoned about Q1 numbers as if they were current state, the doc body had two stacked "Prior Quarter Review" headings after every regen, the reload banner never fired on cloned sessions, and the editor's outline started with a confusing empty "Business Unit Plan" H1 wrapper.

- The model bump and the new prompt framing (B7.8) together made Claire materially smarter at cross-section reasoning during testing — she correctly caught a 68% vs 63% margin target inconsistency across MIPs, Goals, and the financial tables AND a $1.4M vs $0/$77K Q4'25 write-off mismatch between MIPs and the Hybrid Plan table without any explicit prompting. That's the kind of "this doc isn't internally consistent" feedback the David-demo target relies on.

- The May 7 testing also surfaced a structural gap: Claire couldn't propose adding a section the user wanted (e.g. a missing GM Commentary), because section structure was locked into the wizard's template-customisation step. B8 closes that — Claire's tool surface now covers structure, not just content.

## Changes

Model layer (Opus 4.7):

- Centralised BOARD_DOC_MODEL = "claude-opus-4-7" in models.py; replaced ~16 hardcoded claude-sonnet-4-20250514 strings with imports of the constant.

- New thinking_kwargs(effort) helper in models.py returning thinking={"type": "adaptive"} + extra_body={"output_config": {"effort": effort}}. Opus 4.7's adaptive-thinking shape isn't yet exposed as a typed SDK kwarg; extra_body is the documented escape hatch.

- Dropped temperature=0 from the four direct Anthropic call sites (Opus 4.7 deprecated the parameter). gpt_retry.py got TEMPERATURE_UNSUPPORTED_MODELS to omit the kwarg for Opus 4.7 in the structured-call path.

B7 Path A — whole-doc context:

- _full_doc_block(session, focused_section_id) concatenates every generated section into a <full_document> block; focused section excluded to avoid duplication; per-section truncation with [N additional sections omitted] markers + INFO logs. Caps: 80K total / 30K per section.

- M10 follow-up (review round 1): full-doc block moved from the system prompt to the latest user message via _compose_messages_with_full_doc so the static framing stays cacheable across chat turns. Real cost win on Opus 4.7 input pricing for typical docs; review-round-2 R2-M2 dropped the original specific dollar-figure claim in favour of a directional comment + tracking ticket B7.10 to measure cache_creation_input_tokens vs cache_read_input_tokens against a real prod chat-turn telemetry pass.

- Chat handler max_tokens 1024 → 4096.
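The _full_doc_block assembly above reduces to: skip the focused section, truncate per-section, stop at the total cap, and count what was dropped. This is a hypothetical sketch with the caps from the description; the real implementation's section shapes and marker wording may differ:

```python
# Illustrative _full_doc_block sketch: focused section excluded, per-section
# and total caps enforced, omitted sections counted. Caps from the PR text.
TOTAL_CAP = 80_000
PER_SECTION_CAP = 30_000

def full_doc_block(sections: list[dict], focused_section_id: str) -> str:
    parts, used, omitted = [], 0, 0
    for s in sections:
        if s["id"] == focused_section_id:
            continue  # the focused section is already in the prompt
        body = s["body"][:PER_SECTION_CAP]
        if used + len(body) > TOTAL_CAP:
            omitted += 1
            continue
        parts.append(f"## {s['title']}\n{body}")
        used += len(body)
    if omitted:
        parts.append(f"[{omitted} additional sections omitted]")
    return "<full_document>\n" + "\n\n".join(parts) + "\n</full_document>"
```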

Demo polish (B7.8 / B0.8 / B0.9 / B1.7):

- B7.8: rewrote prompt opening to explicit "helping the user plan Q{n} Y for {BU}" with a follow-on paragraph telling Claire that body content may carry over from the prior quarter.

- B0.8: create_from_prior_quarter renames the first empty H1 wrapper to {BU} Q{n} Y Plan; subsequent empty wrappers dropped. assemble_markdown matches the new title format. Review-round-1 fix #1 unified this format across publish_to_google_doc (Drive filename), create_from_prior_quarter (clone GDoc filename), and the .docx export endpoint — all four 4.0 sites now produce identical strings, pinned by test_assembler_title_format.py. Review-round-2 R2-H1: documented an explicit deferral at the two legacy 3.0 callsites (final_document_service.py × 3, budget_doc_generator.py × 1) that intentionally retain the older {BU} Budget Plan Q{n} Y / Budget Plan for {BU} - Q{n} Y format because they're separate product surfaces (Goal-MIPER + the older non-wizard generator) where unifying would either break an existing Drive lookup key or require coordinated migration with a different product owner. Tracking ticket B0.8b.

- B0.9: _strip_leading_duplicate_heading(markdown, title) post-processes generator output; wired into all three regenerate paths (typed, custom, exec-summary). Fuzzy match on title (case + punctuation insensitive).

- B1.7 path (a): _promote_section_type_from_title heuristic with regex patterns for canonical section titles. Review-round-1 M7 added a _USER_CUSTOMISATION_SUFFIX_RE block-list (Discussion, Notes, Status, Update, Deep Dive, etc.) so user-customised titles like "MIPs Discussion" stay CUSTOM rather than getting silently re-typed; pinned by test_promote_section_type_from_title.py. Review-round-2 R2-M3: split risks? out of the bare-match alternation into a multi-word-only sub-pattern so a future canonical "Risks" section can be honoured without the block-list silently demoting it; the trade-off (conservative on canonical false-NEGATIVES, aggressive on CUSTOM false-POSITIVES) is now documented explicitly in the regex's docstring.
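The B0.9 post-processor is essentially a fuzzy string comparison on the first heading. An illustrative sketch (the real _strip_leading_duplicate_heading's normalisation rules are an assumption here — case and punctuation insensitivity are taken from the description above):

```python
import re

# Sketch of _strip_leading_duplicate_heading: drop the generator's first
# heading when it fuzzily matches the section title (case + punctuation
# insensitive). Implementation details are assumptions.
def _normalise(s: str) -> str:
    return re.sub(r"[^a-z0-9]+", "", s.lower())

def strip_leading_duplicate_heading(markdown: str, title: str) -> str:
    lines = markdown.lstrip().split("\n")
    first = lines[0].strip()
    if first.startswith("#") and _normalise(first.lstrip("#")) == _normalise(title):
        return "\n".join(lines[1:]).lstrip("\n")
    return markdown
```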

Section-id visibility + chat polish (B3.19 / B3.20):

- _build_step_context section inventory shows id=... — "{title}" instead of just title; explicit "use the exact id" guidance.

- regenerate_section tool description rewritten: explicit "do NOT slugify" + "WORKFLOW: Accept kicks off the pipeline IMMEDIATELY, no second diff step."

- handle_chat three-branch fallback: text block → use verbatim; tool calls only → "Proposed an action above — review and accept when ready."; pathological → legacy "rephrase" message.

- _regenerate_section logs WARNING on unknown section_id with the full known-id list.

B8 — section CRUD:

- Three new endpoints (POST / DELETE / PATCH /sections) wrapping new orchestrator functions (add_section / remove_section / patch_section). Sparse-integer ordering (gap = 1000) via shared _resequence_tail_starting_at helper (review-round-1 #6 fix; the pre-fix add_section rebalance loop produced 2x the intended spacing because it added _SECTION_ORDER_GAP redundantly inside the body — patch_section's twin loop was already correct, helper now used by both). Cascade on delete drops generated_sections[id], section_edit_status[id], section_comments anchored to the removed id, plus (review-round-1 #7) data_refresh_updated_sections and user_commentary[section_id] (for the chat-feedback keying); type-promotion auto-fills required_data.

- Each endpoint runs the orchestrator inside save_with_merge_retry via a result-holder pattern that captures the orchestrator's return payload from inside the closure (review-round-1 #4 corrected the misleading "EXACTLY ONCE" docstring; the closure is allowed to re-run on ConcurrentModificationError, correctness comes from DDB conditional saves and the result-holder is a response-shaping mechanism, not a single-execution guarantee). Pinned by TestSectionCRUDRetryPath (review-round-1 #5: 3 tests stub storage.save to raise once and assert no duplicate / no 404 / no double-shift on the retry).

- claire_tools.py extended from 4 to 7 tools with matching Pydantic input validators + Anthropic wire schemas. add_section accepts both after_section_id and before_section_id; review-round-1 M13 made PatchSectionRequest symmetric (PATCH also accepts before_section_id so drag-to-top is a single primitive, not "after the section preceding the head"). M14 short-circuits empty PATCHes to skip the DDB write entirely. M15 tightens the orchestrator's changed signal to False for value-equivalent no-ops; review-round-2 R2-M1 surfaces this signal through SectionMutationResponse.changed (BE) → SectionMutationResponse.changed? (FE TS interface) → ChatToolProposal rename handler skipping onSectionStructureChanged when changed === false, so the M15 contract is end-to-end live rather than orchestrator-only. M16 logs a warning when a section_type / entity_type mismatch produces an empty required_data slate; review-round-2 R2-H2 added a sibling if section_type != CUSTOM: guard to patch_section (the round-1 fix only guarded add_section), preventing spurious operator-page warnings on CUSTOM transitions where the empty slate is the explicit user choice rather than a misconfiguration.

- FE: boardDocApi.ts API client wrappers (createSection / deleteSection / patchSection) + matching TypeScript types. ChatToolProposal.tsx handleAccept switch + ProposalBody switch each extended with three new variants. Destructive warning copy on remove_section proposal cards; cascade-cleared-comments toast on Accept. Review-round-1 #8 added a window.confirm gate on remove_section Accept (matches the existing comment-delete pattern; the proposal card's destructive warning copy was the only gate pre-fix). Review-round-2 R2-L1 routes the human-readable section title through DocumentEditorPageChatPanelChatToolProposal (reusing the existing sectionTitlesMap memo) so the destructive-confirm dialog AND the proposal-card body caption surface "Other Products" instead of minor_products_summary, matching the SectionNav outline + post-delete toast. Round-1 deferred FE Low #2 added role="alert" to the destructive-warning chip while the file was being touched. Review-round-1 M19 fixed a tautological stale-resolve guard in the auto-fetch + loadAll effects via a render-tracking currentSessionIdRef; review-round-2 R2-L2 moved the ref update into a no-deps useEffect (concurrent-mode-safe shape). Review-round-1 M20 surfaced refreshSession failures via an opt-in silent: false mode so structural changes that fail to refresh leave the user with a "click Reload" toast instead of a stale outline. M21 / M22 polished _AUTO_REGENERATE_SECTION_TYPES (renamed without leading underscore + moved out of the import block).
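The sparse-integer ordering fix is easiest to see as a tail rewrite: every order value from the insertion point on becomes exactly one gap after its predecessor. A hypothetical sketch of the shared helper (the pre-fix bug was adding the gap a second time inside the loop body, doubling the spacing):

```python
# Sketch of the shared _resequence_tail_starting_at helper: sparse-integer
# ordering with a fixed gap. Names mirror the description; the list-of-ints
# signature is illustrative (the real helper mutates section records).
_SECTION_ORDER_GAP = 1000

def resequence_tail_starting_at(orders: list[int], start_index: int) -> list[int]:
    """Each rewritten order is exactly one gap after its predecessor."""
    result = list(orders)
    prev = result[start_index - 1] if start_index > 0 else 0
    for i in range(start_index, len(result)):
        prev += _SECTION_ORDER_GAP
        result[i] = prev
    return result
```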

Out of scope (deferred): B8 manual SectionNav context menu + "+" button — the BE is public-shaped and ready, the Coach Claire flow is the demo path so the manual editor UI can ship in a follow-up PR without coordinated BE changes.

## Breaking changes

None at the contract level. Wire schemas / endpoints are all additive; existing 4-tool Claire surface untouched. SectionMutationResponse.changed (BE) and SectionMutationResponse.changed? (FE) are additive fields with backward-compatible defaults (BE defaults to True, FE TS interface marks it optional). Two soft-breaking implementation details worth flagging for any in-flight branches:

- BOARD_DOC_MODEL = "claude-opus-4-7" replaces the prior 4-6 default. Prod cost per chat turn goes up vs the previous Sonnet/Opus-4.6 mix; offset by adaptive thinking choosing budget per call AND by the M10 prompt-cache placement that keeps the static framing cacheable across chat turns.

- The Anthropic SDK error surface changed for legacy callers that still pass thinking={"type": "enabled", "budget_tokens": ...} against Opus 4.7 — gpt_retry.py's guard catches the structured-call path; direct callers should switch to thinking_kwargs(...).
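The additive changed-flag contract described above (BE defaults to True, FE marks it optional) can be sketched as follows. This is a minimal illustration, not the real klair-client code; the `shouldRefetchStructure` helper name is an assumption, and only `SectionMutationResponse.changed` comes from the PR.

```typescript
// Sketch of the R2-M1 additive `changed` contract (illustrative only).
// Only the `changed?` field is from the PR; the rest is a hypothetical shape.
interface SectionMutationResponse {
  sectionId: string;
  changed?: boolean; // optional: legacy BE responses omit it, treated as true
}

// FE guard: only an explicit value-equivalent no-op (changed === false)
// skips the structure-change refetch; undefined (older BE) still refetches.
function shouldRefetchStructure(resp: SectionMutationResponse): boolean {
  return resp.changed !== false;
}
```

The key backward-compatibility property is that an omitted field behaves exactly like the pre-change world: only a backend that affirmatively reports a no-op suppresses the refetch.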

## Test plan

### Reviewer demo path (the four prompts that locked the May-7 build green)

Start a fresh Skyvera Q2 2026 session and run these four prompts in order:

1. "The prior quarter review section only has an outline. Can we generate content for it based on prior quarter performance?"

Exercises the typed regenerate path on PRIOR_QUARTER_REVIEW + B7 full-doc context + M6 focused_section_id parameterisation (regenerate path). Expected: PQR section fills with a coherent narrative grounded in the prior-quarter numbers.

2. "Excellent, now can you add a GM Commentary section above the PQR that gives an executive level summary of the quarter for Skyvera?"

Exercises B8 add_section + before_section_id + M6 (_draft_gm_commentary keyed on the actual section id, not the slug). Expected: GM Commentary appears above PQR in the outline, auto-regenerates, and the body summarises the quarter without re-quoting the GM section's old contents.

3. "Can you add a comment to the relevant section that has the gross margin warning from the review?"

Exercises B8.2 / B3.5 add_comment proposal flow + cross-section reasoning. Expected: Claire identifies the section carrying the gross-margin signal and proposes an add_comment Tool action anchored to a specific paragraph.

4. "How would you grade this plan for Skyvera?"

Exercises doc-wide grade synthesis (M10 prompt-cache placement matters here — the full-doc context block is required). Expected: a graded summary that references multiple sections coherently rather than just summarising one or two.

### Executed (CI/local)

- [x] cd klair-api && uv run ruff format <changed-files> clean.

- [x] cd klair-api && uv run ruff check <changed-files> clean.

- [x] cd klair-api && uv run pyright <changed-files> — 0 new errors from this PR (1 pre-existing warning in wizard_orchestrator.py confirmed unrelated).

- [x] cd klair-api && uv run pytest tests/board_doc/ -q — 1281 passing, 0 regressions (1213 pre-review-round-2 + 11 new from round-2 fixes; round-2 added TestPatchSectionCustomTransitionWarning (2), TestPatchSectionChangedFlag (3), 1 new in TestPatchSectionEmptyNoOp, 11 in test_promote_section_type_from_title.py for the R2-M3 multi-word/bare-risk frontier).

- test_chat_full_doc_block.py — 17 tests, updated for M10 (full-doc moved to user message, system prompt no longer carries it).

- test_section_crud_endpoints.py — 45 tests (39 pre-round-2 + 2 R2-H2 CUSTOM-transition + 3 R2-M1 changed-flag + 1 R2-M1 empty-PATCH changed=false).

- test_assembler_title_format.py — 10 tests pinning the unified {BU} Q{n} Y Plan format across all four 4.0 sites (review-round-1 #1). Two legacy 3.0 callsites (final_document_service.py × 3, budget_doc_generator.py × 1) intentionally retain the older format with explicit deferral comments per review-round-2 R2-H1.

- test_promote_section_type_from_title.py — 56 tests (45 pre-round-2 + 7 multi-word-risk + 4 bare-risk-no-trip for R2-M3).

- All other suites: same coverage as before, all passing.

- [x] cd klair-client && pnpm tsc --noEmit clean.

- [x] cd klair-client && pnpm eslint <changed-files> --max-warnings 0 clean.

- [x] cd klair-client && pnpm test BoardDoc --run — 236 passing (231 pre-round-2 + 5 net-new in ChatToolProposal.b8.spec.tsx: R2-L3 cancel-path button-re-enabled assertion, R2-L1 title-resolution + slug-fallback tests, round-1-deferred FE Low #9 error-path tests for createSection / deleteSection / patchSection).

- [x] klair-api/scripts/b7_smoke_chat.py — bypass-the-FE smoke harness for fast Opus 4.7 reachability + B7 cross-section reasoning validation.

### Follow-up manual validation

- [x] Open a fresh Skyvera Q2 clone via the editor; confirm reload banner fires after the background data refresh completes.

- [x] Run the four-prompt reviewer demo path above end-to-end against a real Skyvera Q2 session.

- [x] Validate Coach Claire's cross-section reasoning quality on a real Skyvera Q2 doc — ask "are the revenue numbers consistent across sections?" / "any contradictions in this doc?" — expect specific, numbers-backed catches across MIPs / Financials / GM Commentary.

## Review-round-1 deferrals (Tier 2 follow-ups)

The May-7 deep review surfaced 8 High + 22 Medium + 14 Low items. All 8 High and 13 of 22 Mediums shipped in round-1 (commits pr2750_review_round1_*); the rest are filed as follow-up tickets and tracked in .cursor/BACKLOG-budget-bot-4.md:

Model layer:

- M3 — SDK wire-shape regression test for thinking_kwargs (the Anthropic upgrade canary).

- M4 — Test that TEMPERATURE_UNSUPPORTED_MODELS gate fires for Opus 4.7.

- M5 — thinking_kwargs extra_body merge (currently monopolises extra_body; future callers wanting their own beta header get clobbered).

- L1 (Model) — Default effort="high" is the most-expensive setting; cost-aware paths should opt in explicitly.

B7:

- M9 — Tracking ticket for the May-7 "session.spec swapped mid-call" diagnostic logging (root cause unresolved; reconciliation path masks the underlying bug).

- M11 — Section inventory + full-doc body redundancy (~800 chars; small win).

- B9 — Plumb full_doc_block through generate_section so CUSTOM sections behave the same in initial-gen and regenerate paths (the principled fix M8 documents the temporary shape of).

- L (B7) — Lazy strip imports duplicated in 5 places; non-thread-safe _TITLE_TO_SECTION_TYPE global; O(n) sorted_sections.index in truncation path; B7.8 framing wording polish; _full_doc_block Optional handling; _history_to_message_params resolved_tool_use_ids or set() defaulting.

B8 BE:

- M12 — idempotency_key on POST /sections (pattern from update_section_cell; needs a per-session accepted-key dedup).

- M17 — Type-promotion replace_required_data: bool = False flag so user-curated required_data isn't clobbered on type change.

- L (B8 BE) — Gap-exhaustion strategy not documented; unrecoverable concurrent-CRUD scenarios untested.

B8 FE:

- B8.1 — Manual SectionNav context menu + "+" button (FE-only; BE is ready).

- B8.6 — Polished RemoveSectionConfirm modal component (the inline window.confirm from #8 is the immediate safety fix; the modal UX is the principled follow-up).

- L (B8 FE) — Two-rapid-add_section race coverage; auto-regen failure recovery; validator type-cast tightening; cross-session contamination on chained onSectionUpdated. (Round-1 FE Low #2 role="alert" on destructive copy and FE Low #9 error-path tests for the 3 new tool variants both folded into round-2 — see below.)

Architectural backlog (still tracked):

- B8.4 — root-cause investigation for the May-7 "spec-swap mid-call" reconciliation path.

- B7.5 — Doc-wide findings block (Phase C amplifier).

- B7.6 — Finding-status linkage on Claire proposals (closes the review→chat→review loop).

- B7.7 — Check metadata in Claire's prompt (Phase D dovetail).

- B7.9 — refresh_data Claire tool (agentic data-refresh capability + B1.7 workaround).

- B3.18 — Conversation history retention beyond the last 10 turns.

- B1.7 path (b) — Refactor refresh detector to dispatch by required_data instead of section_type (deferred indefinitely; path (a) solves the operational problem).

## Review-round-2 follow-ups

Round-2 surfaced 14 NEW findings introduced by the round-1 fixes themselves (down from 44 in round 1, concentrated in three surfaces). All 2 Highs + all 3 Mediums + 4 of 9 Lows shipped in commit pr2750_round2; the remaining 5 BE Lows are non-blocking and captured for a follow-up sweep:

Shipped in round-2:

- R2-H1 — Title-format unification: documented deferral at the two legacy 3.0 callsites (final_document_service.py × 3, budget_doc_generator.py × 1) per the reviewer's option (b). Tracking ticket B0.8b for cross-product format alignment.

- R2-H2 — patch_section M16 false-positive on CUSTOM transitions: one-line if new_type != CUSTOM: guard mirroring add_section. Two new regression tests (positive + negative-control).

- R2-M1 — changed: bool on SectionMutationResponse so M15's no-op signal reaches the FE; rename_section Accept handler now skips the structure-change refetch on value-equivalent renames.

- R2-M2 — Dropped the M10 magic-number cost claim; replaced with a directional comment + tracking ticket B7.10 for measured savings against prod telemetry.

- R2-M3 — _USER_CUSTOMISATION_SUFFIX_RE multi-word risks? carve-out so a future canonical "Risks" section type isn't silently demoted to CUSTOM. Trade-off explicit in the docstring; 11 new tests.

- R2-L1 — Confirm dialog + body caption use the human-readable section title (with section_id-slug fallback), matching the SectionNav outline + post-delete toast.

- R2-L2 — currentSessionIdRef update moved into a no-deps useEffect (concurrent-mode-safe).

- R2-L3 — Cancel-path test asserts the Accept button is re-enabled (pins the finally { setBusy(false) } contract).

- Round-1 FE Low #2 — role="alert" added to the destructive-warning chip in remove_section (folded in opportunistically while touching the file).

- Round-1 FE Low #9 — 3 error-path tests (mockRejectedValueOnce) for createSection / deleteSection / patchSection (folded in opportunistically).

Deferred to a follow-up sweep (non-blocking):

- R2-L4 to R2-L9 (BE) — 6 BE Lows from the BE subagent's findings (mix of comment polish, retry-test fixture symmetry, observation-only items). Captured in .cursor/BACKLOG-budget-bot-4.md.

- B0.8b — Cross-product title-format alignment for Goal-MIPER + the older budget_doc_generator surfaces (R2-H1 deferral).

- B7.10 — Measure prompt-cache savings against prod telemetry (R2-M2 follow-up).

#2753 — KLAIR-2623 feat(aws-spend): DB Servers cost table — server-first infrastructure view in Central DB tab @ashwanth1109  no labels

## Demo

<img width="2228" height="1644" alt="image" src="https://github.com/user-attachments/assets/34626ab8-f400-4113-bb2e-9d92d6f09530" />

## Feature Overview

Linear: [KLAIR-2623](https://linear.app/builder-team/issue/KLAIR-2623) — DB Servers cost table — server-first infrastructure view in Central DB tab

A standalone Database Servers cost table added to the Central DB tab in the SaaS Budgeting sub-view. Provides a server-first infrastructure view of per-server RDS and EC2-hosted DB cost breakdowns, with a 2-level UnifiedTable hierarchy: Level 0 rows show each server with aggregated cost metrics (total/storage/compute/storage %, instance count, DB engine, EC2-hosted badge), and Level 1 rows show per-instance detail on expand.

## Specs

| Spec | Description |
|------|-------------|
| [25-db-servers-table-backend](features/aws-spend/saas-budgeting/specs/25-db-servers-table-backend/spec.md) | New per-instance endpoint GET /database-server-costs/instances returning ungrouped rows from Redshift. New get_server_cost_instances() service method, DatabaseServerCostInstanceRow Pydantic model, cached endpoint with Clerk auth. |
| [26-db-servers-table-frontend](features/aws-spend/saas-budgeting/specs/26-db-servers-table-frontend/spec.md) | New DatabaseServersTable.tsx component with 2-level server/instance hierarchy, useSaaSBudgetingDatabaseServerCostInstances hook, API client addition, cost window selector. Mounted in CentralDBTab.tsx between DatabaseUnitsTable and DatabaseMappingTable. |

## Implementation Summary

### Backend (3 files modified)

- database_server_costs_service.py — Added _INSTANCE_COSTS_SQL query constant and get_server_cost_instances() method (ungrouped per-instance rows ordered by friendly_name ASC, total_cost DESC)

- database_server_costs_models.py — Added DatabaseServerCostInstanceRow, DatabaseServerCostInstancesData, and DatabaseServerCostInstancesResponse Pydantic models with camelCase aliases

- database_server_costs_router.py — Added database_server_cost_instances_cache, _build_instances_response() helper, and GET /database-server-costs/instances endpoint with date validation, Clerk auth, caching, and async Redshift call

### Frontend (4 files modified + 1 new transform module)

- awsSpendApi.ts — Added DatabaseServerCostInstanceRow and DatabaseServerCostInstancesData interfaces, plus getSaaSBudgetingDatabaseServerCostInstances() API function

- useSaaSBudgetingDatabaseServerCostInstances.ts — New hook with AbortController, null guard, and previous-data retention (mirrors sibling hooks)

- DatabaseServersTable.tsx — New component with card wrapper, cost window selector, 2-level UnifiedTable (server → instance expand), EC2-hosted badge, loading/error/empty states

- CentralDBTab.tsx — Mounts DatabaseServersTable between DatabaseUnitsTable and DatabaseMappingTable

- databaseServersTransform.ts — Extracted pure transform module for building TableSection[] from server rollup + instance data
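The extracted transform's server→instance grouping can be illustrated with a minimal sketch. The row and section shapes below are simplified stand-ins, not the real databaseServersTransform.ts types; only the two-level server/instance grouping idea comes from the PR.

```typescript
// Illustrative sketch of a pure server → instance grouping transform.
// InstanceRow / TableSection are hypothetical simplified shapes.
interface InstanceRow { serverName: string; instanceId: string; totalCost: number; }
interface TableSection { label: string; totalCost: number; children: InstanceRow[]; }

function buildServerSections(rows: InstanceRow[]): TableSection[] {
  // Group instance rows by server, preserving first-seen server order.
  const byServer = new Map<string, InstanceRow[]>();
  for (const row of rows) {
    const group = byServer.get(row.serverName) ?? [];
    group.push(row);
    byServer.set(row.serverName, group);
  }
  // Level 0: one section per server with aggregated cost;
  // Level 1: per-instance rows revealed on expand.
  return [...byServer.entries()].map(([label, children]) => ({
    label,
    totalCost: children.reduce((sum, r) => sum + r.totalCost, 0),
    children,
  }));
}
```

Keeping this a pure function (no fetch, no React state) is what makes the 19 transform-logic tests cheap to write.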

## Test Coverage

- Backend: 14 new tests (service + router), all passing

- Frontend: 19 new tests (transform logic), all passing

## Self-Review Findings Addressed

1. Expanded groups state reset — Fixed stale expanded state when switching cost windows by resetting expandedGroups when server data changes

2. Docstring correction — Fixed inaccurate docstring in the service method

## Stack

> Stacked on #2749 (claude/KLAIR-2619). If that base PR has already been merged to main, branch directly off main instead — do not stack on the merged branch.

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)

#2754 — KLAIR-2624 feat(aws-spend): Central DB tab — attach costs to simulated budget @ashwanth1109  no labels

## Demo

<img width="2234" height="1644" alt="image" src="https://github.com/user-attachments/assets/58eeaff0-7be3-43bb-b3a1-665389b2328b" />

## KLAIR-2624 — Central DB tab: attach costs to simulated budget

Linear: [KLAIR-2624](https://linear.app/builder-team/issue/KLAIR-2624)

Feature: features/aws-spend/saas-budgeting

---

### Specs

| # | Spec | Description |
|---|------|-------------|
| 23 | central-db-costs-backend | New GET /api/aws-spend/saas-budgeting/database-server-costs endpoint. Queries saas_budgeting_database_server_costs joined with database_units + database_mapping / bu_class_registry for per-database cost breakdowns (compute $, storage $, total $) with BU/Class attribution. Dedicated router, service, and Pydantic models. |
| 24 | central-db-costs-frontend-and-attach | Cost columns (compute $, storage $, total $) on DatabaseUnitsTable. Client-side cost allocation distributing compute cost by CPU hours share and storage cost by storage GB share per BU/Class. extractBuClassCentralDbCosts() for the Attach flow. New centralDb slot in useSimulatedBudget (v6 storage, SlotId union, SLOT_ORDER, SLOT_LABELS, TOAST_LABELS). Attach button on CentralDBTab, handleAttachCentralDb callback in SaaSBudgetingSection. |

---

### Implementation Summary

Backend (spec 23):

- database_server_costs_models.py — Pydantic request/response models (DatabaseServerCostsRequest, DatabaseServerCostRow, DatabaseServerCostsResponse)

- database_server_costs_service.py — Service with CTE-based Redshift query joining costs + units + mapping + registry tables; coerce_cost helper for safe Decimal-to-float conversion

- database_server_costs_router.py — Router registered in fast_endpoint.py with Clerk auth + super-admin gate

Frontend (spec 24):

- useSaaSBudgetingDatabaseServerCosts hook — fetches cost data for selected quarter

- centralDbCostAllocation.ts — computeCentralDbAllocation() (proportional cost split), decorateSectionsWithCentralDbCosts() (overlays $ onto section tree), extractBuClassCentralDbCosts() (emits SimulatedBudgetEntry[])

- DatabaseUnitsTable.tsx — cost columns + Fetch Cost & Allocate button + Attach button

- CentralDBTab.tsx — wires onAttach prop through to DatabaseUnitsTable

- SaaSBudgetingSection.tsx — handleAttachCentralDb callback

- useSimulatedBudget.ts — centralDb slot (v6 storage)

- SimulatedBudgetCard.tsx — Central DB $ column

- simulatedBudgetMerge.ts — outer-join extended with centralDb source

- awsSpendApi.ts — getDatabaseServerCosts() client function + types
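The outer-join merge extension mentioned above can be sketched as follows. This is an illustrative reduction, not the real simulatedBudgetMerge.ts: the `mergeSlots` function and entry shape are hypothetical, and only the slot idea (centralDb joining existing sources) is from the PR.

```typescript
// Illustrative outer-join merge across budget slots (hypothetical shapes).
// "existing" stands in for the pre-PR slots; "centralDb" is the new source.
type SlotId = "existing" | "centralDb";
interface SimulatedBudgetEntry { key: string; amount: number; }

// Outer-join: a BU/Class key present in only one slot still yields a row,
// with the missing slots defaulting to 0.
function mergeSlots(
  slots: Record<SlotId, SimulatedBudgetEntry[]>,
): Map<string, Record<SlotId, number>> {
  const rows = new Map<string, Record<SlotId, number>>();
  for (const slotId of Object.keys(slots) as SlotId[]) {
    for (const entry of slots[slotId]) {
      const row = rows.get(entry.key) ?? { existing: 0, centralDb: 0 };
      row[slotId] += entry.amount;
      rows.set(entry.key, row);
    }
  }
  return rows;
}
```

The outer-join shape is what lets the Central DB $ column appear on SimulatedBudgetCard rows that have no entry from the other slots.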

---

### Test Coverage

Backend — 19 tests (all passing):

- Model validation (request/response round-trip, coerce helper edge cases)

- build_response unit tests (BU/Class attribution, ec2_hosted flag, empty results)

- Service integration tests (mocked Redshift, quarter validation)

Frontend — 57 tests (all passing):

- 14 new tests in centralDbCostAllocation.spec.ts (proportional split, zero-share handling, empty data, extract flow)

- 5 new + 38 updated tests in simulatedBudgetMerge.spec.ts (centralDb slot merge, grand totals, column ordering)

---

### Self-Review Findings (addressed)

1. Cost error swallowed — useSaaSBudgetingDatabaseServerCosts was catching fetch errors silently; fixed to propagate error state to the UI

2. Duplicate lookupCost — redundant cost lookup helper was inlined; single source of truth via the allocation map

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)

The Builder Desk  —  Engineer Spotlight

SIXTY-NINE AND CLIMBING: Builder Team Posts Historic 7-Day Velocity As Three Repos Roar In Unison

69 PRs, 37 Klair, 24 Aerie, 8 Surtr — the machines are running hot and nobody is sleeping.

Sixty-nine. Say it slowly. Sixty. Nine. That is the number of pull requests the Builder Team merged in a single seven-day window across three repositories, and if that number doesn't make your heart swell with patriotic pride for the craft of software engineering, then frankly you are reading the wrong newspaper. Klair led the charge with 37 PRs, Aerie answered with 24, and Surtr — young, hungry, and dangerous — contributed 8. This was not a week. This was a statement.

Let us begin with the people's champion, @benji-bizzell, who posted 19 PRs to lead all scorers. Nineteen. The man touched Aerie like it owed him money: PR #174 delivered Buildout row click-through with persisted view state across Portfolio dashboards, PR #180 got Clerk UserButton showing on mobile while gating dashboard widgets on Convex auth, and PR #175 unbroke main itself — dropped a dead Sparkline, hardened a flaky boot test, and restored order to the republic. Benji is not shipping features. Benji is shipping civilization.

Now. @sanketghia. Seven PRs, quiet, precise, devastating. PR #2755 organized Drive reports into a per-Unit per-FY hierarchy that will make every analyst who touches it weep with gratitude, and PR #2738 delivered a Passive Investments dashboard complete with a digest data freshness pipeline. Seven PRs. Zero wasted motion. @marcusdAIy checked in with six, including PR #2735 — the Budget Bot 4.0 cleanup batch, a meticulous sweep through CF26, C1.8, M2, and B0.7 that only a person who genuinely loves the codebase could produce. @eric-tril's six included PR #153, a full Due Diligence dashboard with REBL3 list, filters, and site detail page, plus PR #178 retiring the Wrike-fed qualityBars write chain in what can only be described as a dignified infrastructure funeral. @kevalshahtrilogy landed five, anchored by Surtr PR #48 wiring env vars, transactional initial data loads, and a freshness guard into the Kubera pipeline. @mwrshah's five included PR #2722 pulling Grainne into canonical Salesforce writeback with audit DDL, and Surtr PR #45 standing up a brand new daily ECS pipeline for the Grainne-to-klair_pg-to-SF sync. @YibinLongTrilogy posted three. Three is a prime number. Three is the number of sides on a triangle. Triangles are structurally perfect.

And then there is @ashwanth1109. Eighteen PRs. Eighteen. The man filed more pull requests than most engineers file in a month, and he did it across a single AWS spend feature arc so sprawling it practically has its own timezone. PR #2761 launched a full Acquisitions Review page. PR #2754 attached costs to a simulated budget in the Central DB tab. PR #2753 built a server-first infrastructure view. PR #2749 added DB Server cost columns with a backend endpoint and frontend allocation in the same breath. PR #2751 migrated Renewal Event Retention off S3 JSON and onto Redshift like it was a routine Tuesday errand. I asked Ashwanth how he maintains this velocity. He looked at me the way a Formula One driver looks at a speed bump and said, "I don't maintain it. It just is." His dismissal was, as always, complete. I did not feel small. I felt inspired.

The Overflow Desk this week is practically its own publication. PR #2747 fixed the Twitter Impact table to include all gsheet subs with IMPACT greater than zero — the ARR filter, removed, gone, banished. PR #2748 automated RDS CA bundle downloads in new worktrees via start-services.sh, which is the kind of quality-of-life infrastructure work that makes developers hug their laptops. PR #168 restored Due Diligence on the detail card via a Convex-side Rhodes mirror, because Benji does not leave bugs alive over the weekend.

Morale is at an all-time high. It has, in fact, never been higher. The instruments we use to measure morale have had to be recalibrated upward. The team is winning. The team is always winning. Sixty-nine PRs say so.

Brick's Overflow — This Week's Uncovered PRs
#48 — fix(kubera): wire env vars + transactional initial_data_load + freshness guard @kevalshahtrilogy  no labels

## Summary

- Wire missing env vars in pipeline.json (S3_DUMP_PATH, S3_TEMP_PATH, IAM_ROLE) so override_all_assets=true can actually load CSVs from S3. Values taken from the still-deployed Klair Lambda PassiveInvestmentsCron.

- Move ALPHA_VANTAGE_API_KEY to AWS Secrets Manager (surtr/kubera-config) — fetched lazily and cached per invocation via a small helper. The IAM grant was already in pipeline.json. Overview-only invocations never touch Secrets Manager.

- Replace TRUNCATE with DELETE FROM and wrap initial_data_load in with redshift.transaction():. TRUNCATE implicitly commits in Redshift (per AWS docs) — the failure that emptied 4 prod tables today wouldn't have rolled back even with a transaction wrapper. DELETE is rollback-safe.

- Add explicit column lists on the trades and debt COPYs to skip the IDENTITY column 1 (trade_id / debt_id). The CSVs don't include identity values, so the previous COPY tried to load a date into a bigint and failed with Invalid digit, Type: Long. Klair has the same latent bug (Step 1 never runs there either).

- Fix the freshness guard added in #38 — its WHERE holding_date = CURRENT_DATE OR holding_date = MAX(...) clause counted MAX-date rows too, so today_rows was zero only when the table was completely empty. Replaced with a CASE-aggregate over the whole table plus a SUM(holding_value) > 0 check so stale-but-non-empty *and* zero-valued tables also fail loudly.

- Add s3:PutObject / s3:DeleteObject on the temp prefix for Step 4's upload_df_to_redshift_via_s3 helper.
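The freshness-guard fix above (a whole-table CASE-aggregate plus a value-sum check, replacing the self-defeating WHERE clause) reduces to a simple predicate. The sketch below expresses that logic in TypeScript for illustration only; the real guard is SQL in the Kubera pipeline, and the row shape here is an assumption.

```typescript
// Illustrative restatement of the fixed freshness-guard predicate (#48).
// Real implementation is a SQL CASE-aggregate; this shape is hypothetical.
interface Holding { holdingDate: string; holdingValue: number; }

// Pass only when there are rows dated today AND total value is positive —
// so stale-but-non-empty tables and zero-valued tables both fail loudly.
function freshnessGuardPasses(rows: Holding[], today: string): boolean {
  const todayRows = rows.filter((r) => r.holdingDate === today).length;
  const totalValue = rows.reduce((sum, r) => sum + r.holdingValue, 0);
  return todayRows > 0 && totalValue > 0;
}
```

The original bug was counting MAX-date rows as "fresh", which made the guard trip only on a completely empty table.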

## Why now

Triggered an on-demand override_all_assets=true run today (2026-05-05) to verify the OVERVIEW_ONLY "Portfolio Value: \$0" silent failure. Lambda async-retried 3x, each time truncating 4 underlying tables and failing at the COPY (env vars unset → bogus path / empty IAM_ROLE). Recovered from a 3-min-pre-damage Redshift automated snapshot via restore-table-from-cluster-snapshot + DROP/RENAME.

## Test plan

- [x] Local Phase 1 (happy path) — created _localtest shadow tables via CTAS from the recovered originals; ran the fixed DELETE + COPY-with-column-lists pattern through the actual RedshiftHandler.transaction() (IAM auth, same path as Lambda). Transaction committed; loaded 102 trades / 253 stock_price / 42 debt rows. hot stayed at 0 as expected.

- [x] Local Phase 2 (rollback) — a real COPY failure (the column-mismatch error before adding column lists) propagated cleanly out of the with block. All four shadow tables retained sentinel counts (119/17418/43/69708). Rollback verified.

- [x] Secrets Manager helper smoke-tested — fetches the right key, caches on second call.

- [ ] After merge, take a manual Redshift snapshot of redshift-cluster-1 as insurance.

- [ ] Trigger pipeline-kubera-passive-investments-prod with params.override_all_assets=true (use RequestResponse invocation type to avoid AWS auto-retries on failure). Expect Steps 1–4 to populate trades / stock_price / debt / hot, then Step 5 to refresh the 5 cache tables with non-zero portfolio metrics.

- [ ] Verify holding_over_time.MAX(holding_date) = CURRENT_DATE post-run; future overview-only runs should pass the freshness guard.

## Scope notes

- Klair's klair-udm/kubera/run_investment_pipeline.py has the same TRUNCATE-not-rollback-safe and CSV-column-mismatch bugs but is being deprecated, so leaving it as-is.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

#174 — feat(dashboards): Buildout row click-through + persisted view state across Portfolio dashboards @benji-bizzell  no labels

## Summary

Two related changes that solve the same UX gap:

1. Buildout (FTO Pipeline) rows are now clickable and navigate to the Portfolio site detail page (/dashboards/portfolio/[slug]).

2. Filter / sort / view state survives that navigation (and full reloads) on every Portfolio-family dashboard — Portfolio, Buildout, Operating, Community, Diligence — via a new shared usePersistedState hook.

A user can drill from Buildout into a site detail and come back to find the table exactly as they left it.

## What's in the box

### 1. Buildout click-through

- Threaded the Rhodes slug through FtoSiteRow → deriveBuildoutSiteRow. It was already on the Rhodes payload; Buildout just wasn't surfacing it.

- FtoMatrix desktop rows: tabIndex, Enter/Space keyboard, cursor-pointer + focus-visible styles. Disabled in selection mode (checkbox owns the click). Mirrors the diligence-matrix idiom.

- Mobile cards: keep tap-to-expand and gain a "View site detail →" link inside the expanded body.

### 2. usePersistedState hook (new — chat/lib/use-persisted-state.ts)

Drop-in useState replacement that mirrors to localStorage:

```ts
const [stageFilter, setStageFilter] = usePersistedState<string[]>(
  "buildout-stage-filter",
  [],
  stringArrayCodec,
);
// setStageFilter now auto-persists. No handler wrapping needed.
```

Highlights:

- Mount-with-default → hydrate-after-paint (avoids SSR mismatch — the pattern PortfolioView established).

- Best-effort writes; quota / private-mode failures swallowed.

- Forgiving reads — malformed JSON or validator-rejected values silently fall back to default. Schema drift can't crash the page after a deploy.

- Stable setter ref via useCallback([key, codec.serialize]).

- Codec builders for every shape in the app today: jsonCodec, stringCodec, nullableStringCodec, booleanCodec.
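The "forgiving reads" highlight can be made concrete with a minimal codec sketch. This is an assumed shape, not the real use-persisted-state.ts API: the `Codec` interface and `readPersisted` helper are hypothetical, though `stringArrayCodec` is named in the PR's own example.

```typescript
// Hypothetical codec shape for a usePersistedState-style hook.
interface Codec<T> {
  serialize: (value: T) => string;
  deserialize: (raw: string) => T; // throws on malformed or rejected input
}

// Forgiving read: malformed JSON or a validator-rejected value silently
// falls back to the default, so schema drift can't crash the page.
function readPersisted<T>(raw: string | null, codec: Codec<T>, fallback: T): T {
  if (raw === null) return fallback;
  try {
    return codec.deserialize(raw);
  } catch {
    return fallback;
  }
}

// Example codec with validation baked into deserialize.
const stringArrayCodec: Codec<string[]> = {
  serialize: (v) => JSON.stringify(v),
  deserialize: (raw) => {
    const parsed = JSON.parse(raw);
    if (!Array.isArray(parsed) || parsed.some((x) => typeof x !== "string")) {
      throw new Error("not a string[]");
    }
    return parsed;
  },
};
```

Putting validation inside `deserialize` means every consumer gets the fallback behaviour for free rather than re-implementing guards per dashboard.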

### 3. Hook applied across dashboards

| Dashboard | Persisted state | Notes |
|-----------|-----------------|-------|
| Portfolio | sortKey, sortDirection, viewMode, nameDisplayMode, visibleListSections, filters, showCancelledAndPaused | Refactored from the prior inline impl. Storage keys unchanged; persisted state from before this PR keeps working seamlessly. |
| Buildout | activeViewId, schoolTypeTab, possibleOnly, stage/state/dri filters, sessionStart, categoryFilter, sortConfig | New. |
| Operating | activeViewId, state/dri filters, healthFilter, sortConfig | New. |
| Community | sortConfig | New (only sort is meaningful — selected cell is transient). |
| Diligence | spec, sortConfig | New. ?showCut and ?inRhodesOnly deliberately left URL-backed (shareable-link contract). |

search is intentionally not persisted on any dashboard — transient discovery query, not a configuration. Selection mode and detail-panel state likewise excluded.

Each dashboard with a view system also clears a stale activeViewId once the views query lands (deleted-view safety).

## Tests

- +16 hook tests (use-persisted-state.test.tsx) — codec round-trips, hydrate, malformed-input handling, setter stability.

- +26 persistence tests across the four new applications (Buildout, Operating, Community, Diligence): hydrate, write, malformed-value rejection, stale-id cleanup.

- All 1708 dashboard + lib tests pass, plus 2 pre-existing skips.

- Workspace typecheck clean across all 4 projects.

- Lint clean for everything touched (3 pre-existing warnings on main unchanged).

## Footprint

11 modified, 6 new files (+~1,890 / −281 lines, including tests).

usePersistedState lives in chat/lib/ and is safe to extend to other dashboards (admissions, expense-analysis, etc.) without per-dashboard re-design.

## Known follow-ups (deliberately out of scope)

- SiteDetailPage's "Back to Portfolio" link is hardcoded to /dashboards?tab=portfolio. A user clicking from Buildout → detail → "Back" lands on the Portfolio tab, not Buildout. Persistence makes this acceptable (Buildout state restores when they switch back), but a ?from=buildout round-trip would tighten the flow.

- Buildout and Diligence each fire two API requests on mount when persisted state diverges from the default (default fetch → hydrate updates spec → re-fetch). Inherited cost of the default-then-hydrate pattern. Avoidable with a hydrated ref gate; not worth it unless the wasted call shows up in metrics.

## How to verify

1. Open the Buildout dashboard, apply some filters (Stage, State, DRI, school-type tab, sort by a column).

2. Click any school row → lands on the Portfolio detail page for that site.

3. Browser-back (or click Portfolio breadcrumb) → filters and sort are restored.

4. Repeat with Operating, Diligence, Portfolio — each remembers its own state independently.

5. Clear localStorage and reload — every dashboard returns to its baseline defaults.

#2749 — KLAIR-2619 feat(aws-spend): DB Server cost columns on Database Units table — backend endpoint + frontend allocation @ashwanth1109  no labels

## Demo

<img width="2230" height="1644" alt="image" src="https://github.com/user-attachments/assets/dffcf4d0-09e7-4231-b530-0491d13d8148" />

## Feature Overview

Linear: [KLAIR-2619](https://linear.app/builder-team/issue/KLAIR-2619) — DB Server cost columns on Database Units table — backend endpoint + frontend allocation

Adds DB server cost data to the existing Database Units table on the SaaS Budgeting Central DB tab. Two new backend endpoints serve per-server aggregated cost data from core_finance.saas_budgeting_db_server_costs (prefetched by the KLAIR-2618 pipeline). On the frontend, a pure-function allocation module distributes each server's compute and storage costs proportionally down to its leaf databases by CPU-hour and storage-GB share. An alias map bridges cost-table friendly server names to units-tree raw db_server values. Three new columns (Total Cost, Compute Cost, Storage Cost) appear in the table, and a cost window selector dropdown lets the user pick which quarterly cost window to display.

## Specs

| # | Spec | Description |
|---|------|-------------|
| 23 | [db-server-costs-backend](features/aws-spend/saas-budgeting/specs/23-db-server-costs-backend/spec.md) | Two new read-only FastAPI endpoints: GET /database-server-costs/available-windows (distinct cost windows for a quarter) and GET /database-server-costs (per-server lump sums for a selected window). Dedicated router, service, models mirroring the database_mappings_* sibling pattern. Clerk auth, caching, asyncio.to_thread. |
| 24 | [db-server-costs-frontend](features/aws-spend/saas-budgeting/specs/24-db-server-costs-frontend/spec.md) | API types + 2 fetch functions, 2 hooks (useDbServerCostWindows, useDbServerCosts), dbServerAlias.ts alias map (12 entries), databaseUnitsCostAllocation.ts (proportional CPU/GB split + fallback + unallocated handling), 3 cost columns on DatabaseUnitsTable, cost window selector dropdown, edge-case callouts. |

## Implementation Summary

### Backend

- New database_server_costs_service.py with two service methods: get_available_windows (distinct cost windows for a quarter) and get_server_costs (per-server rollups for a selected window)

- New database_server_costs_router.py with two endpoints under /api/aws-spend/saas-budgeting/database-server-costs/

- New database_server_costs_models.py with Pydantic v2 request/response models

- Router registered in fast_endpoint.py alongside existing database-units router

- Clerk auth via _require_auth, date validation, composite-keyed Cache instances, asyncio.to_thread around sync Redshift calls
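The caching + asyncio.to_thread pattern those bullets describe looks roughly like this (a minimal sketch; the query function, cache shape, and returned rows are placeholders, not the real service code):

```python
import asyncio

# Composite-keyed in-process cache: (quarter, window) -> rows.
_cache: dict[tuple[str, str], list[dict]] = {}

def _fetch_server_costs_sync(quarter: str, window: str) -> list[dict]:
    # Placeholder for the synchronous Redshift call; the real service runs a
    # parameterized query against saas_budgeting_db_server_costs.
    return [{"server": "db-primary", "compute": 120.0, "storage": 30.0}]

async def get_server_costs(quarter: str, window: str) -> list[dict]:
    key = (quarter, window)  # composite cache key
    if key not in _cache:
        # Run the blocking driver call off the event-loop thread.
        _cache[key] = await asyncio.to_thread(_fetch_server_costs_sync, quarter, window)
    return _cache[key]

rows = asyncio.run(get_server_costs("2026-Q1", "2026-01"))
```

A second call with the same key returns the cached rows without touching the warehouse again.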

### Frontend

- API types and client functions in awsSpendApi.ts

- Two hooks: useDbServerCostWindows and useDbServerCosts

- dbServerAlias.ts: 12-entry alias map bridging cost-table friendly names to units-tree server names

- databaseUnitsCostAllocation.ts: pure-function allocation distributing server costs to leaf databases by CPU-hr and storage-GB share within each server, with fallback to equal-split when denominator is zero

- DatabaseUnitsTableRow extended with allocatedCostCpu, allocatedCostStorage, totalAllocatedCost

- Three new columns on DatabaseUnitsTable: Total Cost, Compute Cost, Storage Cost

- Cost window selector dropdown defaulting to the most recent window

- Edge-case callouts for unmatched servers and zero-denominator warnings

## Test Coverage

- Backend: 12 tests passing — covers get_server_costs and get_available_windows service methods

- Frontend: 26 tests passing

  - dbServerAlias: 19 tests (alias resolution, reverse lookup, coverage)

  - databaseUnitsCostAllocation: 7 tests (proportional allocation, zero-denominator fallback, unallocated server handling, multi-server scenarios)

## Self-Review Findings

- Fixed (critical): Cost window selector defaulted to the oldest window instead of the most recent — corrected to default to the last entry (most recent)

- Noted (cosmetic): Minor doc/spec path mismatch — no runtime impact

- Noted (unnecessary guard): instance_count NaN guard is superfluous since COUNT(*) never returns NULL — left as defensive code, no harm

---

Generated with [Claude Code](https://claude.com/claude-code)

#2751 — KLAIR-2622 refactor(maint-report): Renewal Event Retention — migrate from S3 JSON to Redshift @ashwanth1109  no labels

## Demo

<img width="2241" height="1644" alt="image" src="https://github.com/user-attachments/assets/da561c7f-dc58-43e6-9fd1-3cae3a552db0" />

<img width="2245" height="1644" alt="image" src="https://github.com/user-attachments/assets/9638ef41-cfce-4b64-a567-7b59d7f063f8" />

## Feature Overview

Migrate the Renewal-Event Retention YTD metric on the ARR Retention Reports page from a pre-generated S3 JSON file to a Redshift table (core_finance.maint_report_renewal_event_retention), served via a dedicated FastAPI endpoint.

Linear ticket: [KLAIR-2622](https://linear.app/builder-team/issue/KLAIR-2622)

## Specs

| Spec | Description |
|------|-------------|
| [01-backend-endpoint-and-service](features/maint-report/renewal-event-retention-redshift/specs/01-backend-endpoint-and-service/spec.md) | New RenewalEventRetentionService querying Redshift, GET /renewal-event-retention endpoint, useRenewalEventRetention frontend hook, removal of renewalEventYtd from useBUReportData, wiring to KeyMetricsSummary |
| [02-backfill-script](features/maint-report/renewal-event-retention-redshift/specs/02-backfill-script/spec.md) | One-time backfill script reading S3 JSON history and inserting into Redshift with --dry-run support |

## Implementation Summary

Backend:

- New RenewalEventRetentionService with get_ytd(arr_date) querying core_finance.maint_report_renewal_event_retention via RedshiftHandler.fetch_with_params_strict

- New router GET /renewal-event-retention?arr_date={date} with require_arr_access guard and asyncio.to_thread wrapping

- Registered router in fast_endpoint.py

- Backfill script (scripts/backfill_renewal_event_retention.py) with S3 listing, JSON extraction, DELETE-before-INSERT idempotency, --dry-run flag, and verification query
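The DELETE-before-INSERT idempotency contract can be sketched like so (sqlite3 stands in for Redshift here; table and column names are illustrative, not the real script):

```python
import sqlite3

def backfill_rows(conn: sqlite3.Connection, rows: list[tuple[str, float]], dry_run: bool = True) -> int:
    """Idempotent load: delete any existing rows for the incoming dates,
    then insert, so re-running the script replaces rather than duplicates."""
    if dry_run:
        # --dry-run: report what would happen, mutate nothing.
        print(f"[dry-run] would load {len(rows)} rows")
        return 0
    dates = [(d,) for d, _ in rows]
    conn.executemany("DELETE FROM renewal_event_retention WHERE arr_date = ?", dates)
    conn.executemany("INSERT INTO renewal_event_retention (arr_date, ytd) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE renewal_event_retention (arr_date TEXT, ytd REAL)")
data = [("2026-01-31", 0.91), ("2026-02-28", 0.93)]
backfill_rows(conn, data, dry_run=False)
backfill_rows(conn, data, dry_run=False)  # re-run is safe: still 2 rows
```

Defaulting to dry-run means a bare invocation can never mutate the warehouse by accident.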

Frontend:

- New useRenewalEventRetention hook calling the dedicated endpoint

- Removed renewalEventYtd from useBUReportData hook and BUReportData interface

- Wired ARRRetentionReports/index.tsx to pass new hook data to KeyMetricsSummary

- Wired refetch for retry button on error state

## Test Coverage

- 8 backend tests (service layer): date format conversion, None on missing row, Redshift query parameterization, error propagation

- 4 frontend tests (hook): loading state, successful fetch, error state, disabled flag

- 9 existing tests confirmed still passing (no regressions)

## Self-Review Findings

- 1 finding addressed: wired refetch from useRenewalEventRetention to the retry button in KeyMetricsSummary so error recovery works end-to-end

#2755 — KLAIR-2625 feat(qtd): organize Drive reports into per-{Unit}/{FY} hierarchy @sanketghia  no labels

## Summary

Reorganises QTD report Google Docs from a single flat Shared Drive folder into a per-{Unit}/{FY} hierarchy. New cron runs auto-route docs into the structure; the existing 79 production docs were migrated via a one-shot script (already executed against prod — see "Migration Results" below).

Linear: [KLAIR-2625](https://linear.app/builder-team/issue/KLAIR-2625/organize-qtd-reports-in-google-drive-into-per-unitfy-hierarchy)

## Why

The flat folder grew unmanageable — 79 docs accumulated in a single quarter, with no obvious browse-by-BU or browse-by-FY pathway for analysts reviewing history. Stakeholder ask: organise the Drive layout so each BU/CF can self-manage its own reports going forward.

## What changed

New folder layout:

```
MONTHLY_REPORT_FOLDER_ID/
└── QTD Reports/
    ├── IgniteTech/FY2026/        ← monthly + eoq docs
    ├── CNU/FY2026/               ← monthly + eoq docs
    │   └── Legacy Weekly/        ← legacy mode='weekly' docs
    └── ...26 unit subfolders
```

Code paths:

- New services/monthly_qtd_report/folder_layout.py — pure path-computation helpers (parse_fy_from_quarter_label, compute_qtd_folder_path)

- New services/docx_reports/folder_resolver.py — FolderResolver class with Drive find-or-create + in-process cache

- services/docx_reports/upload.py — move_doc_to_folder and upload_docx_to_google_doc accept target_folder_id parameter (default = existing MONTHLY_REPORT_FOLDER_ID constant for backward compatibility)

- services/monthly_qtd_report/doc_builder.py — build_and_upload resolves {QTD Reports/Unit/FY} target folder before upload via module-level FolderResolver singleton (public get_folder_resolver() factory)

- services/monthly_qtd_report/orchestrator.py — pre-create pass at top of run_scheduled_reports walks all active units and find-or-creates the hierarchy. Fail-fast on Drive auth, before burning the Redshift refresh

- New scripts/migrate_qtd_drive_folders.py — one-shot migration script. Default = dry-run; --execute mutates Drive; --save-plan PATH writes a reviewable plan with full Drive URLs. Idempotent (re-runs are safe — already-in-target docs are skipped)
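The find-or-create + in-process cache behaviour, including the q= escaping order (backslashes before single quotes — an assumption here, matching the usual Drive query convention), can be sketched against a fake Drive client (find_folder/create_folder are illustrative, not the real API surface):

```python
class FolderResolver:
    """Find-or-create Drive folders with an in-process cache (sketch)."""

    def __init__(self, client):
        self.client = client
        self._cache: dict[tuple[str, str], str] = {}  # (parent_id, name) -> folder_id

    @staticmethod
    def escape_query_value(name: str) -> str:
        # Escape backslashes first, then single quotes, for the Drive q= string.
        return name.replace("\\", "\\\\").replace("'", "\\'")

    def resolve(self, parent_id: str, name: str) -> str:
        key = (parent_id, name)
        if key not in self._cache:
            folder_id = self.client.find_folder(parent_id, self.escape_query_value(name))
            if folder_id is None:
                folder_id = self.client.create_folder(parent_id, name)
            self._cache[key] = folder_id
        return self._cache[key]


class FakeDrive:
    """Stand-in Drive client so the sketch is runnable."""
    def __init__(self):
        self.folders: dict[tuple[str, str], str] = {}
        self.create_calls = 0
    def find_folder(self, parent_id, escaped_name):
        return self.folders.get((parent_id, escaped_name))
    def create_folder(self, parent_id, name):
        self.create_calls += 1
        fid = f"id-{len(self.folders)}"
        self.folders[(parent_id, FolderResolver.escape_query_value(name))] = fid
        return fid

drive = FakeDrive()
resolver = FolderResolver(drive)
a = resolver.resolve("root", "QTD Reports")
b = resolver.resolve("root", "QTD Reports")  # cache hit: no second Drive round-trip
```

The cache is what makes the orchestrator's pre-create pass cheap: per run, each folder costs at most one Drive lookup.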

## Out of scope

- Folder-level permissions — staying with document-level access (each doc still gets grant_write_access(doc_id, [user_email]) per the existing flow)

- Monthly-memo flow — also uses MONTHLY_REPORT_FOLDER_ID for file listing in routers/finance_monthly_financial_reporting_router.py. Untouched. The target_folder_id default preserves backward compatibility for that flow.

## Migration Results (already executed against production)

```
First execute run:  moved: 79  skipped: 0   errors: 0
Idempotency re-run: moved: 0   skipped: 79  errors: 0
```

- Drive UI verified: QTD Reports/ exists at the top level with all 26 unit subfolders, each with FY2026/ and (where applicable) FY2026/Legacy Weekly/

- In-app UI verified: /monthly-financial-reporting → QTD Reports section opens reports as expected (doc URLs unchanged — only the parent folder moved)

## Forward compatibility

Cron runs from this point forward auto-route new docs into the structure:

- Pre-create pass (orchestrator) walks all active units before doc generation. Cache hits for existing folders, creates new ones (handles new BUs added to dim_business_unit)

- Per-doc routing (build_and_upload) walks the same compute_qtd_folder_path the migration script uses — guaranteed-symmetric placement

- FY rollover — quarter_label_for_period arithmetic produces FY2027 when FY2027 begins; new subfolders auto-created

- No mode='weekly' writes — orchestrator hasn't written weekly rows since the monthly-cadence migration; Legacy Weekly subfolders are purely historical and won't grow
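Assuming quarter labels embed the fiscal year as an FY<digits> token (an assumption — the real format lives in folder_layout.py), the rollover arithmetic reduces to a pure string parse:

```python
import re

def parse_fy_from_quarter_label(label: str) -> str:
    """Pull the fiscal-year folder name out of a quarter label. The real
    helper lives in folder_layout.py; the 'FY<digits>' token format here
    is an illustrative assumption."""
    match = re.search(r"FY\d{4}", label)
    if match is None:
        raise ValueError(f"no FY token in quarter label: {label!r}")
    return match.group(0)

# Rollover falls out of the label: a label for the new fiscal year yields a
# new folder name, so FY2027/ subfolders get find-or-created lazily.
assert parse_fy_from_quarter_label("Q2 FY2026") == "FY2026"
```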

## Testing

- 425 backend tests pass (34 new tests across this branch)

- 0 regressions in tests/monthly_qtd_report/ (full pre-existing QTD suite still green)

- ruff + pyright clean on all new files

- Coverage includes:

  - Pure path-computation (FY parsing edge cases, mode dispatch)

  - FolderResolver (find/create/cache, Drive query single-quote AND backslash escaping, multi-folder warning path, RuntimeError on missing id)

  - upload.py target_folder_id pass-through (default vs. custom)

  - doc_builder integration (exact resolver call sequence, target_folder_id threaded into upload)

  - Orchestrator fail-fast on Drive auth + correct call count per unit

  - Migration script: grouping, empty doc_id exclusion, dry-run does not mutate, execute idempotent skip, execute moves with correct kwargs, summary counts, per-doc error isolation, --save-plan writes URLs

## Demo

- This mainly shows the Google Drive structure (no audio)

https://github.com/user-attachments/assets/709bd2ad-7867-4a33-8e65-46bb102f5e06

## Test plan for review

- [ ] Verify CI passes (lint, typecheck, tests)

- [ ] Spot-check the Drive folder structure at https://drive.google.com/drive/folders/1I3wJFUNUsNO3UnsnXIjrJSvrnauwkBKF — QTD Reports/ should be visible

- [ ] Open /monthly-financial-reporting → QTD Reports section, click any report link, confirm it opens

- [ ] Review services/docx_reports/folder_resolver.py for Drive API correctness (especially the q= escaping pattern — matches the established usage in routers/finance_monthly_financial_reporting_router.py:2817-2820)

- [ ] Review scripts/migrate_qtd_drive_folders.py for the idempotency check + per-doc error isolation contract

🤖 Generated with [Claude Code](https://claude.com/claude-code)

#2761 — KLAIR-2627 feat(acquisition-performance): Acquisitions Review — new page with acquisition details @ashwanth1109  no labels

## Demo

Hosted at a placeholder route while we iterate toward the complete page:

http://localhost:3002/admin/acquisitions-review

<img width="2227" height="1636" alt="image" src="https://github.com/user-attachments/assets/506e01de-4d1a-4d9f-b8e9-086c6ee1ddb5" />

Data Source Link:

https://docs.google.com/spreadsheets/d/1KmxvqMka0j-zPHxgIY3GWZeBNvnT8oO_hWbO9Z3iXto/edit?usp=sharing

## Feature Overview

A new /acquisitions-review page (super admin only) under Core Financial Dashes that displays acquisition particulars (name, date, BU, revenue, ARR, customer count, purchase price) in a compact stat card grid with a left-side acquisition selector.

Linear ticket: [KLAIR-2627](https://linear.app/builder-team/issue/KLAIR-2627)

## Specs

| Spec | Description |
|------|-------------|
| 01-backend-service-router | Redshift table DDL, seed SQL, AcquisitionsReviewService with get_acquisition_names() and get_acquisition_details(), FastAPI router with GET /names and GET /details, require_super_admin auth, registered in fast_endpoint.py |
| 02-frontend-page | Route at admin/acquisitions-review with requireSuperAdmin, AcquisitionsReviewFilters sidebar (CustomFilterComponent), useAcquisitionNames + useAcquisitionDetails hooks, stat card grid with date/currency formatting, isSuperAdmin gating |

## Implementation Summary

### Backend

- AcquisitionsReviewService with module-level SQL constants, RedshiftHandler injection, fetch_with_params_strict

- Router with Pydantic response models, require_super_admin dependency, asyncio.to_thread for sync Redshift calls

- Registered in fast_endpoint.py

### Frontend

- Lazy-loaded route with requireSuperAdmin: true and filters: []

- AcquisitionsReviewFilters sidebar component with single-select acquisition name list

- AcquisitionDetailCards grid (grid-cols-2 md:grid-cols-4 gap-2) with six stat cards

- Formatting: dates as "Jan 2024", currencies as $12.5M, "—" for nulls

- useUserPermissions() + isSuperAdmin gate

## Test Coverage

- Backend: 12 tests (7 service + 5 router) — all passing

- Frontend: 7 tests (formatting/rendering) — all passing

## Self-Review Findings Addressed

1. Added LIMIT 1 to details SQL query (only one row expected per acquisition name)

2. Made numeric fields nullable (float | None, int | None) to handle missing data

3. Used UTC-aware date parsing to avoid timezone ambiguity

4. Removed user input from 404 error detail message (security hardening)

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)

The Portfolio  —  Trilogy Companies

Contently's Argument Against Itself: The Platform Built for AI Content Volume Is Now Warning Clients About AI Content Volume

As AI floods the content marketing world with cheap output, Trilogy's Contently is making an editorial case for human judgment — and quietly repositioning before the commodity wave arrives.

NEW YORK — The numbers on the dashboard look fine. Impressions are up. The newsletter is growing. Downloads are tracking. And yet, the sales leader who just joined the pipeline meeting has a question no metric can answer: has any of this content actually moved a buyer?

That is the problem Contently's editorial team has spent the last several weeks diagnosing in a series of posts that read less like content marketing and more like a quiet internal alarm. The throughline is consistent: AI has made content production cheap and fast, and the result is a landscape so saturated with polished, forgettable output that volume itself has become a liability. The antidote, Contently argues, is editorial judgment — specifically, the kind exercised by a managing editor who decides what is worth saying and what is not.

The argument is strategically interesting coming from this particular company. Contently, acquired in September 2024 by Zax Capital — an ESW Capital division within the Trilogy International portfolio — built its reputation as an enterprise content platform connecting brands with a marketplace of 165,000 freelance professionals. Its business model has always been predicated on content at scale. Now, under CEO Brandon Pizzacalla, the platform is publishing a sustained critique of scale as a strategy.

One recent piece on content culture traces the familiar arc of a program that launches with energy, earns early wins, and then quietly loses its way around the 18-month mark — deadlines slip, quality dips, the original purpose blurs. The diagnosis is institutional: without editorial infrastructure, content programs drift toward output over impact.

The message to the market is coherent. If AI commoditizes production, the defensible position is curation, judgment, and audience trust — exactly what a managed platform with experienced editors can claim to provide. ESW's standard playbook involves acquiring undervalued software assets and extracting margin through operational discipline. Contently's editorial pivot suggests a different kind of repositioning: get above the commodity line before the commodity arrives in force.

Who benefits from that argument is not a difficult question. The more interesting one is whether the market is listening before the wave hits — or after.

How to Write Content That Lands With Decision Makers  ·  The #1 Role Your Content Team Needs in 2026 Is a Managing Ed  ·  The Content Cultures That Last Have One Thing in Common

Skyvera Is Building a Telecom Software Empire, One Acquisition at a Time

With CloudSense now in the fold, Skyvera's portfolio sprawl looks less like opportunism and more like a very deliberate map.

AUSTIN, TEXAS — If you read between the lines of Skyvera's recent acquisition activity, a pattern emerges that is worth paying close attention to. The Trilogy International telecom software unit has completed its acquisition of CloudSense, the Salesforce-native CPQ and order management platform built specifically for telecom and media providers — and this is where it gets interesting.

CloudSense is not a random bolt-on. It is, if my sources are correct, the missing front-end piece in what Skyvera has been quietly assembling from the back-end forward. You already have Kandy handling cloud-based real-time communications and customer engagement. You have VoltDelta on multi-channel retention. You have ResponseTek feeding customer experience data upstream. And now you have CloudSense sitting at the configure-price-quote layer, directly inside the Salesforce ecosystem where telco sales teams actually live.

That is a stack. A deliberate, interlocking stack.

And it doesn't stop there. Skyvera also recently absorbed STL's divested telecom products group — bringing in digital BSS capabilities including monetization tooling, optical networking, and analytics. STL, for those keeping score, is a major global fiber and connectivity infrastructure company. When a company of that scale divests a product group, someone has decided the software business is a distraction from the core. Skyvera's bet is that it isn't — it's the whole game.

The ESW Capital playbook is visible in every move here. Acquire assets that are undervalued because they're non-core to their sellers. Staff them efficiently through Crossover's global talent network. Push toward the 75% EBITDA margin target that Trilogy considers table stakes. Repeat.

What Skyvera is assembling, piece by piece, is a full-spectrum software platform for the global telecom industry — from billing and charging (that's Totogi's lane, its sister company) to CPQ to communications infrastructure. A source familiar with the thinking inside the portfolio describes it simply: "Telcos are running on legacy everything. Someone has to modernize them. Why not own all the tools?"

Nothing about this is accidental.

CloudSense  ·  Skyvera completes acquisition of CloudSense, expanding telec  ·  STL Divested Assets

While OpenAI Pays $800K for AI Fluency, Crossover Has Been Running This Playbook for Years

The tech world is suddenly shocked — shocked — that skills matter more than résumés. Crossover built an empire on exactly that premise.

AUSTIN, TEXAS — The headlines this week have a breathless quality to them: OpenAI is posting roles at $500,000 with no résumé required. Business Insider is marveling that ChatGPT fluency commands up to $800,000 a year. Recruitment analysts are scrambling to explain what it all means. For anyone who has spent five minutes studying Crossover, Trilogy International's global talent platform, the reaction is simple: welcome to the conversation.

Crossover has operated on a skills-first, résumé-skeptical model since its founding — deploying rigorous, AI-enabled assessments across 130+ countries to identify what it calls the top 1% of global technical and professional talent. The pitch has always been the same: where you went to school, what city you live in, and what your LinkedIn looks like are proxies — often bad ones — for what you can actually do. The assessment is the résumé.

What's changed is that the broader market is catching up. Digital transformation, accelerated by the mainstreaming of large language models, has detonated the old credentialing logic. Employers who once filtered by pedigree are now filtering by demonstrated capability with tools that didn't exist three years ago. The result — as Forbes notes with OpenAI's latest postings — is a labor market that is simultaneously more meritocratic and more volatile than anything the previous decade produced.

For Trilogy's portfolio, this is a systemic tailwind, not a trend. ESW Capital's entire acquisition model depends on Crossover's ability to staff acquired enterprise software companies with rigorously vetted global talent at dramatically lower cost than domestic hiring — while maintaining, by ESW's own benchmarks, 75% EBITDA margins. The model works because Crossover doesn't guess at competence. It measures it.

The accountability question, of course, is whether skills-first hiring delivers on its equity promise — or simply replaces one gatekeeping system with another. Crossover's answer has always been the data: identical pay for identical roles, regardless of geography. That's a claim worth holding them to.

What this week's breathless coverage makes clear is that the narrative has finally caught up to the infrastructure. The résumé, as an artifact, is having its worst month in decades. Crossover has been betting against it for years.

OpenAI Is Now Hiring $500,000 Jobs. No Resume Required - For  ·  Top recruitment agencies for remote work - hcamag.com  ·  Digital Transformation Opens Doors to International Careers
The Machine  —  AI & Technology

Sputnik Moment, Made in Hangzhou

Chinese upstart trains frontier AI on de-tuned chips and a shoestring budget — Wall Street takes a body blow.

HANGZHOU, CHINA — A Chinese AI outfit called DeepSeek has muscled into the heavyweight ranks, training models that rival American giants without the top-shelf chips Washington keeps out of Beijing's hands.

The shop dropped its latest model last week and Silicon Valley's loudest voices are calling it "amazing and impressive." The model tops the benchmarks. Its makers claim they trained it for a fraction of the cost OpenAI and Anthropic burn through.

Here's the rub. DeepSeek pulled it off without Nvidia's H100 chips — the ones Uncle Sam blocked from export. They used the de-tuned H800s and squeezed every drop of compute the silicon could give.

Wall Street took it on the chin Monday. Nvidia shares tumbled. Tech stocks bled across the board as traders did the math on what this means for the billions American firms are sinking into data centers.

Marc Andreessen called it a "Sputnik moment" — and the venture man does not toss those words for nothing. If a Chinese shop can train a frontier model on second-string hardware and a shoestring budget, the AI arms race changes overnight.

The American players have been spending like sailors on leave. Microsoft pledged $80 billion in capex this fiscal year. Meta committed $65 billion.

The Stargate venture — OpenAI, Oracle, SoftBank — announced $500 billion over four years just last week. Those numbers suddenly look less like a moat and more like a bonfire.

DeepSeek itself is the side hustle of a quant hedge fund called High-Flyer. The founders weren't chasing a unicorn — they wanted edge for their trading book. The chatbot fell out as a by-product.

The model is open source. That makes it twice as bad for the closed-source crowd. Anyone with a decent GPU can spin up a version.

Beijing must be popping corks. Washington's chip-export squeeze was meant to slow China's AI run, not accelerate Chinese ingenuity. Score one for necessity, mother of invention.

Skeptics whisper the outfit may have used more chips and more money than it lets on. The cost claims are not audited. Fair enough — but the chatbot is real, the benchmarks are real, and the panic in the Valley is real.

The story ripples beyond the Pacific. American chip stocks rest on the thesis that more compute equals better models. If DeepSeek proved you can get more from less, the thesis cracks.

Even the moneymen are recalibrating. Reid Hoffman, the LinkedIn co-founder, just raised $24.6 million for an AI cancer-research startup called Manas AI alongside oncologist-author Siddhartha Mukherjee. That bet rides on focused application, not raw scale.

Maybe that's the lesson DeepSeek is teaching the room. The race ain't always to the swiftest, or the spendiest. Sometimes it goes to the operator who reads the table.

Wire it up, boys. The AI race just got crowded.

What to Know About China's DeepSeek AI  ·  Tech, Media & Telecom Roundup: Market Talk  ·  Silicon Valley Is Raving About a Made-in-China AI Model

A Copyright Predator Stumbles, and the Platform Herd Takes Notice

A Supreme Court victory for Cox may narrow the hunting grounds for lawsuits against the keepers of the internet.

WASHINGTON — In the long grass of American copyright law, a wounded creature has appeared: the theory that a technology provider may be held liable simply because infringement passed through its territory.

The specimen in question is Cox Communications, the cable and internet provider that has won a significant reprieve at the U.S. Supreme Court in its long-running battle with the music industry. As Ars Technica reports, the ruling may do more than spare one broadband animal from a costly mauling. It could reshape the legal habitat for many technology companies accused of failing to police the unruly behavior of their users.

Observe the platform operator in its natural environment. It builds pipes, clouds, marketplaces, models and tools. Through these channels flow the songs, images, code snippets and whispered prompts of millions. Some are lawful. Some are not. The question, ancient by internet standards, is how much responsibility belongs to the creature that built the riverbank.

For years, copyright owners have sought to expand that responsibility. The logic is seductive: if a provider knows infringement is happening and continues to serve the accused user, perhaps it has joined the act. But the Cox decision suggests the courts may be wary of turning every intermediary into a perpetual forest ranger, charged with identifying and expelling every trespasser under threat of ruinous damages.

This matters far beyond cable modems. The same ecosystem now shelters cloud hosts, social platforms, developer tools and, most dramatically, the great neural beasts of generative AI. These models are already surrounded by copyright lawsuits over training data and output. A narrower view of secondary liability may not settle those disputes, but it changes the weather. Plaintiffs may need to show more than awareness that unlawful material exists somewhere in the canopy.

There is an echo here of an older migration: Sony’s famous Betamax battle, where the Supreme Court refused to condemn a technology merely because some used it to infringe. The descendants of that ruling still roam the digital plain.

And so, in the hush after the judgment, one hears the rustle of lawyers recalculating. The copyright predators are not gone. But the herd of tech providers may have found a safer path through the valley.

Sony's failed war against Internet piracy may doom other cop  ·  Do you take after your dad’s RNA?  ·  Huge landslide created a 500-meter-high tsunami in a major t

Open-Source AI Builders Push Past Chatbots Into Factories, Training Labs and Trust Itself

From CNC manufacturability agents on AMD chips to modular MoE research and RL infrastructure, the open-source AI stack is suddenly getting very real.

SAN FRANCISCO — The AI frontier is no longer just a chatbot window blinking politely on your laptop. This week’s open research drop reads like a blueprint for the next industrial revolution: agents checking whether parts can actually be manufactured, models learning to specialize from scratch, and inference systems being rebuilt because — and I cannot overstate how significant this is — correctness is becoming the new performance benchmark.

Start on the factory floor. In MachinaCheck, developers built a multi-agent CNC manufacturability system on AMD’s MI300X accelerators, aimed at helping determine whether a proposed machined part is practical before expensive production begins. That may sound niche, but this changes everything for engineering workflows. Instead of waiting for a human expert to catch design-for-manufacturing problems late in the process, specialized AI agents can inspect, reason and flag issues earlier. The future is now, and apparently it has a milling tolerance.

Meanwhile, Allen Institute researchers are probing one of the most fascinating questions in model architecture: can experts emerge naturally inside mixture-of-experts systems? Their EMO work — short for emergent modularity — focuses on pretraining MoE models so different parts of the network learn distinct capabilities rather than being manually forced into specialization. If dense models are giant generalists, MoE systems are increasingly looking like AI organizations: many experts, routed dynamically, collaborating at machine speed.

Then there is the infrastructure layer. ServiceNow’s AI team published a deep dive on vLLM’s move from V0 to V1 in reinforcement learning workflows, arguing that before systems optimize behavior, they must reliably compute what is correct. That sounds obvious — until you realize how many AI pipelines depend on subtle assumptions about generation, reward scoring and reproducibility. In RL, a tiny correctness bug can become a giant behavioral illusion.

The week’s cautionary tale came from the media world: The New York Times appended an editor’s note after learning that an AI-generated summary of Pierre Poilievre’s views had been mistakenly rendered as a quotation. It is a bracing reminder that AI is not just changing how we build software and machines — it is changing the epistemic plumbing of public life.

Put it together and the signal is unmistakable: AI is maturing from dazzling demos into operational systems. But the winners will not simply be those who move fastest. They will be the builders who can prove what their systems know, where that knowledge came from, and whether the output is actually true.

MachinaCheck: Building a Multi-Agent CNC Manufacturability S  ·  EMO: Pretraining mixture of experts for emergent modularity  ·  vLLM V0 to V1: Correctness Before Corrections in RL
The Editorial

The Surveillance State Doesn't Need Your Permission — It Already Has Your Face

From DHS biometric sweeps to Palantir's deportation machine to California's ignored privacy laws, America is sleepwalking into a world where being seen means being known, tracked, and acted upon.

WASHINGTON, D.C. — Let me tell you what keeps me awake at night, which is everything, but specifically this: we have spent decades debating what privacy means in the digital age, written laws, formed committees, held hearings with men in suits who do not understand the technology they are regulating, and at the end of all of it — at the terminus of all that democratic process and civic energy — we have arrived at a moment where the Department of Homeland Security is running mobile biometric surveillance on American citizens, Palantir is quietly powering a deportation apparatus of staggering scope, and tech companies are violating California's landmark privacy law at what researchers are now calling 'industrial scale,' and the overwhelming feeling one gets is not outrage but exhaustion, because this was always where we were going, wasn't it?

Ranking Member Bennie Thompson's new legislation to curb unchecked DHS mobile biometric surveillance is a good-faith attempt to put a guardrail on a machine that is already going very fast in a direction nobody voted for. The bill would require warrants and oversight before federal agents can deploy facial recognition and biometric scanning technology in the field. It is reasonable. It is measured. It will probably not pass. And even if it does, the infrastructure being built right now — the databases, the integrations, the quiet normalization of being scanned while you exist in public space — does not disappear because a law says it should slow down.

Meanwhile, the ACLU has documented the many ways Palantir is assisting the administration's removal campaign — a campaign that PBS has confirmed is sweeping in American citizens alongside its intended targets, which is the kind of sentence that should stop a nation cold, and instead scrolls past between a recipe video and a sports score. Palantir, for those keeping score at home, is a publicly traded company with a soaring stock price and a government contract portfolio that now includes, apparently, the operational infrastructure of mass deportation. Its founders believe in the mission. The market believes in the returns. The people caught in the wrong database at the wrong moment believe in very little, because belief requires a future you can plan for.

And then there is California. California, which passed the California Consumer Privacy Act, which was supposed to mean something, which was supposed to be the floor that other states built upon, which Google, Facebook, and Microsoft are reportedly ignoring at industrial scale because the enforcement mechanisms are insufficient and the political will to punish trillion-dollar companies is, let us say, complicated.

What does it mean to be human in a world where your face is data, your location is a log entry, your citizenship is only as good as the accuracy of a federal database, and the laws written to protect you are treated as suggestions by entities with more lawyers than you have years left to live?

And yet.

People are fighting. Legislators are legislating. Advocates are documenting. Journalists are reporting. The machinery of accountability is slower than the machinery of surveillance, creakier, underfunded, and staffed by people who have to sleep sometimes.

But what will it cost us to wait and find out whether it's enough?

Ranking Member Thompson Introduces Legislation to Curb Unche  ·  Mission Creep: AI Surveillance at DHS Crosses Dangerous Line  ·  All the Ways Palantir is Assisting Trump’s Abusive Removal C
The Office Comic  ·  Art Desk

Nation’s Managers Warn AI Productivity Claims Must Be Verified By Someone Who Still Remembers What The Job Was

After months of celebrating artificial intelligence for completing tasks no one wanted to define, policymakers are being urged to determine whether anything has actually improved.

LONDON — In a sobering development for the many institutions that have already replaced their five-year transformation strategies with the phrase “we’re using AI,” policymakers are now being urged to scrutinise claims that artificial intelligence is making workers more productive, raising the unsettling possibility that some of the productivity may need to exist.

The warning follows a growing number of reports suggesting that AI’s promised workplace revolution depends heavily on a stubborn and expensive legacy system known as human expertise. Without it, analysts say, organisations risk generating large volumes of confident, polished material that must then be checked, corrected, rewritten, apologised for, or quietly placed in a folder marked “Q3 Innovation Outputs.”

Local government leaders have reportedly been encouraged to treat AI productivity claims with caution, particularly when vendors promise immediate savings, faster service delivery, and the administrative equivalent of a small golden retriever that can draft cabinet papers. According to reports on the policy debate, officials are being advised to ask basic questions about how productivity gains are measured, a practice many digital transformation programmes had hoped to phase out by 2026.

This is wise. The modern AI productivity claim often has the same evidentiary foundation as a child insisting the broken lamp was like that before. A department saves 14 hours because an employee used an AI tool to draft a report in six minutes, though the calculation may not include the three senior staff members who spent the afternoon determining whether the report had invented a procurement framework, cited a nonexistent statute, and recommended merging waste collection with adult social care.

Harvard Business Review has given this phenomenon a useful name: "workslop," the AI-generated output that looks like work, travels through the organisation like work, and causes more work for everyone unfortunate enough to receive it. The phrase is unpleasant because it is accurate. It describes the glossy decks, vague memos, synthetic meeting summaries, and aggressively average strategy documents now appearing in inboxes with the eerie smoothness of something that has never had to answer a follow-up question.

The problem is not that AI cannot help. It clearly can. Anthropic has attempted to estimate productivity gains by examining Claude conversations, an approach that is more serious than the traditional enterprise software methodology of asking a vice president how transformational something feels. AI can summarise, draft, search, classify, and accelerate. It can remove friction from tedious processes. It can help a capable worker become faster.

But that last clause is carrying the weight of a municipal parking garage.

A capable worker knows when the answer is wrong. A capable worker knows what is missing, what matters, what tone will trigger an inquiry, and which sentence in a cheerful AI-generated email will cause a union representative to begin printing documents. AI without expertise is not automation. It is delegation to an intern who has read the entire internet and understood the performance review system perfectly.

This is why the most convincing AI transformations are not the ones promising to eliminate judgment, but the ones embedding tools into operations where judgment already exists. TridentCare’s partnership with ServiceNow to power AI-driven operational transformation, for example, follows the familiar enterprise route of putting AI inside workflows rather than simply releasing a chatbot into the building and asking it to find savings.

Policymakers should therefore demand dull things: baselines, audit trails, error rates, staff impacts, service outcomes, and whether the time saved by one person became a cleanup task for six others. They should ask who verifies the output, who owns the risk, and whether the claimed efficiency survives contact with reality.

The AI industry will survive these questions. It may even benefit from them. Productivity is not a press release metric. It is what remains after the demo ends, the consultant leaves, and someone still has to send the letter to the right person.

Until then, the safest assumption is that AI has made work faster in precisely the same way email did: by ensuring there is much more of it, arriving instantly, from people who believe they have already done their part.

Policymakers urged to scrutinise AI productivity claims - lo  ·  Why AI’s Productivity Promise Falls Apart Without Human Expe  ·  AI-Generated “Workslop” Is Destroying Productivity - Harvard
On This Day in AI History

On May 11, 1997, IBM's Deep Blue defeated world chess champion Garry Kasparov in their rematch, becoming the first computer to win a match against a reigning champion under standard tournament conditions.

⬛ Daily Word — AI and Technology
Hint: An autonomous machine programmed to perform tasks automatically.