## Screenshots
<img width="1919" height="942" alt="image" src="https://github.com/user-attachments/assets/5be0f85d-c019-4b17-9425-2aa8a0fb8bff" />
## Summary
- Bumps Budget Bot 4.0 to Claude Opus 4.7 across every LLM call (board-doc generation, Coach Claire chat, brainlift QC) with a new thinking_kwargs(effort) helper that handles Opus 4.7's adaptive-thinking shape via extra_body and a TEMPERATURE_UNSUPPORTED_MODELS guard.
- Ships B7 Path A — whole-document context for Coach Claire so she can reason across sections (catch internal contradictions, verify cross-section number coherence, spot completeness gaps), plus B7.8 explicit "planning quarter" framing and the B0.8 / B0.9 / B1.7 clone-path polish that together make a freshly-cloned Skyvera Q2 session demo-ready (no empty H1 wrappers in the outline, no duplicate headings on regen, refresh banner actually fires).
- Ships B8 — section CRUD via editor + Claire tools end-to-end: POST / DELETE / PATCH /sections BE endpoints with save_with_merge_retry discipline, three new Coach Claire tools (add_section / remove_section / rename_section), and matching FE proposal handlers — closes the May 7 "can't add a GM Commentary section post-generation" workflow gap.
## Why it's needed
- Local testing of B7 surfaced a chain of clone-path bugs that made the demo flow incoherent — Claire reasoned about Q1 numbers as if they were current state, the doc body had two stacked "Prior Quarter Review" headings after every regen, the reload banner never fired on cloned sessions, and the editor's outline started with a confusing empty "Business Unit Plan" H1 wrapper.
- The model bump and the new prompt framing (B7.8) together made Claire materially smarter at cross-section reasoning during testing — she correctly caught a 68% vs 63% margin target inconsistency across MIPs, Goals, and the financial tables AND a $1.4M vs $0/$77K Q4'25 write-off mismatch between MIPs and the Hybrid Plan table without any explicit prompting. That's the kind of "this doc isn't internally consistent" feedback the David-demo target relies on.
- The May 7 testing also surfaced a structural gap: Claire couldn't propose adding a section the user wanted (e.g. a missing GM Commentary), because section structure was locked into the wizard's template-customisation step. B8 closes that — Claire's tool surface now covers structure, not just content.
## Changes
Model layer (Opus 4.7):
- Centralised BOARD_DOC_MODEL = "claude-opus-4-7" in models.py; replaced ~16 hardcoded claude-sonnet-4-20250514 strings with imports of the constant.
- New thinking_kwargs(effort) helper in models.py returning thinking={"type": "adaptive"} + extra_body={"output_config": {"effort": effort}}. Opus 4.7's adaptive-thinking shape isn't yet exposed as a typed SDK kwarg; extra_body is the documented escape hatch.
- Dropped temperature=0 from the four direct Anthropic call sites (Opus 4.7 deprecated the parameter). gpt_retry.py got TEMPERATURE_UNSUPPORTED_MODELS to omit the kwarg for Opus 4.7 in the structured-call path.
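
A minimal sketch of how the two model-layer pieces above might fit together. The dict returned by thinking_kwargs follows the shape quoted in this PR; build_call_kwargs and its parameters are hypothetical names used only for illustration, and the real guard in gpt_retry.py may be structured differently.

```python
from typing import Any, Dict, Optional

BOARD_DOC_MODEL = "claude-opus-4-7"

# Models for which the temperature kwarg must be omitted.
TEMPERATURE_UNSUPPORTED_MODELS = frozenset({BOARD_DOC_MODEL})


def thinking_kwargs(effort: str = "high") -> Dict[str, Any]:
    """Return the adaptive-thinking kwargs in the shape described above."""
    return {
        "thinking": {"type": "adaptive"},
        "extra_body": {"output_config": {"effort": effort}},
    }


def build_call_kwargs(
    model: str,
    max_tokens: int,
    temperature: Optional[float] = 0.0,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble kwargs for a structured call.

    Temperature is dropped for any model listed in
    TEMPERATURE_UNSUPPORTED_MODELS and passed through otherwise.
    """
    kwargs: Dict[str, Any] = {"model": model, "max_tokens": max_tokens}
    if temperature is not None and model not in TEMPERATURE_UNSUPPORTED_MODELS:
        kwargs["temperature"] = temperature
    return kwargs
```

A caller on Opus 4.7 would merge both dicts, for example `{**build_call_kwargs(BOARD_DOC_MODEL, 4096), **thinking_kwargs("medium")}`, before handing them to the client.
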
B7 Path A — whole-doc context:
- _full_doc_block(session, focused_section_id) concatenates every generated section into a <full_document> block; focused section excluded to avoid duplication; per-section truncation with [N additional sections omitted] markers + INFO logs. Caps: 80K total / 30K per section.
- M10 follow-up (review round 1): full-doc block moved from the system prompt to the latest user message via _compose_messages_with_full_doc so the static framing stays cacheable across chat turns. This should lower input cost on Opus 4.7 for typical docs, but the saving is not yet measured: review-round-2 R2-M2 dropped the original dollar-figure claim in favour of a directional comment and tracking ticket B7.10, which will compare cache_creation_input_tokens against cache_read_input_tokens in a real prod chat-turn telemetry pass.
- Chat handler max_tokens 1024 → 4096.
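
The truncation behaviour described for _full_doc_block can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped code: the caps are treated as character counts (the PR does not state the unit), the `<section>` wrapper and the "[section truncated]" marker are invented for the example, and sections are passed as plain tuples rather than a session object.

```python
import logging
from typing import List, Optional, Tuple

logger = logging.getLogger(__name__)

MAX_TOTAL_CHARS = 80_000    # assumption: the 80K cap is in characters
MAX_SECTION_CHARS = 30_000  # assumption: the 30K cap is in characters


def full_doc_block(
    sections: List[Tuple[str, str, str]],
    focused_section_id: Optional[str],
) -> str:
    """Build a <full_document> block from (section_id, title, body) tuples.

    The focused section is skipped to avoid duplicating it, oversized
    bodies are cut to the per-section cap, and once the total cap is
    reached every remaining section is counted as omitted.
    """
    parts: List[str] = []
    used = 0
    omitted = 0
    for section_id, title, body in sections:
        if section_id == focused_section_id:
            continue
        if omitted:
            omitted += 1
            continue
        if len(body) > MAX_SECTION_CHARS:
            logger.info("Truncating section %s to %d chars", section_id, MAX_SECTION_CHARS)
            body = body[:MAX_SECTION_CHARS] + "\n[section truncated]"
        chunk = f'<section id="{section_id}" title="{title}">\n{body}\n</section>'
        if used + len(chunk) > MAX_TOTAL_CHARS:
            omitted += 1
            continue
        parts.append(chunk)
        used += len(chunk)
    if omitted:
        logger.info("Omitted %d sections from the full-doc block", omitted)
        parts.append(f"[{omitted} additional sections omitted]")
    return "<full_document>\n" + "\n".join(parts) + "\n</full_document>"
```
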
Demo polish (B7.8 / B0.8 / B0.9 / B1.7):
- B7.8: rewrote prompt opening to explicit "helping the user plan Q{n} Y for {BU}" with a follow-on paragraph telling Claire that body content may carry over from the prior quarter.
- B0.8: create_from_prior_quarter renames the first empty H1 wrapper to {BU} Q{n} Y Plan; subsequent empty wrappers dropped. assemble_markdown matches the new title format. Review-round-1 fix #1 unified this format across publish_to_google_doc (Drive filename), create_from_prior_quarter (clone GDoc filename), and the .docx export endpoint — all four 4.0 sites now produce identical strings, pinned by test_assembler_title_format.py. Review-round-2 R2-H1: documented an explicit deferral at the two legacy 3.0 callsites (final_document_service.py × 3, budget_doc_generator.py × 1) that intentionally retain the older {BU} Budget Plan Q{n} Y / Budget Plan for {BU} - Q{n} Y format because they're separate product surfaces (Goal-MIPER + the older non-wizard generator) where unifying would either break an existing Drive lookup key or require coordinated migration with a different product owner. Tracking ticket B0.8b.
- B0.9: _strip_leading_duplicate_heading(markdown, title) post-processes generator output; wired into all three regenerate paths (typed, custom, exec-summary). Fuzzy match on title (case + punctuation insensitive).
- B1.7 path (a): _promote_section_type_from_title heuristic with regex patterns for canonical section titles. Review-round-1 M7 added a _USER_CUSTOMISATION_SUFFIX_RE block-list (Discussion, Notes, Status, Update, Deep Dive, etc.) so user-customised titles like "MIPs Discussion" stay CUSTOM rather than getting silently re-typed; pinned by test_promote_section_type_from_title.py. Review-round-2 R2-M3: split risks? out of the bare-match alternation into a multi-word-only sub-pattern so a future canonical "Risks" section can be honoured without the block-list silently demoting it; the trade-off (conservative on canonical false-NEGATIVES, aggressive on CUSTOM false-POSITIVES) is now documented explicitly in the regex's docstring.
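
For the B0.9 item above, a case- and punctuation-insensitive heading match could look like the sketch below. The normalisation rule (keep only lower-cased letters and digits) and the decision to inspect only the first non-blank line are assumptions made for illustration; the real _strip_leading_duplicate_heading may match differently.

```python
import re

# ATX heading: one to six '#', whitespace, then the heading text.
_HEADING_RE = re.compile(r"^#{1,6}\s+(.*?)\s*#*\s*$")


def _normalise(text: str) -> str:
    """Lower-case the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())


def strip_leading_duplicate_heading(markdown: str, title: str) -> str:
    """Drop the first non-blank line if it is a heading that repeats the title."""
    lines = markdown.splitlines()
    idx = 0
    while idx < len(lines) and not lines[idx].strip():
        idx += 1
    if idx == len(lines):
        return markdown  # empty or whitespace-only input
    match = _HEADING_RE.match(lines[idx].strip())
    wanted = _normalise(title)
    if match and wanted and _normalise(match.group(1)) == wanted:
        return "\n".join(lines[idx + 1:]).lstrip("\n")
    return markdown
```

With this shape, generator output that opens with "## Prior-Quarter review:" would lose that line when the section title is "Prior Quarter Review", while a non-matching first heading is left untouched.
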
Section-id visibility + chat polish (B3.19 / B3.20):
- _build_step_context section inventory shows id=... — "{title}" instead of just title; explicit "use the exact id" guidance.
- regenerate_section tool description rewritten: explicit "do NOT slugify" + "WORKFLOW: Accept kicks off the pipeline IMMEDIATELY, no second diff step."
- handle_chat three-branch fallback: text block → use verbatim; tool calls only → "Proposed an action above — review and accept when ready."; pathological → legacy "rephrase" message.
- _regenerate_section logs WARNING on unknown section_id with the full known-id list.
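
The three-branch fallback in handle_chat can be sketched as below. Content blocks are modelled as plain dicts with the "text" and "tool_use" types; the function name, the placeholder reply constants, and the choice to join multiple text blocks with newlines are assumptions for illustration only.

```python
from typing import Any, Dict, List

# Placeholder copy: the real strings live in the chat handler.
TOOL_ONLY_REPLY = "<tool-only acknowledgement copy>"
REPHRASE_REPLY = "<legacy rephrase copy>"


def fallback_reply(content_blocks: List[Dict[str, Any]]) -> str:
    """Pick the assistant-visible reply for one model response.

    1. Any non-empty text block: return the text unchanged.
    2. Only tool calls: return a short acknowledgement.
    3. Neither: fall back to the legacy rephrase message.
    """
    texts = [
        block["text"]
        for block in content_blocks
        if block.get("type") == "text" and block.get("text", "").strip()
    ]
    if texts:
        return "\n".join(texts)
    if any(block.get("type") == "tool_use" for block in content_blocks):
        return TOOL_ONLY_REPLY
    return REPHRASE_REPLY
```
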
B8 — section CRUD:
- Three new endpoints (POST / DELETE / PATCH /sections) wrapping new orchestrator functions (add_section / remove_section / patch_section). Sparse-integer ordering (gap = 1000) via shared _resequence_tail_starting_at helper (review-round-1 #6 fix; the pre-fix add_section rebalance loop produced 2x the intended spacing because it added _SECTION_ORDER_GAP redundantly inside the body — patch_section's twin loop was already correct, helper now used by both). Cascade on delete drops generated_sections[id], section_edit_status[id], section_comments anchored to the removed id, plus (review-round-1 #7) data_refresh_updated_sections and user_commentary[section_id] (for the chat-feedback keying); type-promotion auto-fills required_data.
- Each endpoint runs the orchestrator inside save_with_merge_retry via a result-holder pattern that captures the orchestrator's return payload from inside the closure. Review-round-1 #4 corrected the misleading "EXACTLY ONCE" docstring: the closure is allowed to re-run on ConcurrentModificationError, correctness comes from DDB conditional saves, and the result-holder is a response-shaping mechanism, not a single-execution guarantee. Pinned by TestSectionCRUDRetryPath (review-round-1 #5: 3 tests stub storage.save to raise once and assert no duplicate / no 404 / no double-shift on the retry).
- claire_tools.py extended from 4 to 7 tools with matching Pydantic input validators + Anthropic wire schemas:
  - add_section accepts both after_section_id and before_section_id; review-round-1 M13 made PatchSectionRequest symmetric (PATCH also accepts before_section_id so drag-to-top is a single primitive, not "after the section preceding the head").
  - M14 short-circuits empty PATCHes to skip the DDB write entirely.
  - M15 tightens the orchestrator's changed signal to False for value-equivalent no-ops; review-round-2 R2-M1 surfaces this signal through SectionMutationResponse.changed (BE) → SectionMutationResponse.changed? (FE TS interface) → ChatToolProposal rename handler skipping onSectionStructureChanged when changed === false, so the M15 contract is end-to-end live rather than orchestrator-only.
  - M16 logs a warning when a section_type / entity_type mismatch produces an empty required_data slate; review-round-2 R2-H2 added a sibling if section_type != CUSTOM: guard to patch_section (the round-1 fix only guarded add_section), preventing spurious operator-page warnings on CUSTOM transitions where the empty slate is the explicit user choice rather than a misconfiguration.
- FE:
  - boardDocApi.ts API client wrappers (createSection / deleteSection / patchSection) + matching TypeScript types.
  - ChatToolProposal.tsx handleAccept switch + ProposalBody switch each extended with three new variants.
  - Destructive warning copy on remove_section proposal cards; cascade-cleared-comments toast on Accept.
  - Review-round-1 #8 added a window.confirm gate on remove_section Accept (matches the existing comment-delete pattern; the proposal card's destructive warning copy was the only gate pre-fix).
  - Review-round-2 R2-L1 routes the human-readable section title through DocumentEditorPage → ChatPanel → ChatToolProposal (reusing the existing sectionTitlesMap memo) so the destructive-confirm dialog AND the proposal-card body caption surface "Other Products" instead of minor_products_summary, matching the SectionNav outline + post-delete toast.
  - Round-1 deferred FE Low #2 added role="alert" to the destructive-warning chip while the file was being touched.
  - Review-round-1 M19 fixed a tautological stale-resolve guard in the auto-fetch + loadAll effects via a render-tracking currentSessionIdRef; review-round-2 R2-L2 moved the ref update into a no-deps useEffect (concurrent-mode-safe shape).
  - Review-round-1 M20 surfaced refreshSession failures via an opt-in silent: false mode so structural changes that fail to refresh leave the user with a "click Reload" toast instead of a stale outline.
  - M21 / M22 polished _AUTO_REGENERATE_SECTION_TYPES (renamed without leading underscore + moved out of the import block).
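
The sparse-integer ordering used by the B8 endpoints can be illustrated with a small sketch. The gap of 1000 and the idea of a shared tail-resequencing helper come from this PR; taking the midpoint between neighbours, the dict-based section shape, and the insert_after function are assumptions for the example and may not match add_section / patch_section.

```python
from typing import Any, Dict, List

SECTION_ORDER_GAP = 1000


def resequence_tail_starting_at(
    sections: List[Dict[str, Any]], start_index: int, first_order: int
) -> None:
    """Reassign order values from start_index onward, exactly one gap apart."""
    order = first_order
    for section in sections[start_index:]:
        section["order"] = order
        order += SECTION_ORDER_GAP  # advance once per section, not twice


def insert_after(
    sections: List[Dict[str, Any]], index: int, new_section: Dict[str, Any]
) -> None:
    """Insert new_section after sections[index]; the list is sorted by order."""
    prev_order = sections[index]["order"]
    if index + 1 == len(sections):
        # Appending at the end: just step one gap past the last section.
        new_section["order"] = prev_order + SECTION_ORDER_GAP
        sections.append(new_section)
        return
    next_order = sections[index + 1]["order"]
    sections.insert(index + 1, new_section)
    if next_order - prev_order > 1:
        # Room between neighbours: take the midpoint, nothing else moves.
        new_section["order"] = (prev_order + next_order) // 2
    else:
        # No room left: respace the new section and everything after it.
        resequence_tail_starting_at(sections, index + 1, prev_order + SECTION_ORDER_GAP)
```

Most inserts touch only the new section; only when a gap is exhausted does the tail get rewritten, which keeps the common-case write small.
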
Out of scope (deferred): B8 manual SectionNav context menu + "+" button — the BE is public-shaped and ready, the Coach Claire flow is the demo path so the manual editor UI can ship in a follow-up PR without coordinated BE changes.
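
For readers unfamiliar with the result-holder pattern the B8 endpoints use, here is a rough sketch. Everything in it is a stand-in: the save_with_merge_retry signature, the load / save callables, the error class, and the id scheme are hypothetical and only mirror the behaviour described above (the closure may re-run, and the holder carries the payload of the attempt that saved).

```python
from typing import Any, Callable, Dict


class ConcurrentModificationError(Exception):
    """Stand-in for the storage layer's conditional-save failure."""


def save_with_merge_retry(
    load: Callable[[], Dict[str, Any]],
    mutate: Callable[[Dict[str, Any]], None],
    save: Callable[[Dict[str, Any]], None],
    max_attempts: int = 3,
) -> None:
    """Hypothetical shape: reload, re-apply mutate, save; retry on conflict."""
    for attempt in range(max_attempts):
        session = load()
        mutate(session)
        try:
            save(session)
            return
        except ConcurrentModificationError:
            if attempt == max_attempts - 1:
                raise


def add_section_endpoint(
    load: Callable[[], Dict[str, Any]],
    save: Callable[[Dict[str, Any]], None],
    title: str,
) -> Dict[str, Any]:
    holder: Dict[str, Any] = {}

    def mutate(session: Dict[str, Any]) -> None:
        # May run more than once; each run starts from a freshly loaded session.
        new_id = f"section_{len(session['sections']) + 1}"
        session["sections"].append({"id": new_id, "title": title})
        holder["result"] = {"section_id": new_id, "changed": True}

    save_with_merge_retry(load, mutate, save)
    return holder["result"]
```

Because each attempt overwrites the holder, the response reflects the attempt whose save succeeded; duplicate prevention comes from reloading before each attempt plus the conditional save, not from the holder.
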
## Breaking changes
None at the contract level. Wire schemas / endpoints are all additive; existing 4-tool Claire surface untouched. SectionMutationResponse.changed (BE) and SectionMutationResponse.changed? (FE) are additive fields with backward-compatible defaults (BE defaults to True, FE TS interface marks it optional). Two soft-breaking implementation details worth flagging for any in-flight branches:
- BOARD_DOC_MODEL = "claude-opus-4-7" replaces the prior 4-6 default. Prod cost per chat turn goes up vs the previous Sonnet/Opus-4.6 mix; offset by adaptive thinking choosing budget per call AND by the M10 prompt-cache placement that keeps the static framing cacheable across chat turns.
- The Anthropic SDK error surface changed for legacy callers that still pass thinking={"type": "enabled", "budget_tokens": ...} against Opus 4.7 — gpt_retry.py's guard catches the structured-call path; direct callers should switch to thinking_kwargs(...).
## Test plan
### Reviewer demo path (the four prompts that locked the May-7 build green)
Start a fresh Skyvera Q2 2026 session and run these four prompts in order:
1. "The prior quarter review section only has an outline. Can we generate content for it based on prior quarter performance?"
Exercises the typed regenerate path on PRIOR_QUARTER_REVIEW + B7 full-doc context + M6 focused_section_id parameterisation (regenerate path). Expected: PQR section fills with a coherent narrative grounded in the prior-quarter numbers.
2. "Excellent, now can you add a GM Commentary section above the PQR that gives an executive level summary of the quarter for Skyvera?"
Exercises B8 add_section + before_section_id + M6 (_draft_gm_commentary keyed on the actual section id, not the slug). Expected: GM Commentary appears above PQR in the outline, auto-regenerates, and the body summarises the quarter without re-quoting the GM section's old contents.
3. "Can you add a comment to the relevant section that has the gross margin warning from the review?"
Exercises B8.2 / B3.5 add_comment proposal flow + cross-section reasoning. Expected: Claire identifies the section carrying the gross-margin signal and proposes an add_comment tool action anchored to a specific paragraph.
4. "How would you grade this plan for Skyvera?"
Exercises doc-wide grade synthesis (M10 prompt-cache placement matters here — the full-doc context block is required). Expected: a graded summary that references multiple sections coherently rather than just summarising one or two.
### Executed (CI/local)
- [x] cd klair-api && uv run ruff format <changed-files> clean.
- [x] cd klair-api && uv run ruff check <changed-files> clean.
- [x] cd klair-api && uv run pyright <changed-files> — 0 new errors from this PR (1 pre-existing warning in wizard_orchestrator.py confirmed unrelated).
- [x] cd klair-api && uv run pytest tests/board_doc/ -q — 1281 passing, 0 regressions (1213 pre-review-round-2 + 11 new from round-2 fixes; round-2 added TestPatchSectionCustomTransitionWarning (2), TestPatchSectionChangedFlag (3), 1 new in TestPatchSectionEmptyNoOp, 11 in test_promote_section_type_from_title.py for the R2-M3 multi-word/bare-risk frontier).
- test_chat_full_doc_block.py — 17 tests, updated for M10 (full-doc moved to user message, system prompt no longer carries it).
- test_section_crud_endpoints.py — 45 tests (39 pre-round-2 + 2 R2-H2 CUSTOM-transition + 3 R2-M1 changed-flag + 1 R2-M1 empty-PATCH changed=false).
- test_assembler_title_format.py — 10 tests pinning the unified {BU} Q{n} Y Plan format across all four 4.0 sites (review-round-1 #1). Two legacy 3.0 callsites (final_document_service.py × 3, budget_doc_generator.py × 1) intentionally retain the older format with explicit deferral comments per review-round-2 R2-H1.
- test_promote_section_type_from_title.py — 56 tests (45 pre-round-2 + 7 multi-word-risk + 4 bare-risk-no-trip for R2-M3).
- All other suites: same coverage as before, all passing.
- [x] cd klair-client && pnpm tsc --noEmit clean.
- [x] cd klair-client && pnpm eslint <changed-files> --max-warnings 0 clean.
- [x] cd klair-client && pnpm test BoardDoc --run — 236 passing (231 pre-round-2 + 5 net-new in ChatToolProposal.b8.spec.tsx: R2-L3 cancel-path button-re-enabled assertion, R2-L1 title-resolution + slug-fallback tests, round-1-deferred FE Low #9 error-path tests for createSection / deleteSection / patchSection).
- [x] klair-api/scripts/b7_smoke_chat.py — bypass-the-FE smoke harness for fast Opus 4.7 reachability + B7 cross-section reasoning validation.
### Follow-up manual validation
- [x] Open a fresh Skyvera Q2 clone via the editor; confirm reload banner fires after the background data refresh completes.
- [x] Run the four-prompt reviewer demo path above end-to-end against a real Skyvera Q2 session.
- [x] Validate Coach Claire's cross-section reasoning quality on a real Skyvera Q2 doc — ask "are the revenue numbers consistent across sections?" / "any contradictions in this doc?" — expect specific, numbers-backed catches across MIPs / Financials / GM Commentary.
## Review-round-1 deferrals (Tier 2 follow-ups)
The May-7 deep review surfaced 8 High + 22 Medium + 14 Low items. All 8 High and 13 of 22 Mediums shipped in round-1 (commits pr2750_review_round1_*); the rest are filed as follow-up tickets and tracked in .cursor/BACKLOG-budget-bot-4.md:
Model layer:
- M3 — SDK wire-shape regression test for thinking_kwargs (the Anthropic upgrade canary).
- M4 — Test that TEMPERATURE_UNSUPPORTED_MODELS gate fires for Opus 4.7.
- M5 — thinking_kwargs extra_body merge (currently monopolises extra_body; future callers wanting their own beta header get clobbered).
- L1 (Model) — Default effort="high" is the most-expensive setting; cost-aware paths should opt in explicitly.
B7:
- M9 — Tracking ticket for the May-7 "session.spec swapped mid-call" diagnostic logging (root cause unresolved; reconciliation path masks the underlying bug).
- M11 — Section inventory + full-doc body redundancy (~800 chars; small win).
- B9 — Plumb full_doc_block through generate_section so CUSTOM sections behave the same in initial-gen and regenerate paths (the principled fix M8 documents the temporary shape of).
- L (B7) — Lazy strip imports duplicated in 5 places; non-thread-safe _TITLE_TO_SECTION_TYPE global; O(n) sorted_sections.index in truncation path; B7.8 framing wording polish; _full_doc_block Optional handling; _history_to_message_params resolved_tool_use_ids or set() defaulting.
B8 BE:
- M12 — idempotency_key on POST /sections (pattern from update_section_cell; needs a per-session accepted-key dedup).
- M17 — Type-promotion replace_required_data: bool = False flag so user-curated required_data isn't clobbered on type change.
- L (B8 BE) — Gap-exhaustion strategy not documented; unrecoverable concurrent-CRUD scenarios untested.
B8 FE:
- B8.1 — Manual SectionNav context menu + "+" button (FE-only; BE is ready).
- B8.6 — Polished RemoveSectionConfirm modal component (the inline window.confirm from #8 is the immediate safety fix; the modal UX is the principled follow-up).
- L (B8 FE) — Two-rapid-add_section race coverage; auto-regen failure recovery; validator type-cast tightening; cross-session contamination on chained onSectionUpdated. (Round-1 FE Low #2 role="alert" on destructive copy and FE Low #9 error-path tests for the 3 new tool variants both folded into round-2 — see below.)
Architectural backlog (still tracked):
- B8.4 — root-cause investigation for the May-7 "spec-swap mid-call" reconciliation path.
- B7.5 — Doc-wide findings block (Phase C amplifier).
- B7.6 — Finding-status linkage on Claire proposals (closes the review→chat→review loop).
- B7.7 — Check metadata in Claire's prompt (Phase D dovetail).
- B7.9 — refresh_data Claire tool (agentic data-refresh capability + B1.7 workaround).
- B3.18 — Conversation history retention beyond the last 10 turns.
- B1.7 path (b) — Refactor refresh detector to dispatch by required_data instead of section_type (deferred indefinitely; path (a) solves the operational problem).
## Review-round-2 follow-ups
Round-2 surfaced 14 new findings introduced by the round-1 fixes themselves (down from 44 in round 1, concentrated in three surfaces). Both Highs, all 3 Mediums, and 3 of 9 Lows (R2-L1 to R2-L3) shipped in commit pr2750_round2; the remaining 6 BE Lows (R2-L4 to R2-L9) are non-blocking and captured for a follow-up sweep:
Shipped in round-2:
- R2-H1 — Title-format unification: documented deferral at the two legacy 3.0 callsites (final_document_service.py × 3, budget_doc_generator.py × 1) per the reviewer's option (b). Tracking ticket B0.8b for cross-product format alignment.
- R2-H2 — patch_section M16 false-positive on CUSTOM transitions: one-line if new_type != CUSTOM: guard mirroring add_section. Two new regression tests (positive + negative-control).
- R2-M1 — changed: bool on SectionMutationResponse so M15's no-op signal reaches the FE; rename_section Accept handler now skips the structure-change refetch on value-equivalent renames.
- R2-M2 — Dropped the M10 magic-number cost claim; replaced with a directional comment + tracking ticket B7.10 for measured savings against prod telemetry.
- R2-M3 — _USER_CUSTOMISATION_SUFFIX_RE multi-word risks? carve-out so a future canonical "Risks" section type isn't silently demoted to CUSTOM. Trade-off explicit in the docstring; 11 new tests.
- R2-L1 — Confirm dialog + body caption use the human-readable section title (with section_id-slug fallback), matching the SectionNav outline + post-delete toast.
- R2-L2 — currentSessionIdRef update moved into a no-deps useEffect (concurrent-mode-safe).
- R2-L3 — Cancel-path test asserts the Accept button is re-enabled (pins the finally { setBusy(false) } contract).
- Round-1 FE Low #2 — role="alert" added to the destructive-warning chip in remove_section (folded in opportunistically while touching the file).
- Round-1 FE Low #9 — 3 error-path tests (mockRejectedValueOnce) for createSection / deleteSection / patchSection (folded in opportunistically).
Deferred to a follow-up sweep (non-blocking):
- R2-L4 to R2-L9 (BE) — 6 BE Lows from the BE subagent's findings (mix of comment polish, retry-test fixture symmetry, observation-only items). Captured in .cursor/BACKLOG-budget-bot-4.md.
- B0.8b — Cross-product title-format alignment for Goal-MIPER + the older budget_doc_generator surfaces (R2-H1 deferral).
- B7.10 — Measure prompt-cache savings against prod telemetry (R2-M2 follow-up).