## Screenshots
<img width="1909" height="940" alt="image" src="https://github.com/user-attachments/assets/3bb75503-0938-4bbc-aa96-2d242e46eacc" />
## Summary
- Replaces the dismiss-only DocChangedBanner with a "Reload from Google Doc" primary action that fetches live Drive content + persists a fresh revisionId. Unblocks the persistent-stale-revision dead-end the user hit during May 8 demo testing on Skyvera Q2.
- Adds capture-race resilience to the post-publish revisionId fetch: capture_stable_revision_id polls Drive 2–3 times with 250ms + 500ms backoff and accepts the value once two consecutive reads agree. Closes the underlying bug that produced the dead-end in the first place.
- New POST /board-doc/wizard/{id}/reload-from-doc endpoint replaces session.generated_sections with Drive's parsed content (matched back to spec section_ids by normalised title), refreshes google_doc_revision, and surfaces provenance counters (sections_replaced / sections_preserved / sections_dropped_from_drive).
- Squashed-on: B3.18(a) widens the chat-history window from 10 → 100 turns (sized for ~100 messages per quarterly doc) and adds a slice-boundary safety wrapper that fixes a latent "first turn must be from the user" bug Anthropic could trip on after the 6th chat round.
- Internal review (37 issues, 2 Crit + 4 High + 9 Med + 13 Low + 9 Nits) — addressed in commit 06ce1ad56. All 2 Criticals + 4 Highs + 9 Mediums + the quick-win Lows fixed in this PR; remaining FE L3/L4/L6/L8 + Nits filed as B0.10.1 follow-up polish.
Backlog: closes B0.10 + B3.18(a), and confirms B0.7 as already-shipped (the revision_stale field + chip were wired through during PR #2750 but never marked done in the local backlog). No Linear tickets — tracked in .cursor/BACKLOG-budget-bot-4.md.
## Why it's needed
During PR #2750 demo testing on May 8, a series of /sync pushes left the Skyvera Q2 session stuck in a catch-22:
- /sync-status consistently returned changed: true (BE log: Google Doc <id> changed: stored=AMHacu72, current=AFwiY18U)
- DocChangedBanner rendered on every page load
- Every subsequent /sync 409'd against detect_external_changes before it could capture + persist a fresh revisionId
User confirmed nobody had edited the doc externally — the divergence was internal. Root cause: Drive's revisionId keeps bumping for a few hundred ms after batchUpdate returns (internal indexing / entity-detection). The pre-fix _capture_revision_id read it in a single call; if that read caught Drive mid-bump, session.google_doc_revision ended up one bump behind Drive forever, and every detect_external_changes report flagged the doc as "changed". No FE escape hatch shy of manually clearing the field in DDB.
Two interlocking gaps to close: prevent the race when possible (stabilising capture), AND give the user a recovery path when prevention fails (Reload-from-Doc button).
## Changes
BE — capture-race resilience (klair-api/budget_bot/board_doc/gdoc_sync.py + klair-api/routers/board_doc_router.py):
- New top-level capture_stable_revision_id(document_id) polls Drive's revisionId up to 3 times with 250ms + 500ms backoff between reads. Returns the value once two consecutive reads agree (logs INFO with the backoff-that-settled-it). If 3 reads never agree, returns the most-recent value and logs WARNING so the operational pattern is visible.
- wizard_sync_to_doc now calls it via asyncio.to_thread instead of the inline _capture_revision_id lambda. ~250ms extra in the common case (one mandatory wait between reads 1 and 2 for the stable-pair confirmation), up to ~750ms when the race fires and three reads are needed.
- Transport errors (HttpError, ConnectionError, etc.) propagate as before — the helper's retries target propagation race, not transport flakiness, so a deterministic auth failure fast-fails without burning 750ms.
BE — Reload-from-Doc endpoint (klair-api/routers/board_doc_router.py):
- POST /board-doc/wizard/{session_id}/reload-from-doc re-reads the doc via read_google_doc_sections, matches Drive section titles back to spec section_ids via _normalise_section_title (lowercase + whitespace-collapse — tolerates rename drift between editor + Drive), replaces session.generated_sections with the parsed Drive content, refreshes google_doc_revision to the freshly-read value.
- Persistence via _save_with_merge_retry_or_raise so a parallel writer (chat turn, autosave) doesn't lose the reload.
- Preserves spec sections that didn't appear in Drive's parse (local-only edits stay). Drops Drive sections unknown to the spec (logged + counted).
- Returns ReloadFromDocResponse with reloaded, google_doc_id, revision, sections_replaced, sections_preserved, sections_dropped_from_drive.
- Error surface: 400 no-google-doc; 409 no-spec; 502 empty Drive return (refuses to clobber session); 502 transport error; 502-with-distinct-copy on doc-deleted-in-Drive (FE / support can tell the two flavours apart from copy alone).
- Note: reads revisionId directly from the parser's documents.get response — pure reads don't hit the indexing race, so no stabilisation needed (would just burn 750ms for no gain).
FE — Reload-from-Doc recovery surface:
- klair-client/src/services/boardDocApi.ts — new reloadFromGoogleDoc(sessionId, getToken) API client + ReloadFromDocResponse interface.
- klair-client/src/screens/BoardDoc/hooks/useDocumentEditor.ts — new reloadDocument() action on the hook return. Flushes loadedRef.current + lastSavedSectionsRef.current + bumps a reloadEpoch counter wired into both the reset and load effects' deps, so the next render re-fetches every section body. Also clears the autosave baseline so post-reload edits compute their dirty state against the freshly-fetched server content (regression target: would otherwise flip isDirty=true immediately after a reload even with no user input).
- klair-client/src/screens/BoardDoc/components/DocumentEditor.tsx — rewrote DocChangedBanner to expose a primary "Reload from Google Doc" button (with spinner + disabled state during reload), a secondary dismiss X, and an inline error chip beneath the banner copy on failure (primary affordance stays mounted for retry). New handleReloadFromDoc callback calls the API + wizard.refreshSession({ silent: true }) + editor.reloadDocument() + clears the docChanged / syncRevisionStale flags. Errors stay in the banner (don't collapse into a global toast).
- klair-client/src/screens/BoardDoc/steps/ReviewStep.tsx — same banner + handler shape for the legacy 3.0 wizard surface (any new-session-not-clone-from-prior path lands here). Per-section content cache + expanded section reset on reload so the next expand re-fetches.
Tests:
- klair-api/tests/board_doc/test_capture_stable_revision_id.py (6): stable on first read / stabilises on second backoff / never stabilises (warns) / all-empty reads (distinct warn) / transport error fast-fails / 2-Drive-call count on the common-case path.
- klair-api/tests/board_doc/test_reload_from_doc_endpoint.py (10): happy path replaces matching sections + updates revision / preserves local-only sections / drops Drive-only sections / title matching is case+whitespace-insensitive / 400 no-doc / 409 no-spec / 502 empty Drive / 502 transport error / 502 distinct copy on doc-deleted-in-Drive / 404 unknown session.
- klair-client/src/screens/BoardDoc/components/__tests__/DocChangedBanner.spec.tsx (10): primary Reload button renders + fires onReload / dismiss fires onDismiss / spinner + disabled while reloading / disabled state actually blocks click / error chip renders with server detail / no chip when no error / error chip dismiss / primary button stays mounted under error / role="alert" surface count.
- klair-client/src/screens/BoardDoc/hooks/__tests__/useDocumentEditor.reloadDocument.spec.ts (2): re-fetches every section body on reload (4 fetches vs 2 sans-reload) / autosave baseline reset keeps isDirty=false after reload.
B3.18(a) — chat history window 10 → 100 + slice-boundary safety (klair-api/budget_bot/board_doc/wizard_orchestrator.py):
- New _CHAT_HISTORY_WINDOW = 100 constant near handle_chat. Sized for ~100 messages per quarterly doc (matching DR's Claude Code reference session). ~50K tokens at ~500 tokens/turn — well inside Opus 4.7's 200K context window after the 80K full-doc cap + focused-section + findings.
- New _safe_chat_history_slice(conversation) helper takes [-_CHAT_HISTORY_WINDOW:] THEN drops any leading non-user messages so the Anthropic Messages API gets a user-first payload. Fixes a latent bug where the legacy [-10:] could 400 on the 6th chat round of any session that hadn't hit a tool-use rebalance (slice ends on the new user turn → odd-length conversation → slice head can land on an assistant turn).
- handle_chat now calls _safe_chat_history_slice(session.conversation) instead of the hard-coded [-10:].
- Other chat surfaces (GM Commentary refinement [-12:], product detail commentary [-12:]) intentionally NOT touched — they're different surfaces with their own conversation streams; B3.18 is scoped to the main handle_chat path only.
Tests:
- 8 new in klair-api/tests/board_doc/test_safe_chat_history_slice.py: window-size invariant + short-conversation passthrough + long-conversation cap + slice-boundary drop-leading-assistant + boundary-aligned full-window keep + multiple-leading-non-user defensive drop + empty input + all-assistant pathological (returns empty).
Path (b) (session-summary anchor at index 0 of the trimmed window) stays the architectural follow-up — re-open as B3.18(b) when real prompts start pushing past 100 turns OR B6 (multi-turn agent loop) lands and the summary anchor becomes load-bearing.
Backlog:
- .cursor/BACKLOG-budget-bot-4.md — B0.10 marked DONE with the shipped scope; B3.18(a) marked DONE with the squashed-on scope; B0.7 marked DONE with the "already-shipped during PR #2750, just never updated here" note; milestone summary updated.
## Breaking changes
None. The new endpoint is additive; the banner contract change is internal to DocumentEditor.tsx + ReviewStep.tsx (the banner is a local component, not consumed externally). capture_stable_revision_id is a strict superset of the prior single-read behaviour — adds latency on the racy case, identical on the common case.
## Test plan
Automated (post-internal-review):
- [x] uv run pytest tests/board_doc/ — 1327 passed (was 1316 → +11 across B0.10 + B3.18(a) + internal-review regression tests)
- [x] uv run ruff format + ruff check on changed BE files — clean
- [x] uv run pyright on changed BE files — clean (1 pre-existing tools= warning unrelated)
- [x] npx vitest run src/screens/BoardDoc src/services/__tests__ — 379 passed (was 369 → +13 across banner-density / C1 / C2 / H3 / M6 integration / BE L5 / M1-fix)
- [x] npx eslint --max-warnings 0 on changed FE files — clean
- [x] npx tsc --noEmit — clean
Manual — same 4-prompt demo path as PR #2750 + a sync at the end. Start a Skyvera Q2 2026 cloned session, then:
- [ ] Prompt 1: *"The prior quarter review section only has an outline. Can we generate content for it based on prior quarter performance?"*
- [ ] Prompt 2: *"Excellent, now can you add a GM Commentary section above the PQR that gives an executive level summary of the quarter for Skyvera."*
- [ ] Prompt 3: *"Can you add a comment to the relevant section that has the gross margin warning from the review?"*
- [ ] Prompt 4: *"How would you grade this plan for Skyvera?"*
- [ ] Click Sync at the end and confirm a clean 200 (no revision_stale chip, no DocChangedBanner, no 409 on a follow-up sync).
What this naturally covers:
- The closing sync exercises capture_stable_revision_id (B0.10 BE Part 1) on the happy path — the new stabilising poll runs post-publish on every sync.
- The 4-prompt sequence exercises the widened chat history (B3.18(a)) — multiple regenerate_section / add_section / add_comment tool rounds accumulate tool_use + tool_result blocks fast, so by prompt 4 the conversation depth is well past the legacy 10-message slice. If Claire stays grounded in the original framing (BU, quarter, the regenerated PQR content) rather than asking "what were we doing again?", the bump is doing its job.
What this does NOT cover (intentionally — automated tests handle these):
- The Reload-from-Doc FE button + recovery flow (B0.10 FE). Only renders when /sync-status reports changed: true, which the natural sync path doesn't produce. Covered by the 10 DocChangedBanner component tests + the 2 useDocumentEditor.reloadDocument hook tests.
- The POST /reload-from-doc endpoint (B0.10 BE Part 2). Covered by the 10 endpoint tests (happy path, title normalisation, preservation, drop, 400/409/502/404).
- The capture-race repro (3-reads-no-agreement path). Covered by the 6 capture_stable_revision_id tests.
## Follow-ups
- B0.10.1 — non-blocking polish deferred from this PR's internal review: FE L3 (typed error class for reloadFromGoogleDoc), FE L4 (reloadDocument JSDoc), FE L6 (saveInFlightRef lifecycle co-ownership on loadAll error), FE L8 (double onActiveSectionChange on reload), assorted Nits. Tracked in .cursor/BACKLOG-budget-bot-4.md.
- B0.7 confirmed as already-shipped during PR #2750; backlog updated.
- B1.7 path (b) (clone-aware refresh detector) — unchanged.
## Review history (internal)
PR went through one internal-review pass before reviewer escalation (37 issues across BE + FE — 2 Crit, 4 High, 9 Med, 13 Low, 9 Nits). All Criticals + Highs + Mediums + quick-win Lows fixed in commit 06ce1ad56; the test surface for each fix is pinned with a dedicated regression test so a future refactor that drops the fix surfaces at test time. Highlights:
- C1 + C2 (autosave races on the FE recovery surface) — cancelPendingPersist() action exposed from the hook; called at the top of both reload handlers BEFORE the API call. Confirm-discard prompt added when isDirty. Integration test mounts DocumentEditor, dirties the editor, clicks Reload, asserts no updateSection PUT fires.
- H1 (BE preservation loop reading stale snapshot) — moved INSIDE the save_with_merge_retry closure so it reads from the freshly-refetched session. Regression test simulates a sibling write landing between the Drive read and save.
- H4 — DocChangedBanner extracted to a shared component with a density prop; both the 4.0 ('compact') and 3.0 ('comfortable') surfaces import it, the spec covers both densities.