## Screenshots
<img width="1585" height="652" alt="Screenshot 2026-04-21 141135" src="https://github.com/user-attachments/assets/e64df1dc-ce3f-4405-9474-46dfba88f231" />
<img width="1615" height="687" alt="Screenshot 2026-04-21 141214" src="https://github.com/user-attachments/assets/e4713e62-1a8c-4bb7-a92e-639c51c85a79" />
<img width="936" height="885" alt="Screenshot 2026-04-21 141329" src="https://github.com/user-attachments/assets/7739571d-7747-42c9-99c2-901c44fc6dcf" />
<img width="1592" height="898" alt="Screenshot 2026-04-21 141410" src="https://github.com/user-attachments/assets/127dff8d-f05d-4fcc-a421-218d72262399" />
<img width="1601" height="867" alt="Screenshot 2026-04-21 141435" src="https://github.com/user-attachments/assets/303ebad4-34c2-43fb-8f0c-f5ba980a6d81" />
---
## Summary
Hardens the analytics-worker projection layer in three categories, each driven by a specific finding from this week's [Convex projections audit](../blob/main/context_cache/audit_aerie_convex_projections.md):
1. Tier 1 — reliability: kill the "Many documents read" warnings the worker logs every cycle on joeChartsBudgetActuals and similar tables (~29k/32k docs read per upsert today).
2. Tier 2 — stop wasting work: drop projections with no consumers; refresh static reference tables on a daily/weekly cadence instead of hourly.
3. Tier 3 — observability: add live freshness indicators to dashboard headers so leadership can see "data as of X" at a glance — counters the recurring "but is this even current?" question that drives most "let's just hit Redshift" momentum.
Goal is to make the hourly worker boring: no warnings, no failed batches, no wasted writes.
---
## Prod impact (please prioritise this PR's review)
Dev Convex went from 18 distinct critical/warning insights to 5 after the original PR #118 commits landed. Verified on the Aerie dev Convex Health -> Insights dashboard.
The 13 that disappeared map directly to PR #118 work:
- 5 joeCharts:upsert* "Nearing documents read limit" criticals/warnings -> resolved by Tier 1 composite-index refactor
- 4 replaceEnrollmentCohortStudents / marketing:upsertMarketingEventContacts / enrollment:clearPipelineStudents / marketing:insertWeeklyDeposits write-conflict criticals -> resolved by .unique() migration + dead-projection drops reducing write contention
- 4 retried-write-conflict warnings on the same tables -> same root cause
Prod is still on main and is therefore still hitting all 18 — including the criticals. Until this PR merges and deploys, the prod worker is one growth nudge away from upsertBudgetActuals hitting the 32k document read limit and failing outright (it was at ~29k as of the audit).
The two commits added since the original review (below) cover 2 of the remaining 5 dev warnings — the most dangerous one (a separate read-limit risk on _checkPipeline at 31k/32k) and a silent data-loss bug on expenseTransactions that was under-counting prod by every multi-line QuickBooks expense. After these merge, dev should be at 3 warnings, all auto-retried tolerable noise.
### Follow-up commits added since initial review
| Commit | Issue | Fix |
|---|---|---|
| [a8b2b7e](#) | expenseTransactions was keying upserts on transactionId alone. QuickBooks expenses split across multiple schools share one transaction_id (e.g. one Dell invoice → $379 Alpha Miami + $1,137 Austin K-8). Each subsequent line silently overwrote the previous via .unique() patch. Local Aerie was under-counting prod by 102 vs 101 in the 4/13–4/19 sample — exactly the multi-line transactions in the window. | Project the source PK (staging_education.quickbooks_expense_transactions.id BIGINT IDENTITY) as sourceRowId, key the Convex upsert on it. Matches Klair's quickbooks_expense_analysis_service.py which surfaces id directly. Logs when a patch changes line amount materially (audit trail). Adds pnpm backfill-expense-transactions for fast local re-population. Verified: 102 rows / $37,111.41 in window, exact parity with Klair. |
| [d2ce81c](#) | dataConsistency:_checkPipeline was at 31k/32k document reads — 3% from the same total-failure mode the joeCharts upserts had. pipelineFunnel accumulates history (every snapshot/program/stage row forever), and the function .collect()'d the whole table just to find the latest snapshot per program. | Two indexed reads per program (latest-date probe + same-date scan) using existing by_programCode_snapshotDate_stageId index. Reads drop from O(historyDepth × programs × stages) → O(programs × (1 + stages)) — about 310 reads vs 30k+ for a 39-program / 7-stage funnel. Verified locally against prod-imported data: runConsistencyCheck returns 663 stages across 39 programs, no truncation, no warnings. |
| [b4ad842](#) | Joe Charts → Financial → Model vs Actuals showed "Actual $/student" 2–10× inflated vs Klair across every category. Two coordinated bugs: (a) queryPlRecords did not ::text-cast its date columns, so the Postgres driver returned JS Date objects which z.coerce.string() then mangled into "Thu Jan 01 2026 00:00:00 GMT-0600 ..." strings that broke every period filter; (b) getModelVsActuals filtered >= periodStart with no upper bound while Klair's _compute_period_dates returns (start, today) and uses >= start AND <= today, so forward-looking fct_pl entries inflated YTD totals. | Added ::text cast to queryPlRecords (sync side) and execPeriodEnd() helper applied at the one Joe Charts site that aggregates from plRecords (chat side). Deliberately left the 5 other Joe Charts queries that aggregate from joeChartsBudgetActuals (getKPIs, getBudgetVarianceByClass, getExpenseBreakdown, getAlerts) on >= start only — Klair's equivalent queries against consolidated_budgets_and_actuals use the same start-only pattern (joe_charts_edu_service.py:233, :720), and adding an upper clamp would zero out current-quarter KPIs whenever month-end snapshot dates fall after today. Also adds pnpm backfill-pl-records for fast local re-population. Verified: Model vs Actuals (Holdings, Portfolio, YTD, 3,200 students) now matches Klair row-for-row across all 9 categories. |
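The keying bug fixed in a8b2b7e reduces to the sketch below — an in-memory model, not the real Convex mutation. The rows are modelled on the Dell example in the table above, and the field names are assumptions; the point is that a map keyed on transactionId alone can only hold one line per transaction, while keying on the projected source PK keeps every line.

```typescript
// Why keying on transactionId alone lost multi-line QuickBooks expenses.
// Illustrative rows modelled on the Dell example; field names are assumptions.
type ExpenseLine = {
  transactionId: string;
  sourceRowId: string; // projected source PK (staging table id)
  school: string;
  amount: number;
};

const lines: ExpenseLine[] = [
  { transactionId: "T-DELL", sourceRowId: "101", school: "Alpha Miami", amount: 379 },
  { transactionId: "T-DELL", sourceRowId: "102", school: "Austin K-8", amount: 1137 },
];

// Old upsert key: the second line silently overwrites the first.
const byTransactionId = new Map(lines.map((l) => [l.transactionId, l]));

// New upsert key: both lines survive, totals match the source.
const bySourceRowId = new Map(lines.map((l) => [l.sourceRowId, l]));

const total = (m: Map<string, ExpenseLine>) =>
  [...m.values()].reduce((acc, l) => acc + l.amount, 0);
```

Under the old key the projection keeps one of the two lines (whichever lands last), so per-school totals drift exactly by the dropped lines.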
### Known follow-up
| Dashboard | Issue | Plan |
|---|---|---|
| Joe Charts → CAPEX & AP | Small data drift vs Klair: Total Open card shows 166 invoices / $16.70M period AP vs Klair's 196 / $16.50M; Current Days $-7K and 91+ Days $-10K despite matching invoice counts. Per-bucket counts and amounts for 1-30 / 31-60 / 61-90 buckets match exactly. | Three projection-layer bugs: (1) r.lineAmount ?? r.totalAmount fallback in refresh.ts:1095 inflates totals when source rows have null line_amount (Klair uses 0); (2) double pro-rate of balance — once at projection write time, again at dashboard read time; (3) total_transactions returns exploded line-item count rather than source-row count. Cleanest fix flattens joeChartsQBAP to per-line rows matching Klair's source schema and removes the JSON re-explode round-trip in the dashboard. Out of scope here — will be a follow-up PR (~2-3h refactor + verification). This is the last apparent Joe Charts parity issue. |
### Intentionally deferred with rationale
| Issue | Rationale |
|---|---|
| Medium #9 — Redshift queries still run every cycle | Cadence only skips Convex writes. Convex cost dominated the hot path; Redshift queries on static reference tables are milliseconds. Documented; not a safety issue. |
| Medium #10 — Optional fields in composite keys | Duplicate check on imported prod data showed 0 collisions across all 7 tables. Fixing in-mutation creates duplicates against existing data; needs a one-time migration PR. File issue. |
| Medium #11 — NetSuite AP status in composite key | Needs finance input on whether table semantically wants "current snapshot" (drop status from key) or "history" (keep). File issue. |
| Medium #12 — UTC date in tooltip can be off-by-one | Minor UX polish. Swap to toLocaleDateString(...) in a follow-up. |
| Low #14 — Redundant Tooltip.Provider on badge | Nested providers are a no-op per Radix docs; keeping the self-wrap lets the component work in unit tests without additional setup. |
| Low #15 — new args object per useQuery render | Convex compares serialized args; useMemo is micro-optimization that trades clarity for ~nothing. |
| Low #16 — rolling vs calendar interval semantics | Intentional and documented in module comment. Fine for an internal tool. |
| Nit #17 — queryFunnelActivity location in PR description | Corrected in this body update. |
---
## Tier 1 — Reliability
Seven joeCharts upserts shared the same anti-pattern: a partial withIndex() narrowing followed by .filter() over the trailing fields of the unique key. For popular dimensions (e.g. Holdings BU on upsertBudgetActuals) this scanned thousands of docs per record and tripped Convex's "Many documents read" warning at ~29k/32k.
Fix per table: add a composite by_uniqueKey index covering the FULL unique key, then refactor the mutation to use .unique() against that index. Lookup goes from O(n-per-key) to O(log n).
| Mutation | Old: index + filter | New: composite index |
|---|---|---|
| upsertBudgetActuals | (BU, dataSource) + filter (period, className) | (BU, dataSource, period, className) |
| upsertStudentEnrollment | (campusName) + filter (status, quarter) | (campusName, status, quarter) |
| upsertMapGrowth | (campus) + filter (termName) | (campus, termName) |
| upsertMapGrowthBySubject | (campus, subject) + filter (grade) | (campus, subject, grade) |
| upsertFinancialModels | (schoolName, modelName) + filter (metricName) | (schoolName, modelName, metricName) |
| upsertCapex | (periodType) + filter (className) | (className, periodType) |
| upsertNetSuiteAP | (periodType) + filter on 4 fields | (vendorName, periodType, subsidiaryName, className, status) |
There is also a correctness bonus: .unique() enforces the invariant the schema implies (one row per composite key), where the old .filter().first() would silently pick whichever row the index returned first if duplicates ever existed.
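The read-count math behind the refactor can be modelled in a few lines. This is a toy in-memory sketch, not the Convex API, and the field names are placeholders for upsertBudgetActuals' key: the old path reads every document sharing the two-field index prefix, while a composite index over the full unique key resolves to a point lookup.

```typescript
// Toy model of the Tier 1 refactor — NOT the Convex API, just the read-count math.
type Doc = { bu: string; dataSource: string; period: string; className: string };

// 1,000 docs all under one popular prefix (the "Holdings BU" hot spot).
const docs: Doc[] = Array.from({ length: 1000 }, (_, i) => ({
  bu: "Holdings",
  dataSource: "actuals",
  period: `2026-${i}`,
  className: "Opex",
}));

// Old: partial withIndex() narrowing + trailing .filter() — scans the whole prefix.
let docsRead = 0;
function oldLookup(k: Doc): Doc | undefined {
  return docs
    .filter((d) => {
      docsRead++; // every doc in the (bu, dataSource) prefix counts as a read
      return d.bu === k.bu && d.dataSource === k.dataSource;
    })
    .find((d) => d.period === k.period && d.className === k.className);
}

// New: composite index over the FULL unique key — a point lookup.
const byUniqueKey = new Map(
  docs.map((d) => [[d.bu, d.dataSource, d.period, d.className].join("\u0000"), d]),
);
function newLookup(k: Doc): Doc | undefined {
  return byUniqueKey.get([k.bu, k.dataSource, k.period, k.className].join("\u0000"));
}
```

Scale the prefix up to the real tables and the old path is what was burning ~29k of the 32k read budget per upsert.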
---
## Tier 2 — Stop wasting work
### 2a. Stop refreshing dead projections
analyticsDeals and contactActivity have zero runtime consumers — no dashboard, AI chat tool, or cron query reads them; they were only self-merged by the writer (audit Finding 3). The cron stops calling upsertDeals and upsertContactActivity; queryFunnelActivity is also dropped since nothing else needs it. queryDeals is kept because contact enrichment uses deals to derive enrollment periods even though we don't persist them anymore.
Schema entries and the upsert mutations themselves are preserved with @deprecated JSDoc — a follow-up PR will drop the tables once we're sure existing rows are safe to remove for a release cycle. Two-step on purpose so any forgotten consumer surfaces in this PR's bake-in window rather than as a runtime error.
### 2b. Per-table refresh cadence
New module sync/src/analytics/refresh-cadence.ts lets domains opt into weekly or daily refresh instead of the default hourly. Applied to the seven non-chat-only static tables flagged in the audit:
| Cadence | Domains |
|---|---|
| weekly | schoolMappings, crossSystemMappings, joeChartsCapacityMap, joeChartsSchoolMetadata |
| daily | programs, joeChartsUnitEconomics, joeChartsFinancialModels |
Implementation is opt-in: anything not in REFRESH_TIERS defaults to hourly so behaviour for time-series tables (snapshots, projections, funnel) is unchanged. In-memory tracker on the worker process — restart forces a refresh on next cycle, which is fine.
Per-table chat-only projections are not touched in this PR.
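A minimal sketch of the cadence gate (the shape is an assumption; the real module is sync/src/analytics/refresh-cadence.ts): domains not listed in REFRESH_TIERS fall through to hourly, and the tracker is a plain in-process map, so a restart simply refreshes everything on the next cycle.

```typescript
// Sketch of the opt-in cadence gate described above — assumed shape, not the real module.
type Tier = "hourly" | "daily" | "weekly";

const TIER_MS: Record<Tier, number> = {
  hourly: 60 * 60 * 1000,
  daily: 24 * 60 * 60 * 1000,
  weekly: 7 * 24 * 60 * 60 * 1000,
};

// Only static tables opt in; anything unlisted defaults to hourly.
const REFRESH_TIERS: Record<string, Tier> = {
  schoolMappings: "weekly",
  programs: "daily",
};

// In-memory per-process tracker — lost on restart, which just forces a refresh.
const lastRefresh = new Map<string, number>();

function shouldRefresh(domain: string, now: number): boolean {
  const tier = REFRESH_TIERS[domain] ?? "hourly";
  const prev = lastRefresh.get(domain);
  if (prev !== undefined && now - prev < TIER_MS[tier]) return false; // skip: within cadence
  lastRefresh.set(domain, now);
  return true;
}
```

The worker would call shouldRefresh(domain, Date.now()) at the top of each domain's refresh and log the "skipped (within {tier} cadence)" line on a false return.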
---
## Tier 3 — Freshness indicators on dashboards
Counters the "but how do I know if this data is even current?" anxiety that drives most "let's just hit Redshift directly" momentum. Reactive Convex makes the live badge nearly free — the moment a fresh projection lands, the badge re-renders without a refresh button.
- chat/convex/analytics/freshness.ts — getDomainFreshness + getSingleDomainFreshness queries for five domains (admissions, joeCharts, wiki, pmo, camps). Uses Convex's always-available _creationTime system index (.order("desc").first()) so it works against any representative table regardless of whether it has a snapshotDate index.
- New <DashboardFreshnessBadge> component — color-tiered (fresh <2h green / aging <24h yellow / stale red / unknown slate). Self-wraps with Tooltip.Provider so it works in any rendering context including unit tests.
- Wired into FunnelView (admissions) and CampsView (camps) as a PoC. Other dashboards can adopt the same one-liner: <DashboardFreshnessBadge domain="..." />.
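The badge's colour tiering is simple enough to state exactly. A sketch with the thresholds from the bullet above (the function name and tier labels are assumptions, not the component's real API):

```typescript
// Sketch of the freshness badge's tiering logic — thresholds from the description above.
type FreshnessTier = "fresh" | "aging" | "stale" | "unknown";

const HOUR_MS = 60 * 60 * 1000;

function freshnessTier(ageMs: number | null): FreshnessTier {
  if (ageMs === null) return "unknown"; // representative table is empty → slate
  if (ageMs < 2 * HOUR_MS) return "fresh"; // green
  if (ageMs < 24 * HOUR_MS) return "aging"; // yellow
  return "stale"; // red
}
```

ageMs would come from the getDomainFreshness query (now minus the representative row's _creationTime), with null signalling the "Unknown" graceful-degradation case.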
---
## Tests
| Suite | Result |
|---|---|
| sync (full suite, 46 files, 654 tests) | passing |
| chat/convex/freshness.test.ts (4 new) | passing |
| chat/convex/buildoutDetails.test.ts (12) | passing |
| chat/convex/dashboards.test.ts (41) | passing |
| chat/convex/eventProcessor.test.ts (42) | passing |
| chat/components/dashboards/admissions/camps/__tests__/camps-view.test.tsx (8, modified due to badge integration) | passing |
New test coverage:
- 6 unit tests on refresh-cadence covering tier defaults, independent domain tracking, hourly fallback for unlisted domains, and a REFRESH_TIERS allow-list test that catches accidentally adding chat-only tables here.
- 4 query tests on getDomainFreshness — empty-table case, non-empty selection, single-domain variant, unknown-domain fallthrough.
- Updated 2 refreshEntityRecords tests to reflect the new "deals not upserted" semantics; failure-isolation property still under test.
Pre-existing failures NOT fixed by this PR: lib/__tests__/agent.test.ts has 2 getDataDir tests that fail on main today (verified). Unrelated to this work; tracked separately.
---
## Deploy sequence (one-time)
The expense sourceRowId migration (commit a8b2b7e) leaves legacy rows in
prod that the new upsert can't see -- without the purge below, every sync
cycle would INSERT a fresh row alongside each legacy row and the Expense
Analysis dashboard would double-count. Run these steps in order when
deploying this PR to any environment with pre-existing expenseTransactions
data:
1. Merge & deploy (Convex schema + functions land first via the existing
pipeline).
2. Pause the analytics worker (or wait until it's idle between cycles)
so it can't re-insert ghost rows mid-purge.
3. Run the purge until hasMore returns false -- it pages 4k rows per
call to stay under the Convex 32k document read limit:
   ```shell
   # repeat until { ..., "hasMore": false }
   npx convex run analytics/expenses:_clearLegacyExpenseTransactions
   ```
4. Repopulate with the corrected schema. Either resume the worker (next
cycle picks up where the watermark left off) or run the focused backfill
for faster validation:
   ```shell
   pnpm --filter @bran/sync backfill-expense-transactions --dry-run  # preview
   pnpm --filter @bran/sync backfill-expense-transactions            # for real

   # The script prompts for confirmation when CONVEX_URL points at a non-dev
   # deployment; set BACKFILL_CONFIRM=yes to bypass for non-interactive runs.
   ```
5. Resume the analytics worker.
6. Spot-check the Expense Analysis dashboard -- counts and totals should
match Klair's QuickBooks Expense Analysis view for the same date window
(we verified 102 rows / $37,111.41 in 4/13-4/19 against prod Klair).
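The paging contract step 3 relies on can be sketched as a loop: call until hasMore comes back false. Here makeClearPage is a hypothetical in-memory stand-in for the real mutation, which deletes up to 4k rows per call to stay under the 32k read limit.

```typescript
// Toy stand-in for the paged purge mutation in step 3 — not the real Convex function.
// Each call deletes up to one page of legacy rows and reports whether rows remain.
function makeClearPage(totalRows: number, pageSize = 4000) {
  let remaining = totalRows;
  return (): { deleted: number; hasMore: boolean } => {
    const deleted = Math.min(pageSize, remaining);
    remaining -= deleted;
    return { deleted, hasMore: remaining > 0 };
  };
}

// The caller loops until hasMore is false — mirroring the repeated `npx convex run`.
function purgeAll(clearPage: () => { deleted: number; hasMore: boolean }): number {
  let calls = 0;
  let res;
  do {
    res = clearPage();
    calls++;
  } while (res.hasMore);
  return calls;
}
```

For example, 10,500 legacy rows would take three calls (4,000 + 4,000 + 2,500); an already-empty table terminates after one.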
P&L records (commit b4ad842) follow the same pattern but are simpler --
the worker will overwrite the bad date strings on its next cycle, or you
can force-refresh with pnpm --filter @bran/sync backfill-pl-records. No
purge needed because the upsert keys on plId (which existed pre-migration).
Follow-up PR after a bake-in window: tighten sourceRowId: v.optional(v.string())
to v.string() in analyticsSchema.ts so this class of ghost-row bug cannot
recur.
## Test Plan for Reviewer
CI runs tests, typecheck, and lint automatically.
### 1. Read the diff
- [ ] Confirm the 7 by_uniqueKey composite indexes in chat/convex/analyticsSchema.ts cover the FULL unique key for each table (matches the table in the Tier 1 section above)
- [ ] Confirm each refactored mutation in chat/convex/analytics/joeCharts.ts uses the new index AND switches .first() → .unique()
- [ ] Confirm REFRESH_TIERS in sync/src/analytics/refresh-cadence.ts only contains the 7 non-chat-only static tables
- [ ] Confirm the @deprecated mutations in chat/convex/analytics/entities.ts are still present (preserved for ad-hoc backfills) but no longer called by the cron
### 2. Local worker run (recommended — concrete signal that the reliability fix works)
```shell
# from repo root, with local Convex pointing at prod-like data
# (run pnpm sync:pull-prod-data first if your local is stale)
cd sync
$env:DATA_DIR = "../data"        # PowerShell — bash: export DATA_DIR=../data
npx tsx run-analytics-worker.ts  # or pnpm worker:analytics
```
Wait through one cycle (~5-10 min) and confirm:
- [ ] Each joeCharts push logs ✓ {table}: N records — zero "Many documents read" warnings, zero "X/N records failed" lines
- [ ] [analytics] Cross-system mappings pushed to Convex: N schools appears on first cycle (cadence-gated path fires when no prior run is recorded)
- [ ] No log lines for analyticsDeals or contactActivity upserts (those are now dropped)
- [ ] Joe Charts financial refresh completes in ~60s (previously slow + warning-prone)
After cycle 2 (let the worker keep running):
- [ ] [analytics] ⏭ {domain}: skipped (within {tier} cadence) lines appear for each tiered domain (cycle 2 within the daily/weekly window)
### 3. Local UI validation (optional)
```shell
docker compose build chat && docker compose up -d chat
```
Then open the browser:
- [ ] Open Admissions → Funnel → confirm freshness badge appears in the top filter row
- [ ] Open Admissions → Camps → confirm freshness badge appears (right-aligned)
- [ ] Hover the badge → tooltip shows "Data as of YYYY-MM-DD (Xh ago)"
- [ ] Confirm badge colour matches expected tier (green <2h, yellow <24h, red older, slate when no snapshot)
### 4. Direct query validation (optional, fast)
```shell
cd chat
npx convex run analytics/freshness:getDomainFreshness
```
Should return all 5 domains with snapshotDate and ageMs populated (or null/null if your local Convex is empty). Local validation result above shows what to expect when prod-like data is imported.
### 5. Post-deploy spot-check
After this PR merges:
- [ ] Tail prod analytics-worker logs through one full cycle — confirm zero "Many documents read" warnings (was firing on upsertBudgetActuals previously)
- [ ] Confirm the cadence-skip lines appear on the second cycle for the tiered domains
- [ ] Confirm dashboards still render correctly (no missing data from the removed analyticsDeals / contactActivity upserts)
---
## Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Adding a 4th index per table increases write cost | Each index is small (3-5 fields, no large strings). Net win because the upsert .unique() path drops the dominant cost (32k+ doc reads). Convex pricing model favours fewer reads over fewer indexes. |
| Cadence in-memory state lost on worker restart | First cycle after restart refreshes all tiered domains; acceptable cost. Persisting cadence to Convex would add complexity for marginal benefit (worker restarts are rare). |
| @deprecated dead-projection mutations still callable | Intentional — preserved for ad-hoc backfills if a consumer ever materialises. Real cleanup (drop tables, drop mutations) is a separate PR after a bake-in window. |
| Freshness badge picks the wrong representative table | Mapping is hand-coded and greppable; trivial to swap a table if a domain owner disagrees. The badge degrades gracefully to "Unknown" when its representative table is empty. |
| Existing rows have duplicates on the new composite key (would break .unique()) | Validated locally against prod-imported data: zero collisions across all 7 tables. If prod somehow has duplicates we don't have, the worker logs a clear errors[label] from the catch — visible in the next cycle's log. |
---
## What's NOT in this PR
- AI Chat-only projection demotion (audit Finding 2) — needs usage data on the chat agent before deciding. Held back per current direction.
- JSON-string → sub-table migrations (audit Finding 5) — structural debt, deserves its own PR per table with backfill code and rollout plan. File issue.
- Incremental sync for huge tables (audit Recommendation 7) — architectural change to runRefreshCycle, deserves its own discussion.
- Drop the analyticsDeals / contactActivity schema entries — wait one release cycle to confirm no hidden consumers, then remove tables in a follow-up.