## Summary
- Migrates 6 pipelines from `klair-udm`/`klair-misc`/`klair-api` into Surtr CDK infrastructure
- All schedules disabled (`enabled: false`) — enable after prod validation
- `cdk synth` passes for all 6 pipelines (dev stacks)
## Pipelines
| Pipeline | Compute | Klair source | Writes to |
|---|---|---|---|
| `google-sheets-surveys-sync` | Lambda | `klair-api/edu_schools/surveys/` | `staging_education.surveys_t2p`, `surveys_parent_session`, `surveys_dop_goal_meetings` |
| `school-financial-models-sync` | Lambda | `klair-api/edu_schools/google_sheets_models/` | `staging_education.google_sheets_school_financial_models` |
| `holdings-model-sync` | Lambda | `klair-api/edu_schools/google_sheets_models/` | `staging_education.holdings_unit_economics` |
| `kubera-passive-investments` | Lambda | `klair-udm/kubera/` | `core_finance.passive_investments_portfolio_*` |
| `quicksight-data-scraping` | ECS/Fargate | `klair-misc/qs_data_scraping/` | `staging_education.fall_to_spring_growth`, `student_growth_by_language` |
| `brokerage-ocr` | Lambda (thin wrapper) | `klair-udm/brokerage-ocr/` (SAM) | S3 (via existing SAM state machine) |
---
## Migration plans
### google-sheets-surveys-sync
What it does: Reads three Google Sheets survey sources (T2P, Parent Session, DOP Goal Meetings) and does TRUNCATE + INSERT into three Redshift tables in `staging_education`. Fully idempotent.
Current trigger: Daily cron in Klair (enabled). Klair source: `klair-api/edu_schools/surveys/sync_surveys_to_redshift.py`.
Cutover sequence:
1. Provision `surtr/google-service-account` secret (service account JSON)
2. Provision `surtr/surveys-sheet-urls` secret (JSON with keys `t2p_sheets`, `parent_session_sheets`, `dop_sheets` — copy sheet URLs from Klair config)
3. Deploy: `npx cdk deploy Pipeline-google-sheets-surveys-sync-dev -c env=dev`
4. Run manually and validate row counts in Redshift match Klair output
5. Disable Klair schedule, enable Surtr schedule (`enabled: true`)
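The idempotent TRUNCATE + INSERT load could be sketched roughly as follows. This is a minimal sketch, not the actual handler code: `build_load_statements` and `run_statements` are illustrative names, quoting is naive (no escaping), and the Data API call assumes the execution role can obtain cluster credentials.

```python
def build_load_statements(table: str, columns: list[str], rows: list[tuple]) -> list[str]:
    """Idempotent load: wipe the table, then insert the fresh sheet rows.
    Naive single-quote value formatting, for illustration only."""
    col_list = ", ".join(columns)
    values = ", ".join(
        "(" + ", ".join(f"'{v}'" for v in row) + ")" for row in rows
    )
    return [
        f"TRUNCATE TABLE {table};",
        f"INSERT INTO {table} ({col_list}) VALUES {values};",
    ]

def run_statements(statements: list[str], cluster: str, database: str, db_user: str) -> list[str]:
    """Submit each statement through the Redshift Data API (asynchronous;
    returns the statement IDs for later status checks)."""
    import boto3  # deferred so the SQL builder stays dependency-free
    client = boto3.client("redshift-data")
    return [
        client.execute_statement(
            ClusterIdentifier=cluster, Database=database, DbUser=db_user, Sql=sql
        )["Id"]
        for sql in statements
    ]
```

Because the wipe-and-reload runs as a whole each time, a partially validated run can simply be re-executed.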
---
### school-financial-models-sync
What it does: Reads specific cell ranges from the Austin K-8 financial model Google Sheet and does TRUNCATE + INSERT into `staging_education.google_sheets_school_financial_models`. Fully idempotent. School config is hardcoded in `handler.py` (44 cell references for Austin K-8).
Current trigger: Manual / ad-hoc in Klair. Klair source: `klair-api/edu_schools/google_sheets_models/sync_school_financial_models.py`.
Cutover sequence:
1. Provision `surtr/google-service-account` secret (shared with other Google Sheets pipelines)
2. Deploy: `npx cdk deploy Pipeline-school-financial-models-sync-dev -c env=dev`
3. Run manually and validate row counts match Klair output
4. Enable Surtr schedule if desired (currently no Klair schedule to disable)
---
### holdings-model-sync
What it does: Reads the AlphaSchools tab from the Alpha Holdings Model Google Sheet and does TRUNCATE + INSERT into `staging_education.holdings_unit_economics`. Fully idempotent.
Current trigger: Daily cron in Klair (enabled). Klair source: `klair-api/edu_schools/google_sheets_models/sync_holdings_model.py`.
Cutover sequence:
1. Provision `surtr/google-service-account` secret (shared)
2. Provision `surtr/holdings-model-config` secret: `{"sheet_url": "<Alpha Holdings Model URL>"}`
3. Deploy: `npx cdk deploy Pipeline-holdings-model-sync-dev -c env=dev`
4. Run manually and validate Redshift output
5. Disable Klair schedule, enable Surtr schedule
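At runtime, step 2's config secret might be read like this. A sketch only — the helper names are illustrative; the secret name and the `sheet_url` key come from the cutover steps above.

```python
import json

def parse_sheet_url(secret_string: str) -> str:
    """The secret's payload is JSON with a single sheet_url key."""
    return json.loads(secret_string)["sheet_url"]

def fetch_sheet_url(secret_id: str = "surtr/holdings-model-config") -> str:
    """Read the config secret from Secrets Manager and pull out the URL."""
    import boto3  # deferred so the parser stays dependency-free
    client = boto3.client("secretsmanager")
    resp = client.get_secret_value(SecretId=secret_id)
    return parse_sheet_url(resp["SecretString"])
```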
---
### kubera-passive-investments
What it does: Refreshes 5 portfolio overview cache tables in `core_finance` (summary, metrics, performance, risk, chart data) by reading from existing `passive_investments_holding_over_time` and related tables. Runs in overview-only mode by default (step 5 only — steps 1–4 are handled at runtime by the Klair API). Full rebuild can be triggered via `params.override_all_assets=true`.
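The mode dispatch described above might look roughly like this (`select_steps` is an illustrative name, not the real handler logic):

```python
def select_steps(params: dict) -> list[int]:
    """Overview-only by default (step 5); full rebuild of steps 1-5
    only when the override flag is set in the invocation params."""
    if params.get("override_all_assets") in (True, "true"):
        return [1, 2, 3, 4, 5]
    return [5]
```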
Idempotency: Yes — TRUNCATE + INSERT on all 5 cache tables.
Current trigger: Daily cron in Klair (enabled). Klair source: `klair-udm/kubera/lambda_handler.py`.
Porting notes: Uses `redshift-connector` with `iam=True` (Lambda execution role) to preserve the complex pandas-based query patterns from `run_investment_pipeline.py`. Removed module-level `raise ValueError` guards for unused env vars (ALPHA_VANTAGE_API_KEY, S3_DUMP_PATH, S3_TEMP_PATH, IAM_ROLE) so the module can be imported in overview-only mode without those vars set.
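The `iam=True` connection pattern could be sketched as follows. The cluster, database, and user values are placeholders, and `query_frame` is an illustrative wrapper; `fetch_dataframe` is redshift-connector's own cursor method for pulling results into pandas.

```python
def iam_connect_kwargs(cluster: str, database: str, db_user: str) -> dict:
    """Keyword arguments for redshift_connector.connect with IAM auth:
    temporary credentials come from the Lambda execution role."""
    return {
        "iam": True,
        "cluster_identifier": cluster,
        "database": database,
        "db_user": db_user,
    }

def query_frame(sql: str, **conn_kwargs):
    """Run a query and return the result as a pandas DataFrame."""
    import redshift_connector  # deferred so the kwargs helper has no dependency
    conn = redshift_connector.connect(**conn_kwargs)
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetch_dataframe()
    finally:
        conn.close()
```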
Cutover sequence:
1. Verify Lambda execution role has `redshift:GetClusterCredentials` for the cluster (IAM statement already in `pipeline.json`)
2. Deploy: `npx cdk deploy Pipeline-kubera-passive-investments-dev -c env=dev`
3. Run manually (overview-only) and verify the 5 cache tables are populated correctly
4. Disable Klair schedule, enable Surtr schedule
---
### quicksight-data-scraping
What it does: Uses Playwright (headless Chromium) to log into the QuickSight dashboard, set the Campus filter to Alpha Austin, download two CSV exports ("Fall to Spring Growth Multiples" and "MAP Growth per Level"), then DELETE + INSERT into two Redshift tables. Requires ECS/Fargate because Playwright browser binaries exceed Lambda's 250 MB package limit.
Idempotency: Yes — DELETE + INSERT on each run.
Current trigger: Manual / ad-hoc in Klair. Klair source: `klair-misc/qs_data_scraping/`.
Porting notes: Replaced `push_bulk_to_redshift` (S3 COPY) with `push_batch_to_redshift` (direct INSERT via redshift-connector IAM auth) to avoid needing an S3 staging bucket. Credentials fetched from Secrets Manager at runtime instead of env file.
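The DELETE + batch-INSERT replacement might look like this (a sketch: `build_batch_sql` and `push_batch` are illustrative names, and `conn` is assumed to be a redshift-connector connection):

```python
def build_batch_sql(table: str, columns: list[str]) -> tuple[str, str]:
    """DELETE plus a parameterized INSERT for one idempotent reload."""
    placeholders = ", ".join(["%s"] * len(columns))
    delete_sql = f"DELETE FROM {table};"
    insert_sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders});"
    return delete_sql, insert_sql

def push_batch(conn, table: str, columns: list[str], rows: list[tuple]) -> None:
    """Wipe the table, then insert all rows in one executemany batch."""
    delete_sql, insert_sql = build_batch_sql(table, columns)
    cur = conn.cursor()
    cur.execute(delete_sql)
    cur.executemany(insert_sql, rows)
    conn.commit()
```

Direct INSERT trades some bulk-load speed for not needing an S3 staging bucket, which is a fine trade at the row counts a dashboard export produces.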
Cutover sequence:
1. Provision `surtr/quicksight-credentials` secret:
```json
{
  "username": "user@alphaschool.org",
  "password": "...",
  "dashboard_url": "https://us-east-1.quicksight.aws.amazon.com/...",
  "login_method": "aws_sso"
}
```
2. Deploy: `npx cdk deploy Pipeline-quicksight-data-scraping-dev -c env=dev`
3. Run manually via ECS (or Step Functions console) and validate Redshift tables
4. Enable Surtr schedule if desired (no Klair schedule to disable)
---
### brokerage-ocr
What it does: Thin Lambda wrapper that starts the existing Brokerage OCR SAM Step Functions state machine and polls until completion. The actual OCR logic lives untouched in the SAM stack — this wrapper purely adds Surtr dashboard visibility and run history.
Current trigger: Manual in Klair. Underlying SAM state machine handles scheduling independently.
Porting notes: No business logic change. The SAM state machine ARN must be set in `pipeline.json` under `environment.BROKERAGE_OCR_STATE_MACHINE_ARN` before deploying.
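The start-and-poll loop the wrapper performs could be sketched like this (`run_and_wait` is an illustrative name; the ARN would come from the env var described above):

```python
import time

# Step Functions terminal execution states
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}

def is_terminal(status: str) -> bool:
    return status in TERMINAL_STATES

def run_and_wait(state_machine_arn: str, poll_seconds: int = 15) -> str:
    """Start the SAM state machine, then poll until the execution finishes."""
    import boto3  # deferred so the status helper has no dependency
    sfn = boto3.client("stepfunctions")
    arn = sfn.start_execution(stateMachineArn=state_machine_arn)["executionArn"]
    while True:
        status = sfn.describe_execution(executionArn=arn)["status"]
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
```

Keeping the wrapper synchronous is what gives the Surtr dashboard a single run record per OCR execution.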
Cutover sequence:
1. Find the SAM state machine ARN (check CloudFormation or Step Functions console)
2. Set `BROKERAGE_OCR_STATE_MACHINE_ARN` in `pipelines/runners/brokerage-ocr/pipeline.json`
3. Deploy: `npx cdk deploy Pipeline-brokerage-ocr-dev -c env=dev`
4. Trigger a manual run from Surtr and confirm the SAM execution completes successfully
---
## Key porting decisions
- gspread auth: Switched from file-based `service_account(filename=...)` to `service_account_from_dict()` (reads from Secrets Manager JSON)
- Redshift writes: Lambda pipelines use Redshift Data API; kubera uses `redshift-connector` with `iam=True` to preserve complex pandas query patterns
- numpy pin: Pinned `numpy<2.0` to avoid GCC 9.3 requirement on the CDK Lambda build container (Amazon Linux 2 / GCC 7.3.1)
- quicksight S3 COPY → batch INSERT: Avoids needing an S3 staging bucket in Surtr
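The gspread auth switch in the first bullet might look like this (sketch; `sheets_client` is an illustrative wrapper, while `gspread.service_account_from_dict` is gspread's real helper for in-memory credentials):

```python
import json

def parse_service_account(secret_string: str) -> dict:
    """The secret stores the Google service-account JSON verbatim."""
    return json.loads(secret_string)

def sheets_client(secret_string: str):
    """Build an authorized gspread client straight from the secret payload,
    with no key file ever written to disk."""
    import gspread  # deferred so the parser stays dependency-free
    return gspread.service_account_from_dict(parse_service_account(secret_string))
```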
## Test plan
- [x] `cdk synth` passes for all 6 pipelines (dev stacks)
- [ ] Deploy each pipeline to dev
- [ ] Provision required secrets in dev
- [ ] Manual Step Functions / ECS execution for each pipeline
- [ ] Validate Redshift row counts match Klair output
- [ ] Disable Klair schedules and enable Surtr schedules