# Phase G: Data-minimisation refactor

**Date opened:** 2026-05-16

**Status:** Planning. No code yet; awaiting sign-off on this doc.

## Goal

Drop "server holds your portfolio" from the threat model. After this phase, Cassandra at rest knows: email, password hash, billing state, AI cost ledger, a non-attributed set of tickers, and current market prices for those tickers. It does **not** know which user holds what, at what cost, at what quantity.

Holdings live in the browser (localStorage). The server acts as a price proxy that returns the **entire ticker universe** to every authenticated client, so the request itself can't betray the user's pie. AI commentary is the only path where holdings transit the server, and it does so **in-memory for the duration of one LLM call**, never persisted.

## The shape

```
┌──────────────────────────────────────────────────────────┐
│  Browser (localStorage)                                  │
│  • parsed pie: positions, qty, avg_cost                  │
│  • derived: P/L, sector tilt, sparkline cache            │
└──────────────────────────────────────────────────────────┘
        │  GET  /api/universe          (full payload, gzipped)
        │  POST /api/portfolio/parse   (CSV → parsed pie)
        │  POST /api/analyze           (pie + prices → AI text)
        ▼
┌──────────────────────────────────────────────────────────┐
│  Server                                                  │
│  • users(email, hash, tier)                              │
│  • ticker_universe(ticker, currency, last_referenced_at) │
│  • quotes (already exists, keyed by ticker)              │
│  • strategic_logs / indicator_summaries (shared, macro)  │
│  • ai_calls (cost ledger, no holdings)                   │
│  ✗ NO positions table                                    │
│  ✗ NO portfolio_snapshots table                          │
│  ✗ NO per-user holdings, ever                            │
└──────────────────────────────────────────────────────────┘
```

## Privacy properties this buys

1. **Holdings are not at rest**. Server never writes a row that says "user X holds ticker Y". A full DB dump reveals only the *union* of all users' tickers, with no attribution.
2. **Price-refresh requests are unlinkable**.
Every authenticated user gets the same payload (entire universe), so neither access logs nor breach evidence can reveal holdings from request bodies.
3. **AI analysis is ephemeral**. Holdings transit memory only during one LLM call (~5-30s). No DB persistence, no logs of pie content.

## Privacy properties this does NOT buy

1. **Server briefly sees the pie** during `/api/portfolio/parse` (CSV upload) and `/api/analyze`. This is "minutes-of-retention, in-memory" not "zero-knowledge". GDPR-honest framing: *"shortest possible processing window, no retention."*
2. **Universe-add timing leak**. If only one user is active when a new ticker enters the universe, that ticker is linkable to that user via timestamps. Mitigation in plan below.
3. **Email is still PII**. Paddle billing requires it; nothing to do about that. Document clearly in privacy policy.

## Data model changes

### New tables

```python
from datetime import datetime

from sqlalchemy import DateTime, String
from sqlalchemy.orm import Mapped, mapped_column

# `Base` and `utcnow` are assumed to come from the existing models module.


class TickerUniverse(Base):
    """The set of public tickers Cassandra tracks. Populated as the union
    of all user holdings, *without user attribution*."""

    __tablename__ = "ticker_universe"

    yahoo_ticker: Mapped[str] = mapped_column(String(32), primary_key=True)
    currency: Mapped[str | None] = mapped_column(String(8))
    first_seen_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
    # Refreshed by any user heartbeat that contains this ticker.
    # When utcnow() - last_referenced_at > UNIVERSE_EVICTION_TTL, prune.
    last_referenced_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
```

### Removed tables (migration 0009)

- `positions`
- `portfolio_snapshots`
- `portfolios`

(The `Portfolio` model concept goes away. A user "having a portfolio" is now purely a browser-localStorage concept.)
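
The eviction rule noted in the `TickerUniverse` comments can be sketched in plain Python, independent of the ORM. This is a minimal illustration, not the service code: the function name `stale_tickers` and the dict-based input are hypothetical, and the 60-day value for `UNIVERSE_EVICTION_TTL` is taken from the TTL quoted elsewhere in this plan.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 60 days, matching the eviction TTL quoted in this plan.
UNIVERSE_EVICTION_TTL = timedelta(days=60)


def stale_tickers(
    last_referenced: dict[str, datetime],
    now: datetime,
    ttl: timedelta = UNIVERSE_EVICTION_TTL,
) -> list[str]:
    """Return tickers whose last reference is strictly older than ttl."""
    return sorted(
        ticker
        for ticker, seen_at in last_referenced.items()
        if now - seen_at > ttl
    )


now = datetime(2026, 5, 16, tzinfo=timezone.utc)
rows = {
    "AAPL": now - timedelta(days=3),
    "VWRL.L": now - timedelta(days=61),
    "TSLA": now - timedelta(days=60),  # exactly at the TTL, so it is kept
}
print(stale_tickers(rows, now))  # prints ['VWRL.L']
```

The comparison is strict, so a ticker referenced exactly one TTL ago survives one more eviction run. Nothing in the input carries a user identifier, which is the point of the table design.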

### Kept as-is

- `users`, `email_otps`: auth
- `quotes`, `quotes_daily`: price data
- `headlines`, `feeds`: news
- `strategic_logs`, `indicator_summaries`, `ai_calls`: macro AI (shared)
- `instrument_map`: T212 ↔ Yahoo resolution (admin-managed, read-only to user paths)

## New API surface

```
GET  /api/universe
  Auth: session/bearer required.
  Returns the full universe with current prices, gzipped JSON:
    {
      "as_of": "2026-05-16T14:00:00Z",
      "tickers": {
        "AAPL": {"p": 234.56, "c": "USD", "d": {"1d": 0.5, "1m": 3.2, "1y": 18.4}},
        "VWRL.L": {...},
        ...
      }
    }
  Cache-Control: max-age=60. Browser refreshes once a minute.

GET  /api/universe/sparkline/{ticker}
  Auth required. Lazy-loaded on hover. Same shape as today.

POST /api/portfolio/parse
  Auth required. multipart/form-data: file=.
  Server: parses, resolves T212→Yahoo via instrument_map,
  adds resolved tickers to ticker_universe (no user FK),
  returns parsed pie to browser. Discards parsed pie before responding.
  Response:
    {
      "positions": [
        {"yahoo_ticker": "AAPL", "name": "Apple Inc", "qty": 5,
         "avg_cost_gbp": 178.40, "currency": "USD"},
        ...
      ],
      "base_currency": "GBP",
      "warnings": ["3 unmapped tickers: ..."]
    }

POST /api/analyze
  Auth required. Body: {"positions": [...], "prices": {...}, "anchor": "..."}.
  Server constructs prompt, calls LLM, returns commentary text.
  No DB writes mentioning positions. ai_calls row written (no pie content).
  Optional: cache commentary text keyed by sha256(positions canonical JSON)
  so re-clicking is free. The hash is not reversible to holdings.
  Response: {"content": "...", "model": "...", "generated_at": "..."}

POST /api/universe/heartbeat   (optional, see "Open questions" below)
  Browser periodically POSTs its localStorage ticker set so the server
  can refresh last_referenced_at for those tickers. The "active client
  bumps timestamps" pattern keeps the universe trimmed to actually-held
  tickers.
```

### Endpoints removed

- `POST /api/portfolios/upload` (Phase B): replaced by `/api/portfolio/parse`
- `GET /api/portfolio/{name}/summary`: gone; browser computes from localStorage + universe prices

## Mitigation: universe-add timing leak

The naive "INSERT IGNORE on CSV parse" lets a passive observer link a universe-row's `first_seen_at` to a specific user's upload time. Two mitigations, layered:

1. **Batch additions.** New tickers don't enter `ticker_universe` directly from the request handler. They're queued (in Redis or in an in-process buffer) and flushed at fixed 5-minute boundaries. Multiple users' uploads batch together; ordering within a flush is randomised.
2. **Padding.** On every flush, also re-touch `last_referenced_at` on N random existing universe rows. This makes "row updated at flush time T" not specifically informative about new tickers.

At low user counts (alpha), the leak is mathematically unavoidable; document this in the alpha tester agreement and skip both mitigations until we have ≥10 concurrent users.

## Migration sequence

- [ ] **0009_drop_portfolio_tables.py**: drop `positions`, `portfolio_snapshots`, `portfolios`. Upgrade extracts distinct tickers from `positions` first to seed `ticker_universe`. Downgrade is one-way (irreversible drop); document this.
- [ ] **0010_ticker_universe.py**: create `ticker_universe` table. Could be merged into 0009; keep separate for clarity.

## Implementation order

Strategy: build the new path alongside the existing one. The destructive `DROP TABLE` step lands LAST, after end-to-end verification of the new architecture. Old endpoints are removed only after the browser is updated.

**Additive (non-destructive):**

- [x] 1. Add `redis:7-alpine` service to docker-compose.yml. New env var `REDIS_URL` in Settings. Smoke-test connectivity from `app`.
- [x] 2. Migration `0009_ticker_universe.py`: creates the new table only, leaves existing portfolio tables untouched.
- [x] 3.
`app/services/ticker_universe.py`: add/refresh/evict logic. Batch-flush via Redis with a 5-min boundary; padding-on-flush at first stays off (toggle for when we reach ≥10 users).
- [x] 3a. **Auth flip: passwordless.** Drop password_hash + email_verified (migration 0010). Collapse signup into login. Every auth is OTP. Threat model after Phase G makes passwords pure liability; see memory:cassandra_data_minimisation.
- [x] 4. `app/services/portfolio_analysis.py`: ephemeral LLM prompt + call. Pie passed in via request body, held in a function-local variable, never written to DB or logs. Includes input sanitisation (prompt-injection defence, NaN/inf rejection, 200-position cap).
- [x] 5. New router `app/routers/universe.py` with:
  - `GET /api/universe`
  - `GET /api/universe/sparkline/{ticker}`
  - `POST /api/portfolio/parse`
  - `POST /api/analyze`

  Added `GZipMiddleware` (≥500-byte threshold). Confirmed 70% compression on a 30-ticker universe payload. Old endpoints in `app/routers/api.py` stay live for now.
- [x] 6. `app/templates/partials/portfolio.html` (panel shell) + `static/js/portfolio.js` (localStorage pie + universe fetch + P/L compute + analyze button). `upload.html` rewired to new `/api/portfolio/parse` endpoint. CSS additions: pf-pill, pf-actions, pf-analysis, pf-warn.
- [x] 6a. Scheduler additions for Phase G:
  - `universe_flush_job` every 5 min (flushes Redis buffer → DB)
  - `universe_evict_job` daily at 00:15 UTC (60-day TTL prune)
  - `market_job` extended to fetch `config TOML ∪ ticker_universe`
- [x] 7. Tests: universe add/evict (in service), parse-shape sanitisation (21 tests), unlinkability contract (structural assertion that the universe handler signature can't take a user-identifying parameter without failing CI).
- [ ] 8. **End-to-end check (USER):** re-upload existing T212 CSV via new path, confirm pie renders correctly from localStorage with live prices, AI commentary works, no rows land in `positions` / `portfolio_snapshots`.
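
The NaN/inf rejection and 200-position cap listed for `portfolio_analysis.py` above could look roughly like the following. This is a hedged sketch, not the shipped code: `sanitise_positions` and `MAX_POSITIONS` are hypothetical names, and it assumes the `/api/analyze` positions use the same `yahoo_ticker` / `qty` / `avg_cost_gbp` fields as the parse response. Prompt-injection defence is deliberately not covered here.

```python
import math

MAX_POSITIONS = 200  # cap stated for the analysis service in this plan


def sanitise_positions(positions: list[dict]) -> list[dict]:
    """Validate a client-supplied pie; raise ValueError on any bad input."""
    if len(positions) > MAX_POSITIONS:
        raise ValueError(f"too many positions: {len(positions)} > {MAX_POSITIONS}")
    clean = []
    for pos in positions:
        ticker = str(pos.get("yahoo_ticker", "")).strip()
        if not ticker:
            raise ValueError("position is missing yahoo_ticker")
        numbers = {}
        for field in ("qty", "avg_cost_gbp"):
            value = pos.get(field)
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"{ticker}: {field} must be a number")
            try:
                number = float(value)
            except OverflowError:
                raise ValueError(f"{ticker}: {field} is out of range") from None
            if not math.isfinite(number):
                raise ValueError(f"{ticker}: {field} must be finite")
            numbers[field] = number
        # Only whitelisted fields survive; unknown keys are dropped.
        clean.append({"yahoo_ticker": ticker, **numbers})
    return clean
```

Returning a rebuilt list with only known fields, rather than the input, means unexpected keys from the browser never reach the prompt builder. The result stays a plain local value, in keeping with the "function-local variable, never written to DB or logs" rule.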

**Destructive (only after step 8 passes):**

- [x] 9. Migration `0011_drop_portfolio_tables.py`: dropped `positions` (299 rows), `portfolio_snapshots` (23 rows), `portfolios` (2 rows). Downgrade is one-way (structural only).
- [x] 10. Removed old endpoints `POST /api/portfolios/upload`, `GET /api/portfolios`. Removed `portfolio_job.py` from scheduler. `market_job` already fetches "config TOML ∪ ticker_universe" (step 6a). `news_job` rewired to use `ticker_universe ∪ instrument_map` for per-ticker news.
- [x] 11. Deleted `Portfolio` / `PortfolioSnapshot` / `Position` models from `app/models.py`. Removed `PortfolioSummary` / `PositionOut` from `app/schemas.py`. Removed `persist_pie` + `PersistResult` from `csv_import.py` (parser remains).

**Polish:**

- [ ] 12. `/privacy` page stating exactly what's held server-side and TTLs.
- [ ] 13. Update README + plan file's review section.

## Out of scope (deferred)

- **E2E encrypted sync of localStorage across devices.** Real demand from paying users would justify this. Mechanism: user-derived key from password (PBKDF2/Argon2 → KEK), encrypted pie blob stored on server, server can't decrypt. Phase H-ish.
- **True PIR for prices.** Cryptographic overkill for retail SaaS.
- **Anonymous billing.** Paddle requires an email. Accepted.

## Locked decisions (2026-05-16)

1. **Redis**: new compose service. Stores (a) the ephemeral pie during `/api/analyze` with a 5-min TTL, (b) the batch-buffer of new tickers awaiting universe flush. Slots in later for rate limits and Paddle webhook idempotency (Phase D).
2. **Sparklines lazy**: never bundled in `/api/universe`. Browser fetches `/api/universe/sparkline/{ticker}` on hover.
3. **Passive aging**: no heartbeat endpoint. `last_referenced_at` is bumped whenever a ticker appears in `/api/portfolio/parse` or `/api/analyze`. Eviction cron prunes rows with `last_referenced_at < now - 60 days`.
Effect: a user who re-uploads their CSV monthly keeps their tickers alive in the universe; long-departed users' tickers age out naturally.
4. **No data migration of existing pies**: `positions` rows are dropped without backfilling `ticker_universe`. Users re-upload their CSV once after deploy; it lands in browser localStorage.

## Review section (to be filled after implementation)

_TBD after sign-off + implementation._