phase G: data minimisation + passwordless auth + DeepSeek-first LLM

Server no longer holds portfolios. Holdings live in the browser
(localStorage); the server publishes an anonymous ticker_universe and a
gzipped /api/universe payload identical for every authenticated user, so
access patterns can't betray which tickers a user holds. AI commentary
is generated ephemerally from the browser-supplied pie and the cost
ledger row records no positions. Migrations 0009-0011 added the
universe table and dropped positions / portfolio_snapshots /
portfolios.

Authentication is now e-mail OTP only. Migration 0010 dropped
password_hash and email_verified (every active session is by
construction proof of email control). The /signup endpoint is gone;
signup and login share a single email-entry page. Email rendering is
HTML+plain-text multipart with a shared brand palette (app/branding.py)
asserted in sync with the CSS by a drift-detection test.

LLM provider defaults to DeepSeek-direct (cheaper, api.deepseek.com)
with OpenRouter as automatic fallback if DeepSeek fails. ai_log_job and
indicator_summary_job now iterate the two tones (NOVICE, INTERMEDIATE)
per cycle so the dashboard's tone toggle is instant; PROMPT_VERSION
bumped to 6 with an educational anti-TA / anti-gambling stance baked
into _CORE. NOVICE mode renders a curated glossary inline (CBOE VIX,
yield curve, HY OAS, etc.) with JS-positioned tooltips that survive
viewport edges and sticky bars. Model name and tokens hidden from the
user UI; still recorded in StrategicLog.model and AICall for admin.

Layout adds a sticky top nav, a sticky bottom markets bar (one chip per
exchange with status LED + headline index + 1d change), and
Phase H feedback reporting is queued in tasks/todo.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Giorgio Gilestro 2026-05-18 14:16:57 +01:00
parent 480fd311c5
commit 6e7f57c6b2
54 changed files with 5005 additions and 916 deletions

281
tasks/todo.md Normal file
View file

@ -0,0 +1,281 @@
# Phase G — Data-minimisation refactor
**Date opened:** 2026-05-16
**Status:** Planning. No code yet — awaiting sign-off on this doc.
## Goal
Drop "server holds your portfolio" from the threat model. After this phase,
Cassandra at rest knows: email, password hash, billing state, AI cost ledger,
a non-attributed set of tickers, and current market prices for those tickers.
It does **not** know which user holds what, at what cost, at what quantity.
Holdings live in the browser (localStorage). The server acts as a price proxy
that returns the **entire ticker universe** to every authenticated client, so
the request itself can't betray the user's pie. AI commentary is the only path
where holdings transit the server, and it does so **in-memory for the
duration of one LLM call**, never persisted.
## The shape
```
┌──────────────────────────────────────────────────────────┐
│ Browser (localStorage) │
│ • parsed pie: positions, qty, avg_cost │
│ • derived: P/L, sector tilt, sparkline cache │
└──────────────────────────────────────────────────────────┘
│ GET /api/universe (full payload, gzipped)
│ POST /api/portfolio/parse (CSV → parsed pie)
│ POST /api/analyze (pie + prices → AI text)
┌──────────────────────────────────────────────────────────┐
│ Server │
│ • users(email, hash, tier) │
│ • ticker_universe(ticker, currency, last_referenced_at) │
│ • quotes (already exists — keyed by ticker) │
│ • strategic_logs / indicator_summaries (shared, macro) │
│ • ai_calls (cost ledger, no holdings) │
│ ✗ NO positions table │
│ ✗ NO portfolio_snapshots table │
│ ✗ NO per-user holdings, ever │
└──────────────────────────────────────────────────────────┘
```
## Privacy properties this buys
1. **Holdings are not at rest**. Server never writes a row that says "user X
holds ticker Y". A full DB dump reveals only the *union* of all users'
tickers, with no attribution.
2. **Price-refresh requests are unlinkable**. Every authenticated user gets
the same payload (entire universe), so access logs / breach evidence can't
tell holdings from request bodies.
3. **AI analysis is ephemeral**. Holdings transit memory only during one LLM
call (~5-30s). No DB persistence, no logs of pie content.
## Privacy properties this does NOT buy
1. **Server briefly sees the pie** during `/api/portfolio/parse` (CSV upload)
and `/api/analyze`. This is "minutes-of-retention, in-memory" not
"zero-knowledge". GDPR-honest framing: *"shortest possible processing
window, no retention."*
2. **Universe-add timing leak**. If only one user is active when a new
ticker enters the universe, that ticker is linkable to that user via
timestamps. Mitigation in plan below.
3. **Email is still PII**. Paddle billing requires it; nothing to do about
that. Document clearly in privacy policy.
## Data model changes
### New tables
```python
class TickerUniverse(Base):
"""The set of public tickers Cassandra tracks. Populated as the union
of all user holdings, *without user attribution*."""
__tablename__ = "ticker_universe"
yahoo_ticker: Mapped[str] = mapped_column(String(32), primary_key=True)
currency: Mapped[str | None] = mapped_column(String(8))
first_seen_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
# Refreshed by any user heartbeat that contains this ticker.
# When utcnow() - last_referenced_at > UNIVERSE_EVICTION_TTL, prune.
last_referenced_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
```
### Removed tables (migration 0009)
- `positions`
- `portfolio_snapshots`
- `portfolios`
(The `Portfolio` model concept goes away. A user "having a portfolio" is now
purely a browser-localStorage concept.)
### Kept as-is
- `users`, `email_otps` — auth
- `quotes`, `quotes_daily` — price data
- `headlines`, `feeds` — news
- `strategic_logs`, `indicator_summaries`, `ai_calls` — macro AI (shared)
- `instrument_map` — T212 ↔ Yahoo resolution (admin-managed, read-only to user paths)
## New API surface
```
GET /api/universe
Auth: session/bearer required.
Returns the full universe with current prices, gzipped JSON:
{
"as_of": "2026-05-16T14:00:00Z",
"tickers": {
"AAPL": {"p": 234.56, "c": "USD", "d": {"1d": 0.5, "1m": 3.2, "1y": 18.4}},
"VWRL.L": {...},
...
}
}
Cache-Control: max-age=60. Browser refreshes once a minute.
GET /api/universe/sparkline/{ticker}
Auth required. Lazy-loaded on hover. Same shape as today.
POST /api/portfolio/parse
Auth required. multipart/form-data: file=<csv>.
Server: parses, resolves T212→Yahoo via instrument_map, adds resolved
tickers to ticker_universe (no user FK), returns parsed pie to browser.
Discards parsed pie before responding.
Response:
{
"positions": [
{"yahoo_ticker": "AAPL", "name": "Apple Inc",
"qty": 5, "avg_cost_gbp": 178.40, "currency": "USD"},
...
],
"base_currency": "GBP",
"warnings": ["3 unmapped tickers: ..."]
}
POST /api/analyze
Auth required. Body: {"positions": [...], "prices": {...}, "anchor": "..."}.
Server constructs prompt, calls LLM, returns commentary text.
No DB writes mentioning positions. ai_calls row written (no pie content).
Optional: cache commentary text keyed by sha256(positions canonical JSON)
so re-clicking is free. The hash is not reversible to holdings.
Response: {"content": "...", "model": "...", "generated_at": "..."}
POST /api/universe/heartbeat (optional, see "Open questions" below)
Browser periodically POSTs its localStorage ticker set so the server
can refresh last_referenced_at for those tickers. The "active client
bumps timestamps" pattern keeps the universe trimmed to actually-held
tickers.
```
### Endpoints removed
- `POST /api/portfolios/upload` (Phase B) — replaced by `/api/portfolio/parse`
- `GET /api/portfolio/{name}/summary` — gone; browser computes from
localStorage + universe prices
## Mitigation: universe-add timing leak
The naive "INSERT IGNORE on CSV parse" lets a passive observer link a
universe-row's `first_seen_at` to a specific user's upload time. Two
mitigations, layered:
1. **Batch additions.** New tickers don't enter `ticker_universe` directly
from the request handler. They're queued (in Redis or in an in-process
buffer) and flushed at fixed 5-minute boundaries. Multiple users' uploads
batch together; ordering within a flush is randomised.
2. **Padding.** On every flush, also re-touch `last_referenced_at` on N
random existing universe rows. This makes "row updated at flush time T"
not specifically informative about new tickers.
At low user counts (alpha), the leak is mathematically unavoidable; document
this in the alpha tester agreement and skip both mitigations until we have
≥10 concurrent users.
## Migration sequence
- [ ] **0009_drop_portfolio_tables.py** — drop `positions`,
`portfolio_snapshots`, `portfolios`. Upgrade extracts distinct tickers
from `positions` first to seed `ticker_universe`. Downgrade is
one-way (irreversible drop) — document this.
- [ ] **0010_ticker_universe.py** — create `ticker_universe` table.
Could be merged into 0009; keep separate for clarity.
## Implementation order
Strategy: build the new path alongside the existing one. The destructive
`DROP TABLE` step lands LAST, after end-to-end verification of the new
architecture. Old endpoints are removed only after the browser is updated.
**Additive (non-destructive):**
- [x] 1. Add `redis:7-alpine` service to docker-compose.yml. New env var
`REDIS_URL` in Settings. Smoke-test connectivity from `app`.
- [x] 2. Migration `0009_ticker_universe.py` — creates the new table only,
leaves existing portfolio tables untouched.
- [x] 3. `app/services/ticker_universe.py` — add/refresh/evict logic.
Batch-flush via Redis with a 5-min boundary; padding-on-flush at
first stays off (toggle for when we reach ≥10 users).
- [x] 3a. **Auth flip: passwordless.** Drop password_hash + email_verified
(migration 0010). Collapse signup into login. Every auth is OTP.
Threat model after Phase G makes passwords pure liability — see
memory:cassandra_data_minimisation.
- [x] 4. `app/services/portfolio_analysis.py` — ephemeral LLM prompt +
call. Pie passed in via request body, held in a function-local
variable, never written to DB or logs. Includes input sanitisation
(prompt-injection defence, NaN/inf rejection, 200-position cap).
- [x] 5. New router `app/routers/universe.py` with:
- `GET /api/universe`
- `GET /api/universe/sparkline/{ticker}`
- `POST /api/portfolio/parse`
- `POST /api/analyze`
Added `GZipMiddleware` (≥500-byte threshold). Confirmed 70%
compression on a 30-ticker universe payload. Old endpoints in
`app/routers/api.py` stay live for now.
- [x] 6. `app/templates/partials/portfolio.html` (panel shell) +
`static/js/portfolio.js` (localStorage pie + universe fetch +
P/L compute + analyze button). `upload.html` rewired to new
`/api/portfolio/parse` endpoint. CSS additions: pf-pill,
pf-actions, pf-analysis, pf-warn.
- [x] 6a. Scheduler additions for Phase G:
- `universe_flush_job` every 5 min (flushes Redis buffer → DB)
- `universe_evict_job` daily at 00:15 UTC (60-day TTL prune)
- `market_job` extended to fetch `config TOML ticker_universe`
- [x] 7. Tests: universe add/evict (in service), parse-shape sanitisation
(21 tests), unlinkability contract (structural assertion that
the universe handler signature can't take a user-identifying
parameter without failing CI).
- [ ] 8. **End-to-end check (USER):** re-upload existing T212 CSV via
new path, confirm pie renders correctly from localStorage with
live prices, AI commentary works, no rows land in `positions` /
`portfolio_snapshots`.
**Destructive (only after step 8 passes):**
- [x] 9. Migration `0011_drop_portfolio_tables.py` — dropped
`positions` (299 rows), `portfolio_snapshots` (23 rows),
`portfolios` (2 rows). Downgrade is one-way (structural only).
- [x] 10. Removed old endpoints `POST /api/portfolios/upload`,
`GET /api/portfolios`. Removed `portfolio_job.py` from
scheduler. `market_job` already fetches "config TOML
ticker_universe" (step 6a). `news_job` rewired to use
`ticker_universe instrument_map` for per-ticker news.
- [x] 11. Deleted `Portfolio` / `PortfolioSnapshot` / `Position` models
from `app/models.py`. Removed `PortfolioSummary` / `PositionOut`
from `app/schemas.py`. Removed `persist_pie` + `PersistResult`
from `csv_import.py` (parser remains).
**Polish:**
- [ ] 12. `/privacy` page stating exactly what's held server-side and TTLs.
- [ ] 13. Update README + plan file's review section.
## Out of scope (deferred)
- **E2E encrypted sync of localStorage across devices.** Real demand from
paying users would justify this. Mechanism: user-derived key from
password (PBKDF2/Argon2 → KEK), encrypted pie blob stored on server,
server can't decrypt. Phase H-ish.
- **True PIR for prices.** Cryptographic overkill for retail SaaS.
- **Anonymous billing.** Paddle requires an email. Accepted.
## Locked decisions (2026-05-16)
1. **Redis**: new compose service. Stores (a) the ephemeral pie during
`/api/analyze` with a 5-min TTL, (b) the batch-buffer of new tickers
awaiting universe flush. Slots in later for rate limits and Paddle
webhook idempotency (Phase D).
2. **Sparklines lazy** — never bundled in `/api/universe`. Browser fetches
`/api/universe/sparkline/{ticker}` on hover.
3. **Passive aging** — no heartbeat endpoint. `last_referenced_at` is bumped
whenever a ticker appears in `/api/portfolio/parse` or `/api/analyze`.
Eviction cron prunes rows with `last_referenced_at < now - 60 days`.
Effect: a user who re-uploads their CSV monthly keeps their tickers
alive in the universe; long-departed users' tickers age out naturally.
4. **No data migration of existing pies**`positions` rows are dropped
without backfilling `ticker_universe`. Users re-upload their CSV once
after deploy; it lands in browser localStorage.
## Review section (to be filled after implementation)
_TBD after sign-off + implementation._