read.markets/tasks/todo.md
Giorgio Gilestro 6e7f57c6b2 phase G: data minimisation + passwordless auth + DeepSeek-first LLM
Server no longer holds portfolios. Holdings live in the browser
(localStorage); the server publishes an anonymous ticker_universe and a
gzipped /api/universe payload identical for every authenticated user, so
access patterns can't betray which tickers a user holds. AI commentary
is generated ephemerally from the browser-supplied pie and the cost
ledger row records no positions. Migrations 0009-0011 added the
universe table and dropped positions / portfolio_snapshots /
portfolios.

Authentication is now e-mail OTP only. Migration 0010 dropped
password_hash and email_verified (every active session is by
construction proof of email control). The /signup endpoint is gone;
signup and login share a single email-entry page. Email rendering is
HTML+plain-text multipart with a shared brand palette (app/branding.py)
asserted in sync with the CSS by a drift-detection test.

LLM provider defaults to DeepSeek-direct (cheaper, api.deepseek.com)
with OpenRouter as automatic fallback if DeepSeek fails. ai_log_job and
indicator_summary_job now iterate the two tones (NOVICE, INTERMEDIATE)
per cycle so the dashboard's tone toggle is instant; PROMPT_VERSION
bumped to 6 with an educational anti-TA / anti-gambling stance baked
into _CORE. NOVICE mode renders a curated glossary inline (CBOE VIX,
yield curve, HY OAS, etc.) with JS-positioned tooltips that survive
viewport edges and sticky bars. Model name and tokens hidden from the
user UI; still recorded in StrategicLog.model and AICall for admin.

Layout adds a sticky top nav, a sticky bottom markets bar (one chip per
exchange with status LED + headline index + 1d change), and
Phase H feedback reporting is queued in tasks/todo.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:16:57 +01:00

281 lines
13 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Phase G — Data-minimisation refactor
**Date opened:** 2026-05-16
**Status:** Planning. No code yet — awaiting sign-off on this doc.
## Goal
Drop "server holds your portfolio" from the threat model. After this phase,
Cassandra at rest knows: email, password hash, billing state, AI cost ledger,
a non-attributed set of tickers, and current market prices for those tickers.
It does **not** know which user holds what, at what cost, at what quantity.
Holdings live in the browser (localStorage). The server acts as a price proxy
that returns the **entire ticker universe** to every authenticated client, so
the request itself can't betray the user's pie. AI commentary is the only path
where holdings transit the server, and it does so **in-memory for the
duration of one LLM call**, never persisted.
## The shape
```
┌──────────────────────────────────────────────────────────┐
│ Browser (localStorage) │
│ • parsed pie: positions, qty, avg_cost │
│ • derived: P/L, sector tilt, sparkline cache │
└──────────────────────────────────────────────────────────┘
│ GET /api/universe (full payload, gzipped)
│ POST /api/portfolio/parse (CSV → parsed pie)
│ POST /api/analyze (pie + prices → AI text)
┌──────────────────────────────────────────────────────────┐
│ Server │
│ • users(email, hash, tier) │
│ • ticker_universe(ticker, currency, last_referenced_at) │
│ • quotes (already exists — keyed by ticker) │
│ • strategic_logs / indicator_summaries (shared, macro) │
│ • ai_calls (cost ledger, no holdings) │
│ ✗ NO positions table │
│ ✗ NO portfolio_snapshots table │
│ ✗ NO per-user holdings, ever │
└──────────────────────────────────────────────────────────┘
```
## Privacy properties this buys
1. **Holdings are not at rest**. Server never writes a row that says "user X
holds ticker Y". A full DB dump reveals only the *union* of all users'
tickers, with no attribution.
2. **Price-refresh requests are unlinkable**. Every authenticated user gets
the same payload (entire universe), so access logs / breach evidence can't
tell holdings from request bodies.
3. **AI analysis is ephemeral**. Holdings transit memory only during one LLM
call (~5-30s). No DB persistence, no logs of pie content.
## Privacy properties this does NOT buy
1. **Server briefly sees the pie** during `/api/portfolio/parse` (CSV upload)
and `/api/analyze`. This is "minutes-of-retention, in-memory" not
"zero-knowledge". GDPR-honest framing: *"shortest possible processing
window, no retention."*
2. **Universe-add timing leak**. If only one user is active when a new
ticker enters the universe, that ticker is linkable to that user via
timestamps. Mitigation in plan below.
3. **Email is still PII**. Paddle billing requires it; nothing to do about
that. Document clearly in privacy policy.
## Data model changes
### New tables
```python
class TickerUniverse(Base):
"""The set of public tickers Cassandra tracks. Populated as the union
of all user holdings, *without user attribution*."""
__tablename__ = "ticker_universe"
yahoo_ticker: Mapped[str] = mapped_column(String(32), primary_key=True)
currency: Mapped[str | None] = mapped_column(String(8))
first_seen_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
# Refreshed by any user heartbeat that contains this ticker.
# When utcnow() - last_referenced_at > UNIVERSE_EVICTION_TTL, prune.
last_referenced_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
```
### Removed tables (migration 0009)
- `positions`
- `portfolio_snapshots`
- `portfolios`
(The `Portfolio` model concept goes away. A user "having a portfolio" is now
purely a browser-localStorage concept.)
### Kept as-is
- `users`, `email_otps` — auth
- `quotes`, `quotes_daily` — price data
- `headlines`, `feeds` — news
- `strategic_logs`, `indicator_summaries`, `ai_calls` — macro AI (shared)
- `instrument_map` — T212 ↔ Yahoo resolution (admin-managed, read-only to user paths)
## New API surface
```
GET /api/universe
Auth: session/bearer required.
Returns the full universe with current prices, gzipped JSON:
{
"as_of": "2026-05-16T14:00:00Z",
"tickers": {
"AAPL": {"p": 234.56, "c": "USD", "d": {"1d": 0.5, "1m": 3.2, "1y": 18.4}},
"VWRL.L": {...},
...
}
}
Cache-Control: max-age=60. Browser refreshes once a minute.
GET /api/universe/sparkline/{ticker}
Auth required. Lazy-loaded on hover. Same shape as today.
POST /api/portfolio/parse
Auth required. multipart/form-data: file=<csv>.
Server: parses, resolves T212→Yahoo via instrument_map, adds resolved
tickers to ticker_universe (no user FK), returns parsed pie to browser.
Discards parsed pie before responding.
Response:
{
"positions": [
{"yahoo_ticker": "AAPL", "name": "Apple Inc",
"qty": 5, "avg_cost_gbp": 178.40, "currency": "USD"},
...
],
"base_currency": "GBP",
"warnings": ["3 unmapped tickers: ..."]
}
POST /api/analyze
Auth required. Body: {"positions": [...], "prices": {...}, "anchor": "..."}.
Server constructs prompt, calls LLM, returns commentary text.
No DB writes mentioning positions. ai_calls row written (no pie content).
Optional: cache commentary text keyed by sha256(positions canonical JSON)
so re-clicking is free. The hash is not reversible to holdings.
Response: {"content": "...", "model": "...", "generated_at": "..."}
POST /api/universe/heartbeat (optional, see "Open questions" below)
Browser periodically POSTs its localStorage ticker set so the server
can refresh last_referenced_at for those tickers. The "active client
bumps timestamps" pattern keeps the universe trimmed to actually-held
tickers.
```
### Endpoints removed
- `POST /api/portfolios/upload` (Phase B) — replaced by `/api/portfolio/parse`
- `GET /api/portfolio/{name}/summary` — gone; browser computes from
localStorage + universe prices
## Mitigation: universe-add timing leak
The naive "INSERT IGNORE on CSV parse" lets a passive observer link a
universe-row's `first_seen_at` to a specific user's upload time. Two
mitigations, layered:
1. **Batch additions.** New tickers don't enter `ticker_universe` directly
from the request handler. They're queued (in Redis or in an in-process
buffer) and flushed at fixed 5-minute boundaries. Multiple users' uploads
batch together; ordering within a flush is randomised.
2. **Padding.** On every flush, also re-touch `last_referenced_at` on N
random existing universe rows. This makes "row updated at flush time T"
not specifically informative about new tickers.
At low user counts (alpha), the leak is mathematically unavoidable; document
this in the alpha tester agreement and skip both mitigations until we have
≥10 concurrent users.
## Migration sequence
- [ ] **0009_drop_portfolio_tables.py** — drop `positions`,
`portfolio_snapshots`, `portfolios`. Upgrade extracts distinct tickers
from `positions` first to seed `ticker_universe`. Downgrade is
one-way (irreversible drop) — document this.
- [ ] **0010_ticker_universe.py** — create `ticker_universe` table.
Could be merged into 0009; keep separate for clarity.
## Implementation order
Strategy: build the new path alongside the existing one. The destructive
`DROP TABLE` step lands LAST, after end-to-end verification of the new
architecture. Old endpoints are removed only after the browser is updated.
**Additive (non-destructive):**
- [x] 1. Add `redis:7-alpine` service to docker-compose.yml. New env var
`REDIS_URL` in Settings. Smoke-test connectivity from `app`.
- [x] 2. Migration `0009_ticker_universe.py` — creates the new table only,
leaves existing portfolio tables untouched.
- [x] 3. `app/services/ticker_universe.py` — add/refresh/evict logic.
Batch-flush via Redis with a 5-min boundary; padding-on-flush at
first stays off (toggle for when we reach ≥10 users).
- [x] 3a. **Auth flip: passwordless.** Drop password_hash + email_verified
(migration 0010). Collapse signup into login. Every auth is OTP.
Threat model after Phase G makes passwords pure liability — see
memory:cassandra_data_minimisation.
- [x] 4. `app/services/portfolio_analysis.py` — ephemeral LLM prompt +
call. Pie passed in via request body, held in a function-local
variable, never written to DB or logs. Includes input sanitisation
(prompt-injection defence, NaN/inf rejection, 200-position cap).
- [x] 5. New router `app/routers/universe.py` with:
- `GET /api/universe`
- `GET /api/universe/sparkline/{ticker}`
- `POST /api/portfolio/parse`
- `POST /api/analyze`
Added `GZipMiddleware` (≥500-byte threshold). Confirmed 70%
compression on a 30-ticker universe payload. Old endpoints in
`app/routers/api.py` stay live for now.
- [x] 6. `app/templates/partials/portfolio.html` (panel shell) +
`static/js/portfolio.js` (localStorage pie + universe fetch +
P/L compute + analyze button). `upload.html` rewired to new
`/api/portfolio/parse` endpoint. CSS additions: pf-pill,
pf-actions, pf-analysis, pf-warn.
- [x] 6a. Scheduler additions for Phase G:
- `universe_flush_job` every 5 min (flushes Redis buffer → DB)
- `universe_evict_job` daily at 00:15 UTC (60-day TTL prune)
- `market_job` extended to fetch `config TOML ticker_universe`
- [x] 7. Tests: universe add/evict (in service), parse-shape sanitisation
(21 tests), unlinkability contract (structural assertion that
the universe handler signature can't take a user-identifying
parameter without failing CI).
- [ ] 8. **End-to-end check (USER):** re-upload existing T212 CSV via
new path, confirm pie renders correctly from localStorage with
live prices, AI commentary works, no rows land in `positions` /
`portfolio_snapshots`.
**Destructive (only after step 8 passes):**
- [x] 9. Migration `0011_drop_portfolio_tables.py` — dropped
`positions` (299 rows), `portfolio_snapshots` (23 rows),
`portfolios` (2 rows). Downgrade is one-way (structural only).
- [x] 10. Removed old endpoints `POST /api/portfolios/upload`,
`GET /api/portfolios`. Removed `portfolio_job.py` from
scheduler. `market_job` already fetches "config TOML
ticker_universe" (step 6a). `news_job` rewired to use
`ticker_universe instrument_map` for per-ticker news.
- [x] 11. Deleted `Portfolio` / `PortfolioSnapshot` / `Position` models
from `app/models.py`. Removed `PortfolioSummary` / `PositionOut`
from `app/schemas.py`. Removed `persist_pie` + `PersistResult`
from `csv_import.py` (parser remains).
**Polish:**
- [ ] 12. `/privacy` page stating exactly what's held server-side and TTLs.
- [ ] 13. Update README + plan file's review section.
## Out of scope (deferred)
- **E2E encrypted sync of localStorage across devices.** Real demand from
paying users would justify this. Mechanism: user-derived key from
password (PBKDF2/Argon2 → KEK), encrypted pie blob stored on server,
server can't decrypt. Phase H-ish.
- **True PIR for prices.** Cryptographic overkill for retail SaaS.
- **Anonymous billing.** Paddle requires an email. Accepted.
## Locked decisions (2026-05-16)
1. **Redis**: new compose service. Stores (a) the ephemeral pie during
`/api/analyze` with a 5-min TTL, (b) the batch-buffer of new tickers
awaiting universe flush. Slots in later for rate limits and Paddle
webhook idempotency (Phase D).
2. **Sparklines lazy** — never bundled in `/api/universe`. Browser fetches
`/api/universe/sparkline/{ticker}` on hover.
3. **Passive aging** — no heartbeat endpoint. `last_referenced_at` is bumped
whenever a ticker appears in `/api/portfolio/parse` or `/api/analyze`.
Eviction cron prunes rows with `last_referenced_at < now - 60 days`.
Effect: a user who re-uploads their CSV monthly keeps their tickers
alive in the universe; long-departed users' tickers age out naturally.
4. **No data migration of existing pies**`positions` rows are dropped
without backfilling `ticker_universe`. Users re-upload their CSV once
after deploy; it lands in browser localStorage.
## Review section (to be filled after implementation)
_TBD after sign-off + implementation._