read.markets/tasks/todo.md
Giorgio Gilestro 6e7f57c6b2 phase G: data minimisation + passwordless auth + DeepSeek-first LLM
Server no longer holds portfolios. Holdings live in the browser
(localStorage); the server publishes an anonymous ticker_universe and a
gzipped /api/universe payload identical for every authenticated user, so
access patterns can't betray which tickers a user holds. AI commentary
is generated ephemerally from the browser-supplied pie and the cost
ledger row records no positions. Migrations 0009-0011 added the
universe table and dropped positions / portfolio_snapshots /
portfolios.

Authentication is now e-mail OTP only. Migration 0010 dropped
password_hash and email_verified (every active session is by
construction proof of email control). The /signup endpoint is gone;
signup and login share a single email-entry page. Email rendering is
HTML+plain-text multipart with a shared brand palette (app/branding.py)
asserted in sync with the CSS by a drift-detection test.

LLM provider defaults to DeepSeek-direct (cheaper, api.deepseek.com)
with OpenRouter as automatic fallback if DeepSeek fails. ai_log_job and
indicator_summary_job now iterate the two tones (NOVICE, INTERMEDIATE)
per cycle so the dashboard's tone toggle is instant; PROMPT_VERSION
bumped to 6 with an educational anti-TA / anti-gambling stance baked
into _CORE. NOVICE mode renders a curated glossary inline (CBOE VIX,
yield curve, HY OAS, etc.) with JS-positioned tooltips that survive
viewport edges and sticky bars. Model name and tokens hidden from the
user UI; still recorded in StrategicLog.model and AICall for admin.

Layout adds a sticky top nav, a sticky bottom markets bar (one chip per
exchange with status LED + headline index + 1d change), and
Phase H feedback reporting is queued in tasks/todo.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:16:57 +01:00

13 KiB
Raw Blame History

Phase G — Data-minimisation refactor

Date opened: 2026-05-16 Status: Planning. No code yet — awaiting sign-off on this doc.

Goal

Drop "server holds your portfolio" from the threat model. After this phase, Cassandra at rest knows: email, password hash, billing state, AI cost ledger, a non-attributed set of tickers, and current market prices for those tickers. It does not know which user holds what, at what cost, at what quantity.

Holdings live in the browser (localStorage). The server acts as a price proxy that returns the entire ticker universe to every authenticated client, so the request itself can't betray the user's pie. AI commentary is the only path where holdings transit the server, and it does so in-memory for the duration of one LLM call, never persisted.

The shape

   ┌──────────────────────────────────────────────────────────┐
   │ Browser (localStorage)                                   │
   │  • parsed pie: positions, qty, avg_cost                  │
   │  • derived: P/L, sector tilt, sparkline cache            │
   └──────────────────────────────────────────────────────────┘
            │ GET /api/universe (full payload, gzipped)
            │ POST /api/portfolio/parse (CSV → parsed pie)
            │ POST /api/analyze (pie + prices → AI text)
            ▼
   ┌──────────────────────────────────────────────────────────┐
   │ Server                                                   │
   │  • users(email, hash, tier)                              │
   │  • ticker_universe(ticker, currency, last_referenced_at) │
   │  • quotes (already exists — keyed by ticker)             │
   │  • strategic_logs / indicator_summaries (shared, macro)  │
   │  • ai_calls (cost ledger, no holdings)                   │
   │  ✗ NO positions table                                    │
   │  ✗ NO portfolio_snapshots table                          │
   │  ✗ NO per-user holdings, ever                            │
   └──────────────────────────────────────────────────────────┘

Privacy properties this buys

  1. Holdings are not at rest. Server never writes a row that says "user X holds ticker Y". A full DB dump reveals only the union of all users' tickers, with no attribution.
  2. Price-refresh requests are unlinkable. Every authenticated user gets the same payload (entire universe), so access logs / breach evidence can't tell holdings from request bodies.
  3. AI analysis is ephemeral. Holdings transit memory only during one LLM call (~5-30s). No DB persistence, no logs of pie content.

Privacy properties this does NOT buy

  1. Server briefly sees the pie during /api/portfolio/parse (CSV upload) and /api/analyze. This is "minutes-of-retention, in-memory" not "zero-knowledge". GDPR-honest framing: "shortest possible processing window, no retention."
  2. Universe-add timing leak. If only one user is active when a new ticker enters the universe, that ticker is linkable to that user via timestamps. Mitigation in plan below.
  3. Email is still PII. Paddle billing requires it; nothing to do about that. Document clearly in privacy policy.

Data model changes

New tables

class TickerUniverse(Base):
    """The set of public tickers Cassandra tracks. Populated as the union
    of all user holdings, *without user attribution*."""
    __tablename__ = "ticker_universe"
    yahoo_ticker: Mapped[str] = mapped_column(String(32), primary_key=True)
    currency: Mapped[str | None] = mapped_column(String(8))
    first_seen_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
    # Refreshed by any user heartbeat that contains this ticker.
    # When utcnow() - last_referenced_at > UNIVERSE_EVICTION_TTL, prune.
    last_referenced_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)

Removed tables (migration 0009)

  • positions
  • portfolio_snapshots
  • portfolios

(The Portfolio model concept goes away. A user "having a portfolio" is now purely a browser-localStorage concept.)

Kept as-is

  • users, email_otps — auth
  • quotes, quotes_daily — price data
  • headlines, feeds — news
  • strategic_logs, indicator_summaries, ai_calls — macro AI (shared)
  • instrument_map — T212 ↔ Yahoo resolution (admin-managed, read-only to user paths)

New API surface

GET /api/universe
    Auth: session/bearer required.
    Returns the full universe with current prices, gzipped JSON:
      {
        "as_of": "2026-05-16T14:00:00Z",
        "tickers": {
          "AAPL": {"p": 234.56, "c": "USD", "d": {"1d": 0.5, "1m": 3.2, "1y": 18.4}},
          "VWRL.L": {...},
          ...
        }
      }
    Cache-Control: max-age=60. Browser refreshes once a minute.

GET /api/universe/sparkline/{ticker}
    Auth required. Lazy-loaded on hover. Same shape as today.

POST /api/portfolio/parse
    Auth required. multipart/form-data: file=<csv>.
    Server: parses, resolves T212→Yahoo via instrument_map, adds resolved
    tickers to ticker_universe (no user FK), returns parsed pie to browser.
    Discards parsed pie before responding.
    Response:
      {
        "positions": [
          {"yahoo_ticker": "AAPL", "name": "Apple Inc",
           "qty": 5, "avg_cost_gbp": 178.40, "currency": "USD"},
          ...
        ],
        "base_currency": "GBP",
        "warnings": ["3 unmapped tickers: ..."]
      }

POST /api/analyze
    Auth required. Body: {"positions": [...], "prices": {...}, "anchor": "..."}.
    Server constructs prompt, calls LLM, returns commentary text.
    No DB writes mentioning positions. ai_calls row written (no pie content).
    Optional: cache commentary text keyed by sha256(positions canonical JSON)
    so re-clicking is free. The hash is not reversible to holdings.
    Response: {"content": "...", "model": "...", "generated_at": "..."}

POST /api/universe/heartbeat   (optional, see "Open questions" below)
    Browser periodically POSTs its localStorage ticker set so the server
    can refresh last_referenced_at for those tickers. The "active client
    bumps timestamps" pattern keeps the universe trimmed to actually-held
    tickers.

Endpoints removed

  • POST /api/portfolios/upload (Phase B) — replaced by /api/portfolio/parse
  • GET /api/portfolio/{name}/summary — gone; browser computes from localStorage + universe prices

Mitigation: universe-add timing leak

The naive "INSERT IGNORE on CSV parse" lets a passive observer link a universe-row's first_seen_at to a specific user's upload time. Two mitigations, layered:

  1. Batch additions. New tickers don't enter ticker_universe directly from the request handler. They're queued (in Redis or in an in-process buffer) and flushed at fixed 5-minute boundaries. Multiple users' uploads batch together; ordering within a flush is randomised.
  2. Padding. On every flush, also re-touch last_referenced_at on N random existing universe rows. This makes "row updated at flush time T" not specifically informative about new tickers.

At low user counts (alpha), the leak is mathematically unavoidable; document this in the alpha tester agreement and skip both mitigations until we have ≥10 concurrent users.

Migration sequence

  • 0009_drop_portfolio_tables.py — drop positions, portfolio_snapshots, portfolios. Upgrade extracts distinct tickers from positions first to seed ticker_universe. Downgrade is one-way (irreversible drop) — document this.
  • 0010_ticker_universe.py — create ticker_universe table. Could be merged into 0009; keep separate for clarity.

Implementation order

Strategy: build the new path alongside the existing one. The destructive DROP TABLE step lands LAST, after end-to-end verification of the new architecture. Old endpoints are removed only after the browser is updated.

Additive (non-destructive):

  • 1. Add redis:7-alpine service to docker-compose.yml. New env var REDIS_URL in Settings. Smoke-test connectivity from app.
  • 2. Migration 0009_ticker_universe.py — creates the new table only, leaves existing portfolio tables untouched.
  • 3. app/services/ticker_universe.py — add/refresh/evict logic. Batch-flush via Redis with a 5-min boundary; padding-on-flush at first stays off (toggle for when we reach ≥10 users).
  • 3a. Auth flip: passwordless. Drop password_hash + email_verified (migration 0010). Collapse signup into login. Every auth is OTP. Threat model after Phase G makes passwords pure liability — see memory:cassandra_data_minimisation.
  • 4. app/services/portfolio_analysis.py — ephemeral LLM prompt + call. Pie passed in via request body, held in a function-local variable, never written to DB or logs. Includes input sanitisation (prompt-injection defence, NaN/inf rejection, 200-position cap).
  • 5. New router app/routers/universe.py with: - GET /api/universe - GET /api/universe/sparkline/{ticker} - POST /api/portfolio/parse - POST /api/analyze Added GZipMiddleware (≥500-byte threshold). Confirmed 70% compression on a 30-ticker universe payload. Old endpoints in app/routers/api.py stay live for now.
  • 6. app/templates/partials/portfolio.html (panel shell) + static/js/portfolio.js (localStorage pie + universe fetch + P/L compute + analyze button). upload.html rewired to new /api/portfolio/parse endpoint. CSS additions: pf-pill, pf-actions, pf-analysis, pf-warn.
  • 6a. Scheduler additions for Phase G: - universe_flush_job every 5 min (flushes Redis buffer → DB) - universe_evict_job daily at 00:15 UTC (60-day TTL prune) - market_job extended to fetch config TOML ticker_universe
  • 7. Tests: universe add/evict (in service), parse-shape sanitisation (21 tests), unlinkability contract (structural assertion that the universe handler signature can't take a user-identifying parameter without failing CI).
  • 8. End-to-end check (USER): re-upload existing T212 CSV via new path, confirm pie renders correctly from localStorage with live prices, AI commentary works, no rows land in positions / portfolio_snapshots.

Destructive (only after step 8 passes):

  • 9. Migration 0011_drop_portfolio_tables.py — dropped positions (299 rows), portfolio_snapshots (23 rows), portfolios (2 rows). Downgrade is one-way (structural only).
  • 10. Removed old endpoints POST /api/portfolios/upload, GET /api/portfolios. Removed portfolio_job.py from scheduler. market_job already fetches "config TOML ticker_universe" (step 6a). news_job rewired to use ticker_universe instrument_map for per-ticker news.
  • 11. Deleted Portfolio / PortfolioSnapshot / Position models from app/models.py. Removed PortfolioSummary / PositionOut from app/schemas.py. Removed persist_pie + PersistResult from csv_import.py (parser remains).

Polish:

  • 12. /privacy page stating exactly what's held server-side and TTLs.
  • 13. Update README + plan file's review section.

Out of scope (deferred)

  • E2E encrypted sync of localStorage across devices. Real demand from paying users would justify this. Mechanism: user-derived key from password (PBKDF2/Argon2 → KEK), encrypted pie blob stored on server, server can't decrypt. Phase H-ish.
  • True PIR for prices. Cryptographic overkill for retail SaaS.
  • Anonymous billing. Paddle requires an email. Accepted.

Locked decisions (2026-05-16)

  1. Redis: new compose service. Stores (a) the ephemeral pie during /api/analyze with a 5-min TTL, (b) the batch-buffer of new tickers awaiting universe flush. Slots in later for rate limits and Paddle webhook idempotency (Phase D).
  2. Sparklines lazy — never bundled in /api/universe. Browser fetches /api/universe/sparkline/{ticker} on hover.
  3. Passive aging — no heartbeat endpoint. last_referenced_at is bumped whenever a ticker appears in /api/portfolio/parse or /api/analyze. Eviction cron prunes rows with last_referenced_at < now - 60 days. Effect: a user who re-uploads their CSV monthly keeps their tickers alive in the universe; long-departed users' tickers age out naturally.
  4. No data migration of existing piespositions rows are dropped without backfilling ticker_universe. Users re-upload their CSV once after deploy; it lands in browser localStorage.

Review section (to be filled after implementation)

TBD after sign-off + implementation.