# LLM-fallback CSV parser — Design Spec **Date:** 2026-05-27 **Status:** Draft — pending implementation plan ## Context Today the only supported broker import is Trading 212. `parse_t212_csv` expects T212's exact column set (`Slice`, `Owned quantity`, etc.) and raises `CSVImportError` on anything else. Every non-T212 user hits a wall at onboarding. Rather than write a hand-rolled parser per broker (IBKR, Vanguard, Fidelity, Schwab, eToro, Degiro, …) — and chase format drift forever — we use an LLM as a transparent fallback. The LLM never sees holdings as data; it only looks at **headers plus a handful of sample rows** and returns a JSON column-mapping. Our existing Python code does the row iteration. The first time a broker format appears, the LLM produces a mapping. We fingerprint the format (sha256 of normalized headers) and cache the mapping in a new `csv_format_templates` table. Every subsequent upload of the same format — by any user — replays the cached mapping deterministically, with no LLM call. The cache row stores the header row and a single anonymous sample data row (the first row from the originating upload, verbatim). No user identifier is recorded — the row is not linked back to whoever uploaded it. The purpose of the sample is to give the operator material to look at when designing future native parsers; this collection is **passive learning only**, the system never attempts to author or modify parser code automatically. Portfolio import is already advertised as a paid-only feature; we make that explicit at the route level as part of this work. ## Goals - Accept CSV exports from any broker, not just T212. - Pay the LLM cost only once per **format**, not once per user. - Never persist user holdings on the server (already a system-wide invariant). - Surface the same response shape to the browser regardless of which parser branch ran — no client changes beyond a copy tweak. ## Non-goals - Per-broker UI customisation. The drop-zone stays generic. - A human admin queue for reviewing LLM-discovered formats. Operator can inspect rows directly in the DB if curious. - **Auto-promoting learned formats to native parsers.** The operator will hand-write any native parser by looking at the collected sample rows. The system never writes or modifies code. - Self-healing or auto-evicting stale cache entries. If a broker silently changes their export shape under us, the cached mapping will start producing parse errors; the operator deletes the row manually. We do not invalidate cache entries automatically. - Multi-stage / verification LLM passes. One call per first-time format. ## Architecture ``` POST /api/portfolio/parse (paid-only) ├─ parse_t212_csv(raw) ── happy path, unchanged │ └─ CSVImportError ↴ │ ├─ parse_with_llm(raw, session) │ ├─ detect delimiter + preamble offset │ ├─ fingerprint = sha256(normalised headers) │ ├─ SELECT csv_format_templates WHERE fingerprint=? │ │ ├─ HIT → apply mapping (bump use_count/last_used_at after successful parse) │ │ └─ MISS → openrouter.call_llm(headers + 3-5 sample rows) │ │ → validate mapping │ │ → INSERT csv_format_templates │ │ → apply mapping │ └─ returns ParsedPie (same shape as T212 path) │ └─ resolve_slice → upsert_tickers → inline Yahoo fetch → JSON response (existing pipeline, unchanged) ``` ### Why column-mapping, not full extraction We pass the LLM only **headers plus 3–5 sample rows**, not the full CSV. The LLM returns column names, not transcribed numbers. Three benefits: 1. **Safety** — LLMs hallucinate digits; they don't hallucinate column names that aren't there. Mapping validation can verify every named column exists in the actual header row. 2. **Cost** — prompt is ~1 KB regardless of portfolio size. 3. **Cacheability** — the mapping IS the cache. Replay is deterministic Python, no LLM in the loop on re-imports. ### Why global cache, not per-user The column structure of an IBKR Activity Statement is a property of IBKR, not of any individual user. The cache row contains no user identifier — the sample data row is stored verbatim but anonymously, with nothing linking it to the uploader. Global cache is strictly better: faster onboarding for the second IBKR user, and the collected samples form a small, useful corpus for hand-writing native parsers later. ## Data model New table `csv_format_templates`: | Column | Type | Notes | |---|---|---| | `id` | int PK | | | `fingerprint` | `VARCHAR(64) UNIQUE NOT NULL` | sha256 hex of normalised header tuple | | `headers` | JSON | List of strings — actual header row from the upload | | `sample_row` | JSON | First data row from the originating upload, verbatim. Not linked to any user. | | `mapping` | JSON | `{ticker_col, qty_col, name_col, cost_col, currency_col}` | | `preamble_rows` | INT NOT NULL DEFAULT 0 | Non-data lines before the header row | | `delimiter` | CHAR(1) NOT NULL DEFAULT ',' | | | `broker_label` | VARCHAR(128) | LLM-identified label, e.g. "Interactive Brokers Activity Statement" | | `first_seen_at` | DATETIME(tz) NOT NULL | When the format was first cached | | `use_count` | INT NOT NULL DEFAULT 1 | Bumped on each successful cache hit | | `last_used_at` | DATETIME(tz) NOT NULL | | | `llm_model` | VARCHAR(64) | Provenance of the initial extraction | | `llm_cost_usd` | FLOAT | Same | Migration: `alembic/versions/0021_csv_format_template.py` (based on `0020`). The full uploaded CSV is **not** stored — only the header row plus a single data row (`sample_row`). No `user_id` column exists on this table; the sample is anonymous by construction. This is a deliberate, narrow exception to the otherwise-strict "no holdings persisted" invariant: we keep one row per format so the operator has concrete material to look at when hand-writing a future native parser. One anonymous row carries no portfolio context (no totals, no other positions) and cannot be linked back to an account. ## Components ### `app/services/llm_csv_parser.py` — new Public surface: ```python async def parse_with_llm( raw: bytes, session: AsyncSession, ) -> ParsedPie: """LLM-fallback CSV parser. Decodes raw bytes, detects delimiter and preamble offset, fingerprints the header row, hits the csv_format_templates cache. On miss, calls openrouter.call_llm with headers + 3-5 sample rows to extract a column-mapping, validates it, persists a new template, and applies the mapping. Returns the same ParsedPie shape as parse_t212_csv. """ class LLMParseError(ValueError): """Raised when the LLM call fails or returns an unusable mapping.""" ``` Internal helpers (not exported): - `_detect_dialect(raw: bytes) -> tuple[str, int]` — returns `(delimiter, preamble_rows)`. Uses Python's `csv.Sniffer` for delimiter, then walks rows until the first row whose tokens look like column headers (heuristic: all-strings, none parse as numbers). - `_fingerprint(headers: list[str]) -> str` — lowercases, strips whitespace, joins with `|`, returns sha256 hex. - `_extract_mapping_via_llm(client, headers, samples) -> dict` — builds the system prompt, calls `openrouter.call_llm`, parses the JSON envelope, raises `LLMParseError` on malformed output. - `_validate_mapping(mapping, headers, first_row) -> None` — every named column must exist in `headers`; `qty_col`'s value on `first_row` must parse as a positive number; `cost_col` (if present) must parse as a number. Raises `LLMParseError` on failure. - `_apply_mapping(rows, mapping) -> ParsedPie` — iterates remaining rows, builds `ParsedPosition` instances, computes totals from `qty * avg_cost` when explicit totals aren't present. Reuses without modification: - `app/services/openrouter.py::call_llm` — provider fallback chain + AICall ledger logging - `app/services/csv_import.py::ParsedPie, ParsedPosition, CSVImportError` — same return type, same error hierarchy. `LLMParseError` inherits from `CSVImportError` so the route can catch both as one. ### `app/routers/universe.py::parse_portfolio` — modified Two small changes: 1. Add `Depends(require_paid)` to the route decorator. (Portfolio import has always been advertised as paid; this aligns the implementation.) 2. Wrap the existing `parse_t212_csv` call in a try/except that falls through to `parse_with_llm` on `CSVImportError`: ```python try: pie = parse_t212_csv(raw) except CSVImportError: from app.services.llm_csv_parser import parse_with_llm, LLMParseError try: pie = await parse_with_llm(raw, session) except LLMParseError as e: raise HTTPException(status_code=400, detail=str(e)) ``` Everything below this point in the function — resolve_slice loop, upsert_tickers, inline Yahoo fetch, response build — is unchanged. `pie` has the same shape regardless of branch. ### `app/models.py` — new model `CsvFormatTemplate` declared alongside the other tables. Columns as in the data model table above. ### `app/templates/settings.html` — copy tweak - Section heading: "Import portfolio (Trading 212 CSV)" → "Import portfolio (CSV)" - Drop-zone label: "Drop a T212 pie CSV here" → "Drop your broker's portfolio CSV here" - Drop-zone hint: append " · T212, IBKR, and others auto-detected" after the size limit - The "Export your pie from T212" instructions paragraph stays as a help link — T212 is still the best-documented happy path — but its phrasing softens to "If you use Trading 212…" ## LLM prompt shape System prompt fixes the schema. User message contains headers + samples. ``` SYSTEM: You are an expert at recognising broker portfolio CSV formats. You will be given the header row and 3-5 sample data rows from a CSV. Identify which column contains each field. Return ONLY JSON, no prose. Schema: { "ticker_col": "
", "qty_col": "
", "name_col": "
", "cost_col": "
", // average price per share or unit cost "currency_col": "
", "broker_label": "" } Rules: - Use null when no column is a good match. - ticker_col and qty_col are required; if either is missing return all nulls. - Use the EXACT header string as it appears in the input. USER: headers: ["Symbol","Position","Avg Price","Currency"] samples: AAPL,100,150.00,USD MSFT,50,300.00,USD ... ``` The LLM never sees the entire file; it sees only the first ~5 data rows. Token cost is bounded and uniform regardless of portfolio size. ## Error handling | Failure | Response | Ledger | |---|---|---| | LLM provider down | 502 "couldn't parse — try again later" | AICall status=failed | | LLM returns non-JSON | 400 "couldn't recognise as portfolio CSV" | AICall status=ok, no template stored | | Mapping missing required columns (ticker/qty) | 400 same | AICall status=ok, no template stored | | Mapping references non-existent column | 400 same | AICall status=ok, no template stored | | Mapping validates but row parse fails on numerics | 400 same | template NOT stored | | Cache hit but row parse fails (format drifted under us) | 400 with parse error | — | If a broker quietly changes their CSV shape such that a previously-good cached mapping starts producing parse failures, the user sees an error and the operator deletes the offending `csv_format_templates` row by hand. No automatic eviction, no automatic retry. The cache is a learning store, not a self-managing system. ## Testing `tests/test_llm_csv_parser.py`: - **Fingerprint stability** — case/whitespace/BOM variants of the same headers hash to the same fingerprint. - **Cache hit path** — pre-populate a `CsvFormatTemplate` row, mock `call_llm` to fail loudly, assert it is NOT called, assert positions come out correct, assert `use_count` is incremented. - **Cache miss path** — mock `call_llm` to return a valid mapping JSON, assert a row is inserted with the upload's actual first data row as `sample_row` and no user_id anywhere, assert positions come out correct. - **LLM returns malformed JSON** — raises `LLMParseError`, no template stored. - **LLM maps to non-existent column** — raises `LLMParseError`, no template stored. - **LLM maps qty to a non-numeric column** — raises `LLMParseError` on validation. - **Stale cached mapping on parse failure** — pre-populate a template whose mapping no longer matches the file content, assert a 400 is returned and the template is NOT deleted automatically (operator owns eviction). - **Integration** — POST a fabricated IBKR-shaped fixture to `/api/portfolio/parse`, assert ParsedPie round-trips, assert no second LLM call on a repeat upload. Existing `tests/test_csv_import.py` must still pass — the T212 happy path is unchanged. ## Verification End-to-end manual check after deploy: 1. Upload a T212 fixture → exists path stays unchanged (same dashboard load behaviour). 2. Upload a fabricated IBKR CSV → first upload calls LLM, returns positions, template row created in DB. 3. Re-upload the same IBKR CSV → second call has zero LLM cost (verify by counting `ai_calls` rows before/after), `use_count` increments to 2. 4. Inspect `csv_format_templates` row: confirm `headers` matches the upload's headers, `sample_row` is the first real data row, no `user_id` column exists on the table. 5. Upload random garbage (e.g. a screenshot renamed `.csv`) → 400 with clean error, no template stored, AICall row logged. 6. Free-tier account attempts import → 402 (paid gating). ## Open questions for the implementation plan - Whether to read sample rows with `csv.reader` and re-encode them as text for the LLM (safer for embedded commas/quotes), or pass the raw first-N-lines verbatim. Default: the safer reader path. - Whether to cap LLM-parsed portfolios at the same 1 MB limit as T212 (yes) and whether to add a separate cap on number-of-rows fed to the LLM as samples (yes, 5). - Whether to log the fingerprint to the request log on cache hit/miss for operability. Default: yes, at INFO level, with `event_type="csv.format.cache_hit"` / `"csv.format.cache_miss"`.