read.markets/docs/superpowers/specs/2026-05-27-llm-csv-fallback-parser-design.md
Giorgio Gilestro 263ecc0d3b docs: spec for LLM-fallback CSV parser
Transparent fallback after parse_t212_csv: LLM extracts a column-mapping
(not the data), result is cached globally by header fingerprint, replay
is deterministic Python. Stored dummy contains headers + synthetic row
only — no user holdings ever persisted.
2026-05-27 11:15:42 +02:00

# LLM-fallback CSV parser — Design Spec
**Date:** 2026-05-27
**Status:** Draft — pending implementation plan
## Context
Today the only supported broker import is Trading 212. `parse_t212_csv` expects
T212's exact column set (`Slice`, `Owned quantity`, etc.) and raises
`CSVImportError` on anything else. Every non-T212 user hits a wall at
onboarding.
Rather than write a hand-rolled parser per broker (IBKR, Vanguard, Fidelity,
Schwab, eToro, Degiro, …) — and chase format drift forever — we use an LLM as
a transparent fallback. The LLM is never asked to transcribe holdings; it sees only
**headers plus a handful of sample rows** and returns a JSON column-mapping.
Our existing Python code does the row iteration.
The first time a broker format appears, the LLM produces a mapping. We
fingerprint the format (sha256 of normalized headers) and cache the mapping
in a new `csv_format_templates` table. Every subsequent upload of the same
format — by any user — replays the cached mapping deterministically, with no
LLM call.
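
As a rough illustration of the fingerprinting step, the sketch below normalises a header row and hashes it. The function name `fingerprint` is a stand-in for the internal helper, and stripping a byte-order mark is an assumption drawn from the fingerprint-stability test later in this spec, not something the normalisation rule states explicitly.

```python
import hashlib


def fingerprint(headers: list[str]) -> str:
    """Return the sha256 hex digest of a normalised header row."""
    normalised = [
        # Assumption: drop a stray BOM, then trim whitespace and ignore case.
        header.replace("\ufeff", "").strip().lower()
        for header in headers
    ]
    return hashlib.sha256("|".join(normalised).encode("utf-8")).hexdigest()


# Two cosmetic variants of the same export should share one cache key.
clean = fingerprint(["Symbol", "Position", "Avg Price", "Currency"])
messy = fingerprint(["\ufeffsymbol ", " POSITION", "avg price", "currency"])
```

Because the digest depends only on header text, a broker that adds, renames, or reorders a column produces a new fingerprint and therefore a fresh cache miss.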
The cache row stores headers and a synthetic placeholder ("dummy") row, never
real user data. The mapping itself is a column-name dictionary, also free of
holdings.
Portfolio import is already advertised as a paid-only feature; we make that
explicit at the route level as part of this work.
## Goals
- Accept CSV exports from any broker, not just T212.
- Pay the LLM cost only once per **format**, not once per user.
- Never persist user holdings on the server (already a system-wide invariant).
- Surface the same response shape to the browser regardless of which parser
branch ran — no client changes beyond a copy tweak.
## Non-goals
- Per-broker UI customisation. The drop-zone stays generic.
- A human admin queue for reviewing LLM-discovered formats. Operator can
inspect rows directly in the DB if curious.
- Promoting "trusted" formats to native parsers. That's a future evolution
if a single broker dominates LLM-parsed traffic.
- Multi-stage / verification LLM passes. One call per first-time format.
## Architecture
```
POST /api/portfolio/parse  (paid-only)
  ├─ parse_t212_csv(raw)  ── happy path, unchanged
  │    └─ CSVImportError ↴
  ├─ parse_with_llm(raw, session)
  │    ├─ detect delimiter + preamble offset
  │    ├─ fingerprint = sha256(normalised headers)
  │    ├─ SELECT csv_format_templates WHERE fingerprint=?
  │    │    ├─ HIT  → apply mapping (bump use_count/last_used_at after successful parse)
  │    │    └─ MISS → openrouter.call_llm(headers + 3-5 sample rows)
  │    │              → validate mapping
  │    │              → INSERT csv_format_templates
  │    │              → apply mapping
  │    └─ returns ParsedPie (same shape as T212 path)
  └─ resolve_slice → upsert_tickers → inline Yahoo fetch → JSON response
       (existing pipeline, unchanged)
```
### Why column-mapping, not full extraction
We pass the LLM only **headers plus 3-5 sample rows**, not the full CSV. The
LLM returns column names, not transcribed numbers. Three benefits:
1. **Safety** — LLMs hallucinate digits; they don't hallucinate column names
that aren't there. Mapping validation can verify every named column exists
in the actual header row.
2. **Cost** — prompt is ~1 KB regardless of portfolio size.
3. **Cacheability** — the mapping IS the cache. Replay is deterministic Python,
no LLM in the loop on re-imports.
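
The safety argument rests on the mapping being cheap to check against the real file. A minimal sketch of such a check follows, using the rules listed for `_validate_mapping` in the Components section. The `LLMParseError` class here is a local stand-in, and plain `float()` parsing is an assumption (it would reject locale-formatted numbers such as `1.234,50`).

```python
class LLMParseError(ValueError):
    """Local stand-in for the spec's LLMParseError."""


FIELD_KEYS = ("ticker_col", "qty_col", "name_col", "cost_col", "currency_col")


def validate_mapping(mapping: dict, headers: list[str], first_row: dict) -> None:
    """Raise LLMParseError unless the mapping is usable on this file."""
    for key in ("ticker_col", "qty_col"):
        if not mapping.get(key):
            raise LLMParseError(f"mapping is missing required field {key}")
    # Every column the LLM named must exist in the actual header row.
    for key in FIELD_KEYS:
        column = mapping.get(key)
        if column is not None and column not in headers:
            raise LLMParseError(f"{key} names unknown column {column!r}")
    try:
        quantity = float(first_row[mapping["qty_col"]])
    except (KeyError, TypeError, ValueError) as exc:
        raise LLMParseError("quantity column is not numeric") from exc
    if not quantity > 0:  # also rejects NaN
        raise LLMParseError("quantity must be positive")
    cost_column = mapping.get("cost_col")
    if cost_column is not None:
        try:
            float(first_row[cost_column])
        except (KeyError, TypeError, ValueError) as exc:
            raise LLMParseError("cost column is not numeric") from exc


headers = ["Symbol", "Position", "Avg Price", "Currency"]
first_row = {"Symbol": "AAPL", "Position": "100", "Avg Price": "150.00", "Currency": "USD"}
good = {"ticker_col": "Symbol", "qty_col": "Position", "cost_col": "Avg Price"}
validate_mapping(good, headers, first_row)  # passes silently
```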
### Why global cache, not per-user
The column structure of an IBKR Activity Statement is a property of IBKR, not
of any individual user. The stored "dummy" contains no PII (column headers
are public; the sample row is synthetic). So a global cache is strictly better:
faster onboarding for the second IBKR user, and the operator still gets a
`first_seen_user_id` audit column for forensic traceability.
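
The hit/miss behaviour of a shared cache can be sketched with the standard-library `sqlite3` module. This is only an illustration: the real service takes an `AsyncSession`, the table below carries a reduced column set, and the helper names `lookup_mapping`, `store_mapping`, and `record_use` are hypothetical.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Optional

SCHEMA = """
CREATE TABLE csv_format_templates (
    id INTEGER PRIMARY KEY,
    fingerprint TEXT UNIQUE NOT NULL,
    mapping TEXT NOT NULL,
    use_count INTEGER NOT NULL DEFAULT 1,
    last_used_at TEXT NOT NULL
)
"""


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def lookup_mapping(conn: sqlite3.Connection, fingerprint: str) -> Optional[dict]:
    """Return the cached mapping for this format, or None on a miss."""
    row = conn.execute(
        "SELECT mapping FROM csv_format_templates WHERE fingerprint = ?",
        (fingerprint,),
    ).fetchone()
    return json.loads(row[0]) if row else None


def store_mapping(conn: sqlite3.Connection, fingerprint: str, mapping: dict) -> None:
    """Persist a freshly extracted mapping (first sighting of a format)."""
    conn.execute(
        "INSERT INTO csv_format_templates (fingerprint, mapping, last_used_at) "
        "VALUES (?, ?, ?)",
        (fingerprint, json.dumps(mapping), _now()),
    )
    conn.commit()


def record_use(conn: sqlite3.Connection, fingerprint: str) -> None:
    """Bump the usage counters after a successful parse from cache."""
    conn.execute(
        "UPDATE csv_format_templates "
        "SET use_count = use_count + 1, last_used_at = ? WHERE fingerprint = ?",
        (_now(), fingerprint),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
store_mapping(conn, "abc123", {"ticker_col": "Symbol", "qty_col": "Position"})
hit = lookup_mapping(conn, "abc123")    # second user with the same format
record_use(conn, "abc123")
miss = lookup_mapping(conn, "unknown")  # would trigger the LLM call
```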
## Data model
New table `csv_format_templates`:
| Column | Type | Notes |
|---|---|---|
| `id` | int PK | |
| `fingerprint` | `VARCHAR(64) UNIQUE NOT NULL` | sha256 hex of normalised header tuple |
| `headers` | JSON | List of strings — actual header row, no PII |
| `sample_dummy` | JSON | One synthetic placeholder row for human eyeball |
| `mapping` | JSON | `{ticker_col, qty_col, name_col, cost_col, currency_col}` |
| `preamble_rows` | INT NOT NULL DEFAULT 0 | Non-data lines before the header row |
| `delimiter` | CHAR(1) NOT NULL DEFAULT ',' | |
| `broker_label` | VARCHAR(128) | LLM-identified label, e.g. "Interactive Brokers Activity Statement" |
| `first_seen_user_id` | INT NULL, FK users(id) ON DELETE SET NULL | Audit only |
| `first_seen_at` | DATETIME(tz) NOT NULL | |
| `use_count` | INT NOT NULL DEFAULT 1 | Bumped on cache hit |
| `last_used_at` | DATETIME(tz) NOT NULL | |
| `llm_model` | VARCHAR(64) | Provenance of the initial extraction |
| `llm_cost_usd` | FLOAT | Same |
Migration: `alembic/versions/0021_csv_format_template.py` (based on `0020`).
No raw CSV bytes are ever stored. `headers` and `sample_dummy` are the only
payloads. `sample_dummy` is synthesised post-extraction by replacing column
values with placeholder strings (`"TICKER"`, `"100"`, `"1.50"`) keyed to the
mapping — the operator can eyeball the format shape without seeing any real
holdings.
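
One possible shape for the dummy-row synthesis is sketched below. The placeholders `"TICKER"`, `"100"`, and `"1.50"` come from the paragraph above; the `"NAME"` and `"CCY"` placeholders and the choice to leave unmapped columns empty are assumptions made for illustration.

```python
PLACEHOLDERS = {
    "ticker_col": "TICKER",
    "qty_col": "100",
    "cost_col": "1.50",
    "name_col": "NAME",     # assumed placeholder
    "currency_col": "CCY",  # assumed placeholder
}


def synthesise_dummy(headers: list[str], mapping: dict) -> dict:
    """Build a placeholder row keyed by header, using no values from the upload."""
    by_column = {
        mapping[key]: placeholder
        for key, placeholder in PLACEHOLDERS.items()
        if mapping.get(key)
    }
    # Assumption: columns the mapping does not cover are left blank.
    return {header: by_column.get(header, "") for header in headers}


dummy = synthesise_dummy(
    ["Symbol", "Position", "Avg Price", "Currency", "Notes"],
    {
        "ticker_col": "Symbol",
        "qty_col": "Position",
        "name_col": None,
        "cost_col": "Avg Price",
        "currency_col": "Currency",
    },
)
```

The function takes only headers and the mapping as input, so it cannot leak a real cell value even by accident.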
## Components
### `app/services/llm_csv_parser.py` — new
Public surface:
```python
from app.services.csv_import import CSVImportError, ParsedPie


async def parse_with_llm(
    raw: bytes,
    session: AsyncSession,
) -> ParsedPie:
    """LLM-fallback CSV parser.

    Decodes raw bytes, detects delimiter and preamble offset, fingerprints
    the header row, hits the csv_format_templates cache. On miss, calls
    openrouter.call_llm with headers + 3-5 sample rows to extract a
    column-mapping, validates it, persists a new template, and applies the
    mapping. Returns the same ParsedPie shape as parse_t212_csv.
    """


class LLMParseError(CSVImportError):
    """Raised when the LLM call fails or returns an unusable mapping."""
```
Internal helpers (not exported):
- `_detect_dialect(raw: bytes) -> tuple[str, int]` — returns `(delimiter, preamble_rows)`. Uses Python's `csv.Sniffer` for delimiter, then walks rows until the first row whose tokens look like column headers (heuristic: all-strings, none parse as numbers).
- `_fingerprint(headers: list[str]) -> str` — lowercases, strips whitespace, joins with `|`, returns sha256 hex.
- `_extract_mapping_via_llm(client, headers, samples) -> dict` — builds the system prompt, calls `openrouter.call_llm`, parses the JSON envelope, raises `LLMParseError` on malformed output.
- `_validate_mapping(mapping, headers, first_row) -> None` — every named column must exist in `headers`; `qty_col`'s value on `first_row` must parse as a positive number; `cost_col` (if present) must parse as a number. Raises `LLMParseError` on failure.
- `_apply_mapping(rows, mapping) -> ParsedPie` — iterates remaining rows, builds `ParsedPosition` instances, computes totals from `qty * avg_cost` when explicit totals aren't present.
- `_synthesise_dummy(headers, mapping) -> dict` — produces the placeholder row for `sample_dummy`.
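
A minimal sketch of the dialect-detection helper, under a few assumptions not stated above: the input is UTF-8, the candidate delimiters are `,`, `;`, tab, and `|`, a header row must have at least two non-empty cells, and when `csv.Sniffer` cannot decide (which can happen when preamble lines skew its statistics) the most frequent candidate wins.

```python
import csv
import io

CANDIDATES = ",;\t|"  # assumed candidate set


def _is_number(token: str) -> bool:
    try:
        float(token)
    except ValueError:
        return False
    return True


def detect_dialect(raw: bytes) -> tuple[str, int]:
    """Return (delimiter, preamble_rows) for a CSV upload."""
    text = raw.decode("utf-8-sig")  # assumption: UTF-8, optional BOM
    try:
        delimiter = csv.Sniffer().sniff(text[:4096], delimiters=CANDIDATES).delimiter
    except csv.Error:
        # Fallback heuristic: pick the candidate that occurs most often.
        delimiter = max(CANDIDATES, key=text.count)
    for index, row in enumerate(csv.reader(io.StringIO(text), delimiter=delimiter)):
        cells = [cell.strip() for cell in row if cell.strip()]
        # Header heuristic: several cells, none of which parses as a number.
        if len(cells) >= 2 and not any(_is_number(cell) for cell in cells):
            return delimiter, index
    raise ValueError("no header row found")


with_preamble = (
    b"Portfolio export\n"
    b"Generated,20260501\n"
    b"Symbol,Position,Avg Price,Currency\n"
    b"AAPL,100,150.00,USD\n"
)
semicolons = b"Symbol;Position;Avg Price;Currency\nAAPL;100;150,00;EUR\nMSFT;50;300,00;EUR\n"
```

The "none parse as numbers" heuristic alone would accept a preamble line such as `Account,Personal`, which is why the sketch leans on sample files whose preamble rows are single-celled or contain a number; a production helper would likely need a stronger signal.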
Reuses without modification:
- `app/services/openrouter.py::call_llm` — provider fallback chain + AICall ledger logging
- `app/services/csv_import.py::ParsedPie, ParsedPosition, CSVImportError` — same return type, same error hierarchy. `LLMParseError` inherits from `CSVImportError` so the route can catch both as one.
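
To make the shared return type and error hierarchy concrete, here is a hedged sketch of applying a mapping to rows. The `ParsedPosition` and `ParsedPie` dataclasses below are simplified stand-ins with guessed field names; the real classes live in `app/services/csv_import.py` and may differ. Skipping rows with an empty ticker cell is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


class CSVImportError(ValueError):
    """Stand-in for app.services.csv_import.CSVImportError."""


class LLMParseError(CSVImportError):
    """Subclass, so one `except CSVImportError` catches both parsers."""


@dataclass
class ParsedPosition:  # hypothetical fields
    ticker: str
    quantity: float
    avg_cost: Optional[float] = None
    currency: Optional[str] = None


@dataclass
class ParsedPie:  # hypothetical fields
    positions: list = field(default_factory=list)
    total_cost: float = 0.0


def _cell(row: dict, mapping: dict, key: str) -> Optional[str]:
    column = mapping.get(key)
    value = (row.get(column) or "").strip() if column else ""
    return value or None


def apply_mapping(rows: list, mapping: dict) -> ParsedPie:
    """Build a ParsedPie from dict rows using a validated column-mapping."""
    pie = ParsedPie()
    for row in rows:
        ticker = _cell(row, mapping, "ticker_col")
        if ticker is None:
            continue  # assumption: blank or footer rows are skipped
        try:
            quantity = float(_cell(row, mapping, "qty_col") or "")
            cost_text = _cell(row, mapping, "cost_col")
            avg_cost = float(cost_text) if cost_text is not None else None
        except ValueError as exc:
            raise LLMParseError(f"non-numeric value in row for {ticker}") from exc
        pie.positions.append(
            ParsedPosition(ticker, quantity, avg_cost, _cell(row, mapping, "currency_col"))
        )
        if avg_cost is not None:
            pie.total_cost += quantity * avg_cost  # total from qty * avg_cost
    if not pie.positions:
        raise LLMParseError("no positions found")
    return pie


pie = apply_mapping(
    [
        {"Symbol": "AAPL", "Position": "100", "Avg Price": "150.00", "Currency": "USD"},
        {"Symbol": "MSFT", "Position": "50", "Avg Price": "300.00", "Currency": "USD"},
    ],
    {"ticker_col": "Symbol", "qty_col": "Position", "cost_col": "Avg Price",
     "currency_col": "Currency"},
)
```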
### `app/routers/universe.py::parse_portfolio` — modified
Two small changes:
1. Add `Depends(require_paid)` to the route decorator. (Portfolio import has always been advertised as paid; this aligns the implementation.)
2. Wrap the existing `parse_t212_csv` call in a try/except that falls through to `parse_with_llm` on `CSVImportError`:
```python
try:
    pie = parse_t212_csv(raw)
except CSVImportError:
    # Not a T212 export: fall through to the LLM-backed parser.
    from app.services.llm_csv_parser import parse_with_llm, LLMParseError
    try:
        pie = await parse_with_llm(raw, session)
    except LLMParseError as e:
        raise HTTPException(status_code=400, detail=str(e)) from e
```
Everything below this point in the function — resolve_slice loop, upsert_tickers, inline Yahoo fetch, response build — is unchanged. `pie` has the same shape regardless of branch.
### `app/models.py` — new model
`CsvFormatTemplate` declared alongside the other tables. Columns as in the data model table above.
### `app/templates/settings.html` — copy tweak
- Section heading: "Import portfolio (Trading 212 CSV)" → "Import portfolio (CSV)"
- Drop-zone label: "Drop a T212 pie CSV here" → "Drop your broker's portfolio CSV here"
- Drop-zone hint: append " · T212, IBKR, and others auto-detected" after the size limit
- The "Export your pie from T212" instructions paragraph stays as a help link — T212 is still the best-documented happy path — but its phrasing softens to "If you use Trading 212…"
## LLM prompt shape
System prompt fixes the schema. User message contains headers + samples.
```
SYSTEM: You are an expert at recognising broker portfolio CSV formats.
You will be given the header row and 3-5 sample data rows from a CSV.
Identify which column contains each field. Return ONLY JSON, no prose.
Schema:
{
  "ticker_col": "<header name or null>",
  "qty_col": "<header name or null>",
  "name_col": "<header name or null>",
  "cost_col": "<header name or null>",  // average price per share or unit cost
  "currency_col": "<header name or null>",
  "broker_label": "<short identifier like 'IBKR Activity Statement' or null>"
}
Rules:
- Use null when no column is a good match.
- ticker_col and qty_col are required; if either is missing return all nulls.
- Use the EXACT header string as it appears in the input.
USER: headers: ["Symbol","Position","Avg Price","Currency"]
samples:
AAPL,100,150.00,USD
MSFT,50,300.00,USD
...
```
The LLM never sees the entire file; it sees only the first ~5 data rows.
Token cost is bounded and uniform regardless of portfolio size.
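
The sketch below shows one way to assemble the user message and to parse the reply. It assumes the "safer reader path" from the open questions (rows already parsed, then re-encoded with `csv.writer`), a hard cap of five sample rows, and a reply that is bare JSON. The helper names and the local `LLMParseError` are illustrative only.

```python
import csv
import io
import json

MAX_SAMPLE_ROWS = 5


class LLMParseError(ValueError):
    """Local stand-in for the spec's LLMParseError."""


def build_user_message(headers: list[str], data_rows: list[list[str]]) -> str:
    """Render headers plus at most MAX_SAMPLE_ROWS rows for the LLM."""
    buffer = io.StringIO()
    # csv.writer quotes cells with embedded commas, keeping columns aligned.
    csv.writer(buffer, lineterminator="\n").writerows(data_rows[:MAX_SAMPLE_ROWS])
    return f"headers: {json.dumps(headers)}\nsamples:\n{buffer.getvalue()}"


def parse_mapping_reply(reply: str) -> dict:
    """Decode the LLM's JSON reply, rejecting unusable mappings."""
    try:
        mapping = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise LLMParseError("couldn't recognise as portfolio CSV") from exc
    if not isinstance(mapping, dict):
        raise LLMParseError("couldn't recognise as portfolio CSV")
    # The prompt tells the model to return all nulls when it cannot map
    # ticker or quantity, so either one missing means "no match".
    if not mapping.get("ticker_col") or not mapping.get("qty_col"):
        raise LLMParseError("couldn't recognise as portfolio CSV")
    return mapping


rows = [["AAPL", "Apple, Inc.", str(n)] for n in range(1, 8)]  # 7 fabricated rows
message = build_user_message(["Symbol", "Name", "Position"], rows)
mapping = parse_mapping_reply('{"ticker_col": "Symbol", "qty_col": "Position"}')
```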
## Error handling
| Failure | Response | Ledger |
|---|---|---|
| LLM provider down | 502 "couldn't parse — try again later" | AICall status=failed |
| LLM returns non-JSON | 400 "couldn't recognise as portfolio CSV" | AICall status=ok, no template stored |
| Mapping missing required columns (ticker/qty) | 400 same | AICall status=ok, no template stored |
| Mapping references non-existent column | 400 same | AICall status=ok, no template stored |
| Mapping validates but row parse fails on numerics | 400 same | template NOT stored |
| Cache hit but row parse fails (format drifted under us) | 400 + evict the stale template in its own commit before raising | — |
The "delete stale template on parse failure" rule is the only self-healing
behaviour: if a broker quietly changes their export shape, the next failed
re-import evicts the old mapping in a dedicated commit (so the eviction
survives the request-failure rollback) and the LLM gets another shot on the
subsequent upload. Without this, a once-good template would haunt the cache
forever. We do **not** auto-retry the LLM in the same request — too much
hidden cost on a single user action.
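
The eviction rule can be illustrated with `sqlite3` as follows. This is a simplified sketch: it uses a single connection and a one-column table, whereas a real implementation would presumably run the delete in a separate session so that committing the eviction does not also commit unrelated request work. The function name is hypothetical.

```python
import sqlite3


class CSVImportError(ValueError):
    """Stand-in for app.services.csv_import.CSVImportError."""


def quantities_from_cached_template(conn, fingerprint, qty_col, rows):
    """Parse quantities with a cached mapping; evict the template if it fails."""
    try:
        return [float(row[qty_col]) for row in rows]
    except (KeyError, ValueError) as exc:
        # Evict and commit BEFORE raising, so the caller's rollback
        # cannot resurrect the stale template.
        conn.execute(
            "DELETE FROM csv_format_templates WHERE fingerprint = ?",
            (fingerprint,),
        )
        conn.commit()
        raise CSVImportError("couldn't recognise as portfolio CSV") from exc


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE csv_format_templates (fingerprint TEXT UNIQUE NOT NULL)")
conn.execute("INSERT INTO csv_format_templates VALUES ('abc123')")
conn.commit()

try:
    # The broker renamed "Position" to "Qty": the cached mapping no longer fits.
    quantities_from_cached_template(conn, "abc123", "Position", [{"Qty": "100"}])
except CSVImportError:
    conn.rollback()  # request-level rollback, as on any failed request

remaining = conn.execute("SELECT COUNT(*) FROM csv_format_templates").fetchone()[0]
```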
## Testing
`tests/test_llm_csv_parser.py`:
- **Fingerprint stability** — case/whitespace/BOM variants of the same headers hash to the same fingerprint.
- **Cache hit path** — pre-populate a `CsvFormatTemplate` row, mock `call_llm` to fail loudly, assert it is NOT called, assert positions come out correct.
- **Cache miss path** — mock `call_llm` to return a valid mapping JSON, assert a row is inserted, assert positions come out correct, assert the synthesised `sample_dummy` contains placeholder strings only.
- **LLM returns malformed JSON** — raises `LLMParseError`, no template stored.
- **LLM maps to non-existent column** — raises `LLMParseError`, no template stored.
- **LLM maps qty to a non-numeric column** — raises `LLMParseError` on validation.
- **Stale template self-heal** — pre-populate a template that no longer matches the file, simulate row-parse failure, assert the row is deleted and a 400 returned.
- **Integration** — POST a fabricated IBKR-shaped fixture to `/api/portfolio/parse`, assert ParsedPie round-trips, assert no second LLM call on a repeat upload.
Existing `tests/test_csv_import.py` must still pass — the T212 happy path is unchanged.
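
As a sketch of the "mock `call_llm` to fail loudly" pattern, the example below uses `unittest.mock.AsyncMock` against a tiny stand-in for the cache-or-LLM branch. `resolve_mapping` and the dict-based cache are hypothetical simplifications; the real tests would exercise `parse_with_llm` against a database session.

```python
import asyncio
from unittest.mock import AsyncMock

MAPPING = {"ticker_col": "Symbol", "qty_col": "Position"}


async def resolve_mapping(fingerprint: str, cache: dict, call_llm) -> dict:
    """Hypothetical stand-in for the cache-or-LLM branch of parse_with_llm."""
    if fingerprint in cache:
        return cache[fingerprint]
    mapping = await call_llm(fingerprint)
    cache[fingerprint] = mapping
    return mapping


def test_cache_hit_skips_llm():
    # Any call to the mock raises, so an accidental LLM call fails the test.
    call_llm = AsyncMock(side_effect=AssertionError("LLM called on a cache hit"))
    mapping = asyncio.run(resolve_mapping("abc123", {"abc123": MAPPING}, call_llm))
    assert mapping == MAPPING
    call_llm.assert_not_awaited()


def test_repeat_upload_calls_llm_once():
    call_llm = AsyncMock(return_value=MAPPING)
    cache: dict = {}
    asyncio.run(resolve_mapping("abc123", cache, call_llm))
    asyncio.run(resolve_mapping("abc123", cache, call_llm))
    assert call_llm.await_count == 1
```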
## Verification
End-to-end manual check after deploy:
1. Upload a T212 fixture → existing path stays unchanged (same dashboard load behaviour).
2. Upload a fabricated IBKR CSV → first upload calls LLM, returns positions, template row created in DB.
3. Re-upload the same IBKR CSV → second call has zero LLM cost (verify by counting `ai_calls` rows before/after), `use_count` increments to 2.
4. Inspect `csv_format_templates` row: confirm `headers` matches the upload's headers, `sample_dummy` contains placeholder strings, no real holdings anywhere.
5. Upload random garbage (e.g. a screenshot renamed `.csv`) → 400 with clean error, no template stored, AICall row logged.
6. Free-tier account attempts import → 402 (paid gating).
## Open questions for the implementation plan
- Whether to read sample rows with `csv.reader` and re-encode them as text for the LLM (safer for embedded commas/quotes), or pass the raw first-N-lines verbatim. Default: the safer reader path.
- Whether to cap LLM-parsed portfolios at the same 1 MB limit as T212 (yes) and whether to add a separate cap on number-of-rows fed to the LLM as samples (yes, 5).
- Whether to log the fingerprint to the request log on cache hit/miss for operability. Default: yes, at INFO level, with `event_type="csv.format.cache_hit"` / `"csv.format.cache_miss"`.
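
If the logging default is adopted, it could look roughly like the sketch below, which attaches `event_type` and `fingerprint` to the log record through the standard `extra` argument. The logger name and the helper `log_cache_event` are assumptions; the project's actual logging setup is not described in this spec.

```python
import logging

logger = logging.getLogger("csv_import")  # assumed logger name


def log_cache_event(hit: bool, fingerprint: str) -> None:
    """Emit one INFO record per template-cache lookup."""
    event = "csv.format.cache_hit" if hit else "csv.format.cache_miss"
    logger.info(
        "csv format template lookup: %s",
        event,
        # Keys in `extra` become attributes on the LogRecord, so a structured
        # formatter can emit them as fields.
        extra={"event_type": event, "fingerprint": fingerprint},
    )
```

Only the fingerprint is logged, never header text or row values, which keeps the request log consistent with the no-holdings invariant.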