read.markets/docs/superpowers/specs/2026-05-27-llm-csv-fallback-parser-design.md
Giorgio Gilestro 0254515989 docs: refine LLM-CSV spec — keep real sample row, drop user attribution
- Drop first_seen_user_id; sample is anonymous by construction
- Rename sample_dummy → sample_row, store the upload's first real data
  row verbatim (one row, no totals, no other positions, no link to a
  user). Narrow, deliberate exception to the "no holdings persisted"
  invariant — gives the operator material for hand-writing future
  native parsers.
- Drop the cache self-heal behaviour; operator owns eviction. Reinforce
  the non-goal of auto-promoting learned formats to code.
2026-05-27 11:21:43 +02:00

279 lines
14 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# LLM-fallback CSV parser — Design Spec
**Date:** 2026-05-27
**Status:** Draft — pending implementation plan
## Context
Today the only supported broker import is Trading 212. `parse_t212_csv` expects
T212's exact column set (`Slice`, `Owned quantity`, etc.) and raises
`CSVImportError` on anything else. Every non-T212 user hits a wall at
onboarding.
Rather than write a hand-rolled parser per broker (IBKR, Vanguard, Fidelity,
Schwab, eToro, Degiro, …) — and chase format drift forever — we use an LLM as
a transparent fallback. The LLM never sees holdings as data; it only looks at
**headers plus a handful of sample rows** and returns a JSON column-mapping.
Our existing Python code does the row iteration.
The first time a broker format appears, the LLM produces a mapping. We
fingerprint the format (sha256 of normalized headers) and cache the mapping
in a new `csv_format_templates` table. Every subsequent upload of the same
format — by any user — replays the cached mapping deterministically, with no
LLM call.
The cache row stores the header row and a single anonymous sample data row
(the first row from the originating upload, verbatim). No user identifier is
recorded — the row is not linked back to whoever uploaded it. The purpose of
the sample is to give the operator material to look at when designing future
native parsers; this collection is **passive learning only**, the system
never attempts to author or modify parser code automatically.
Portfolio import is already advertised as a paid-only feature; we make that
explicit at the route level as part of this work.
## Goals
- Accept CSV exports from any broker, not just T212.
- Pay the LLM cost only once per **format**, not once per user.
- Never persist user holdings on the server (already a system-wide invariant).
- Surface the same response shape to the browser regardless of which parser
branch ran — no client changes beyond a copy tweak.
## Non-goals
- Per-broker UI customisation. The drop-zone stays generic.
- A human admin queue for reviewing LLM-discovered formats. Operator can
inspect rows directly in the DB if curious.
- **Auto-promoting learned formats to native parsers.** The operator will
hand-write any native parser by looking at the collected sample rows. The
system never writes or modifies code.
- Self-healing or auto-evicting stale cache entries. If a broker silently
changes their export shape under us, the cached mapping will start
producing parse errors; the operator deletes the row manually. We do not
invalidate cache entries automatically.
- Multi-stage / verification LLM passes. One call per first-time format.
## Architecture
```
POST /api/portfolio/parse (paid-only)
├─ parse_t212_csv(raw) ── happy path, unchanged
│ └─ CSVImportError ↴
├─ parse_with_llm(raw, session)
│ ├─ detect delimiter + preamble offset
│ ├─ fingerprint = sha256(normalised headers)
│ ├─ SELECT csv_format_templates WHERE fingerprint=?
│ │ ├─ HIT → apply mapping (bump use_count/last_used_at after successful parse)
│ │ └─ MISS → openrouter.call_llm(headers + 3-5 sample rows)
│ │ → validate mapping
│ │ → INSERT csv_format_templates
│ │ → apply mapping
│ └─ returns ParsedPie (same shape as T212 path)
└─ resolve_slice → upsert_tickers → inline Yahoo fetch → JSON response
(existing pipeline, unchanged)
```
### Why column-mapping, not full extraction
We pass the LLM only **headers plus 35 sample rows**, not the full CSV. The
LLM returns column names, not transcribed numbers. Three benefits:
1. **Safety** — LLMs hallucinate digits; they don't hallucinate column names
that aren't there. Mapping validation can verify every named column exists
in the actual header row.
2. **Cost** — prompt is ~1 KB regardless of portfolio size.
3. **Cacheability** — the mapping IS the cache. Replay is deterministic Python,
no LLM in the loop on re-imports.
### Why global cache, not per-user
The column structure of an IBKR Activity Statement is a property of IBKR, not
of any individual user. The cache row contains no user identifier — the
sample data row is stored verbatim but anonymously, with nothing linking it
to the uploader. Global cache is strictly better: faster onboarding for the
second IBKR user, and the collected samples form a small, useful corpus for
hand-writing native parsers later.
## Data model
New table `csv_format_templates`:
| Column | Type | Notes |
|---|---|---|
| `id` | int PK | |
| `fingerprint` | `VARCHAR(64) UNIQUE NOT NULL` | sha256 hex of normalised header tuple |
| `headers` | JSON | List of strings — actual header row from the upload |
| `sample_row` | JSON | First data row from the originating upload, verbatim. Not linked to any user. |
| `mapping` | JSON | `{ticker_col, qty_col, name_col, cost_col, currency_col}` |
| `preamble_rows` | INT NOT NULL DEFAULT 0 | Non-data lines before the header row |
| `delimiter` | CHAR(1) NOT NULL DEFAULT ',' | |
| `broker_label` | VARCHAR(128) | LLM-identified label, e.g. "Interactive Brokers Activity Statement" |
| `first_seen_at` | DATETIME(tz) NOT NULL | When the format was first cached |
| `use_count` | INT NOT NULL DEFAULT 1 | Bumped on each successful cache hit |
| `last_used_at` | DATETIME(tz) NOT NULL | |
| `llm_model` | VARCHAR(64) | Provenance of the initial extraction |
| `llm_cost_usd` | FLOAT | Same |
Migration: `alembic/versions/0021_csv_format_template.py` (based on `0020`).
The full uploaded CSV is **not** stored — only the header row plus a single
data row (`sample_row`). No `user_id` column exists on this table; the sample
is anonymous by construction. This is a deliberate, narrow exception to the
otherwise-strict "no holdings persisted" invariant: we keep one row per
format so the operator has concrete material to look at when hand-writing a
future native parser. One anonymous row carries no portfolio context (no
totals, no other positions) and cannot be linked back to an account.
## Components
### `app/services/llm_csv_parser.py` — new
Public surface:
```python
async def parse_with_llm(
raw: bytes,
session: AsyncSession,
) -> ParsedPie:
"""LLM-fallback CSV parser.
Decodes raw bytes, detects delimiter and preamble offset, fingerprints
the header row, hits the csv_format_templates cache. On miss, calls
openrouter.call_llm with headers + 3-5 sample rows to extract a
column-mapping, validates it, persists a new template, and applies the
mapping. Returns the same ParsedPie shape as parse_t212_csv.
"""
class LLMParseError(ValueError):
"""Raised when the LLM call fails or returns an unusable mapping."""
```
Internal helpers (not exported):
- `_detect_dialect(raw: bytes) -> tuple[str, int]` — returns `(delimiter, preamble_rows)`. Uses Python's `csv.Sniffer` for delimiter, then walks rows until the first row whose tokens look like column headers (heuristic: all-strings, none parse as numbers).
- `_fingerprint(headers: list[str]) -> str` — lowercases, strips whitespace, joins with `|`, returns sha256 hex.
- `_extract_mapping_via_llm(client, headers, samples) -> dict` — builds the system prompt, calls `openrouter.call_llm`, parses the JSON envelope, raises `LLMParseError` on malformed output.
- `_validate_mapping(mapping, headers, first_row) -> None` — every named column must exist in `headers`; `qty_col`'s value on `first_row` must parse as a positive number; `cost_col` (if present) must parse as a number. Raises `LLMParseError` on failure.
- `_apply_mapping(rows, mapping) -> ParsedPie` — iterates remaining rows, builds `ParsedPosition` instances, computes totals from `qty * avg_cost` when explicit totals aren't present.
Reuses without modification:
- `app/services/openrouter.py::call_llm` — provider fallback chain + AICall ledger logging
- `app/services/csv_import.py::ParsedPie, ParsedPosition, CSVImportError` — same return type, same error hierarchy. `LLMParseError` inherits from `CSVImportError` so the route can catch both as one.
### `app/routers/universe.py::parse_portfolio` — modified
Two small changes:
1. Add `Depends(require_paid)` to the route decorator. (Portfolio import has always been advertised as paid; this aligns the implementation.)
2. Wrap the existing `parse_t212_csv` call in a try/except that falls through to `parse_with_llm` on `CSVImportError`:
```python
try:
pie = parse_t212_csv(raw)
except CSVImportError:
from app.services.llm_csv_parser import parse_with_llm, LLMParseError
try:
pie = await parse_with_llm(raw, session)
except LLMParseError as e:
raise HTTPException(status_code=400, detail=str(e))
```
Everything below this point in the function — resolve_slice loop, upsert_tickers, inline Yahoo fetch, response build — is unchanged. `pie` has the same shape regardless of branch.
### `app/models.py` — new model
`CsvFormatTemplate` declared alongside the other tables. Columns as in the data model table above.
### `app/templates/settings.html` — copy tweak
- Section heading: "Import portfolio (Trading 212 CSV)" → "Import portfolio (CSV)"
- Drop-zone label: "Drop a T212 pie CSV here" → "Drop your broker's portfolio CSV here"
- Drop-zone hint: append " · T212, IBKR, and others auto-detected" after the size limit
- The "Export your pie from T212" instructions paragraph stays as a help link — T212 is still the best-documented happy path — but its phrasing softens to "If you use Trading 212…"
## LLM prompt shape
System prompt fixes the schema. User message contains headers + samples.
```
SYSTEM: You are an expert at recognising broker portfolio CSV formats.
You will be given the header row and 3-5 sample data rows from a CSV.
Identify which column contains each field. Return ONLY JSON, no prose.
Schema:
{
"ticker_col": "<header name or null>",
"qty_col": "<header name or null>",
"name_col": "<header name or null>",
"cost_col": "<header name or null>", // average price per share or unit cost
"currency_col": "<header name or null>",
"broker_label": "<short identifier like 'IBKR Activity Statement' or null>"
}
Rules:
- Use null when no column is a good match.
- ticker_col and qty_col are required; if either is missing return all nulls.
- Use the EXACT header string as it appears in the input.
USER: headers: ["Symbol","Position","Avg Price","Currency"]
samples:
AAPL,100,150.00,USD
MSFT,50,300.00,USD
...
```
The LLM never sees the entire file; it sees only the first ~5 data rows.
Token cost is bounded and uniform regardless of portfolio size.
## Error handling
| Failure | Response | Ledger |
|---|---|---|
| LLM provider down | 502 "couldn't parse — try again later" | AICall status=failed |
| LLM returns non-JSON | 400 "couldn't recognise as portfolio CSV" | AICall status=ok, no template stored |
| Mapping missing required columns (ticker/qty) | 400 same | AICall status=ok, no template stored |
| Mapping references non-existent column | 400 same | AICall status=ok, no template stored |
| Mapping validates but row parse fails on numerics | 400 same | template NOT stored |
| Cache hit but row parse fails (format drifted under us) | 400 with parse error | — |
If a broker quietly changes their CSV shape such that a previously-good
cached mapping starts producing parse failures, the user sees an error and
the operator deletes the offending `csv_format_templates` row by hand. No
automatic eviction, no automatic retry. The cache is a learning store, not
a self-managing system.
## Testing
`tests/test_llm_csv_parser.py`:
- **Fingerprint stability** — case/whitespace/BOM variants of the same headers hash to the same fingerprint.
- **Cache hit path** — pre-populate a `CsvFormatTemplate` row, mock `call_llm` to fail loudly, assert it is NOT called, assert positions come out correct, assert `use_count` is incremented.
- **Cache miss path** — mock `call_llm` to return a valid mapping JSON, assert a row is inserted with the upload's actual first data row as `sample_row` and no user_id anywhere, assert positions come out correct.
- **LLM returns malformed JSON** — raises `LLMParseError`, no template stored.
- **LLM maps to non-existent column** — raises `LLMParseError`, no template stored.
- **LLM maps qty to a non-numeric column** — raises `LLMParseError` on validation.
- **Stale cached mapping on parse failure** — pre-populate a template whose mapping no longer matches the file content, assert a 400 is returned and the template is NOT deleted automatically (operator owns eviction).
- **Integration** — POST a fabricated IBKR-shaped fixture to `/api/portfolio/parse`, assert ParsedPie round-trips, assert no second LLM call on a repeat upload.
Existing `tests/test_csv_import.py` must still pass — the T212 happy path is unchanged.
## Verification
End-to-end manual check after deploy:
1. Upload a T212 fixture → exists path stays unchanged (same dashboard load behaviour).
2. Upload a fabricated IBKR CSV → first upload calls LLM, returns positions, template row created in DB.
3. Re-upload the same IBKR CSV → second call has zero LLM cost (verify by counting `ai_calls` rows before/after), `use_count` increments to 2.
4. Inspect `csv_format_templates` row: confirm `headers` matches the upload's headers, `sample_row` is the first real data row, no `user_id` column exists on the table.
5. Upload random garbage (e.g. a screenshot renamed `.csv`) → 400 with clean error, no template stored, AICall row logged.
6. Free-tier account attempts import → 402 (paid gating).
## Open questions for the implementation plan
- Whether to read sample rows with `csv.reader` and re-encode them as text for the LLM (safer for embedded commas/quotes), or pass the raw first-N-lines verbatim. Default: the safer reader path.
- Whether to cap LLM-parsed portfolios at the same 1 MB limit as T212 (yes) and whether to add a separate cap on number-of-rows fed to the LLM as samples (yes, 5).
- Whether to log the fingerprint to the request log on cache hit/miss for operability. Default: yes, at INFO level, with `event_type="csv.format.cache_hit"` / `"csv.format.cache_miss"`.