read.markets/docs/superpowers/specs/2026-05-27-llm-csv-fallback-parser-design.md
Giorgio Gilestro 0254515989 docs: refine LLM-CSV spec — keep real sample row, drop user attribution
- Drop first_seen_user_id; sample is anonymous by construction
- Rename sample_dummy → sample_row, store the upload's first real data
  row verbatim (one row, no totals, no other positions, no link to a
  user). Narrow, deliberate exception to the "no holdings persisted"
  invariant — gives the operator material for hand-writing future
  native parsers.
- Drop the cache self-heal behaviour; operator owns eviction. Reinforce
  the non-goal of auto-promoting learned formats to code.
2026-05-27 11:21:43 +02:00

14 KiB
Raw Permalink Blame History

LLM-fallback CSV parser — Design Spec

Date: 2026-05-27 Status: Draft — pending implementation plan

Context

Today the only supported broker import is Trading 212. parse_t212_csv expects T212's exact column set (Slice, Owned quantity, etc.) and raises CSVImportError on anything else. Every non-T212 user hits a wall at onboarding.

Rather than write a hand-rolled parser per broker (IBKR, Vanguard, Fidelity, Schwab, eToro, Degiro, …) — and chase format drift forever — we use an LLM as a transparent fallback. The LLM never sees holdings as data; it only looks at headers plus a handful of sample rows and returns a JSON column-mapping. Our existing Python code does the row iteration.

The first time a broker format appears, the LLM produces a mapping. We fingerprint the format (sha256 of normalized headers) and cache the mapping in a new csv_format_templates table. Every subsequent upload of the same format — by any user — replays the cached mapping deterministically, with no LLM call.

The cache row stores the header row and a single anonymous sample data row (the first row from the originating upload, verbatim). No user identifier is recorded — the row is not linked back to whoever uploaded it. The purpose of the sample is to give the operator material to look at when designing future native parsers; this collection is passive learning only, the system never attempts to author or modify parser code automatically.

Portfolio import is already advertised as a paid-only feature; we make that explicit at the route level as part of this work.

Goals

  • Accept CSV exports from any broker, not just T212.
  • Pay the LLM cost only once per format, not once per user.
  • Never persist user holdings on the server (already a system-wide invariant).
  • Surface the same response shape to the browser regardless of which parser branch ran — no client changes beyond a copy tweak.

Non-goals

  • Per-broker UI customisation. The drop-zone stays generic.
  • A human admin queue for reviewing LLM-discovered formats. Operator can inspect rows directly in the DB if curious.
  • Auto-promoting learned formats to native parsers. The operator will hand-write any native parser by looking at the collected sample rows. The system never writes or modifies code.
  • Self-healing or auto-evicting stale cache entries. If a broker silently changes their export shape under us, the cached mapping will start producing parse errors; the operator deletes the row manually. We do not invalidate cache entries automatically.
  • Multi-stage / verification LLM passes. One call per first-time format.

Architecture

POST /api/portfolio/parse  (paid-only)
├─ parse_t212_csv(raw)             ── happy path, unchanged
│   └─ CSVImportError ↴
│
├─ parse_with_llm(raw, session)
│   ├─ detect delimiter + preamble offset
│   ├─ fingerprint = sha256(normalised headers)
│   ├─ SELECT csv_format_templates WHERE fingerprint=?
│   │   ├─ HIT  → apply mapping (bump use_count/last_used_at after successful parse)
│   │   └─ MISS → openrouter.call_llm(headers + 3-5 sample rows)
│   │             → validate mapping
│   │             → INSERT csv_format_templates
│   │             → apply mapping
│   └─ returns ParsedPie  (same shape as T212 path)
│
└─ resolve_slice → upsert_tickers → inline Yahoo fetch → JSON response
   (existing pipeline, unchanged)

Why column-mapping, not full extraction

We pass the LLM only headers plus 35 sample rows, not the full CSV. The LLM returns column names, not transcribed numbers. Three benefits:

  1. Safety — LLMs hallucinate digits; they don't hallucinate column names that aren't there. Mapping validation can verify every named column exists in the actual header row.
  2. Cost — prompt is ~1 KB regardless of portfolio size.
  3. Cacheability — the mapping IS the cache. Replay is deterministic Python, no LLM in the loop on re-imports.

Why global cache, not per-user

The column structure of an IBKR Activity Statement is a property of IBKR, not of any individual user. The cache row contains no user identifier — the sample data row is stored verbatim but anonymously, with nothing linking it to the uploader. Global cache is strictly better: faster onboarding for the second IBKR user, and the collected samples form a small, useful corpus for hand-writing native parsers later.

Data model

New table csv_format_templates:

Column Type Notes
id int PK
fingerprint VARCHAR(64) UNIQUE NOT NULL sha256 hex of normalised header tuple
headers JSON List of strings — actual header row from the upload
sample_row JSON First data row from the originating upload, verbatim. Not linked to any user.
mapping JSON {ticker_col, qty_col, name_col, cost_col, currency_col}
preamble_rows INT NOT NULL DEFAULT 0 Non-data lines before the header row
delimiter CHAR(1) NOT NULL DEFAULT ','
broker_label VARCHAR(128) LLM-identified label, e.g. "Interactive Brokers Activity Statement"
first_seen_at DATETIME(tz) NOT NULL When the format was first cached
use_count INT NOT NULL DEFAULT 1 Bumped on each successful cache hit
last_used_at DATETIME(tz) NOT NULL
llm_model VARCHAR(64) Provenance of the initial extraction
llm_cost_usd FLOAT Same

Migration: alembic/versions/0021_csv_format_template.py (based on 0020).

The full uploaded CSV is not stored — only the header row plus a single data row (sample_row). No user_id column exists on this table; the sample is anonymous by construction. This is a deliberate, narrow exception to the otherwise-strict "no holdings persisted" invariant: we keep one row per format so the operator has concrete material to look at when hand-writing a future native parser. One anonymous row carries no portfolio context (no totals, no other positions) and cannot be linked back to an account.

Components

app/services/llm_csv_parser.py — new

Public surface:

async def parse_with_llm(
    raw: bytes,
    session: AsyncSession,
) -> ParsedPie:
    """LLM-fallback CSV parser.

    Decodes raw bytes, detects delimiter and preamble offset, fingerprints
    the header row, hits the csv_format_templates cache. On miss, calls
    openrouter.call_llm with headers + 3-5 sample rows to extract a
    column-mapping, validates it, persists a new template, and applies the
    mapping. Returns the same ParsedPie shape as parse_t212_csv.
    """

class LLMParseError(ValueError):
    """Raised when the LLM call fails or returns an unusable mapping."""

Internal helpers (not exported):

  • _detect_dialect(raw: bytes) -> tuple[str, int] — returns (delimiter, preamble_rows). Uses Python's csv.Sniffer for delimiter, then walks rows until the first row whose tokens look like column headers (heuristic: all-strings, none parse as numbers).
  • _fingerprint(headers: list[str]) -> str — lowercases, strips whitespace, joins with |, returns sha256 hex.
  • _extract_mapping_via_llm(client, headers, samples) -> dict — builds the system prompt, calls openrouter.call_llm, parses the JSON envelope, raises LLMParseError on malformed output.
  • _validate_mapping(mapping, headers, first_row) -> None — every named column must exist in headers; qty_col's value on first_row must parse as a positive number; cost_col (if present) must parse as a number. Raises LLMParseError on failure.
  • _apply_mapping(rows, mapping) -> ParsedPie — iterates remaining rows, builds ParsedPosition instances, computes totals from qty * avg_cost when explicit totals aren't present.

Reuses without modification:

  • app/services/openrouter.py::call_llm — provider fallback chain + AICall ledger logging
  • app/services/csv_import.py::ParsedPie, ParsedPosition, CSVImportError — same return type, same error hierarchy. LLMParseError inherits from CSVImportError so the route can catch both as one.

app/routers/universe.py::parse_portfolio — modified

Two small changes:

  1. Add Depends(require_paid) to the route decorator. (Portfolio import has always been advertised as paid; this aligns the implementation.)
  2. Wrap the existing parse_t212_csv call in a try/except that falls through to parse_with_llm on CSVImportError:
try:
    pie = parse_t212_csv(raw)
except CSVImportError:
    from app.services.llm_csv_parser import parse_with_llm, LLMParseError
    try:
        pie = await parse_with_llm(raw, session)
    except LLMParseError as e:
        raise HTTPException(status_code=400, detail=str(e))

Everything below this point in the function — resolve_slice loop, upsert_tickers, inline Yahoo fetch, response build — is unchanged. pie has the same shape regardless of branch.

app/models.py — new model

CsvFormatTemplate declared alongside the other tables. Columns as in the data model table above.

app/templates/settings.html — copy tweak

  • Section heading: "Import portfolio (Trading 212 CSV)" → "Import portfolio (CSV)"
  • Drop-zone label: "Drop a T212 pie CSV here" → "Drop your broker's portfolio CSV here"
  • Drop-zone hint: append " · T212, IBKR, and others auto-detected" after the size limit
  • The "Export your pie from T212" instructions paragraph stays as a help link — T212 is still the best-documented happy path — but its phrasing softens to "If you use Trading 212…"

LLM prompt shape

System prompt fixes the schema. User message contains headers + samples.

SYSTEM: You are an expert at recognising broker portfolio CSV formats.
You will be given the header row and 3-5 sample data rows from a CSV.
Identify which column contains each field. Return ONLY JSON, no prose.

Schema:
{
  "ticker_col": "<header name or null>",
  "qty_col":    "<header name or null>",
  "name_col":   "<header name or null>",
  "cost_col":   "<header name or null>",    // average price per share or unit cost
  "currency_col": "<header name or null>",
  "broker_label": "<short identifier like 'IBKR Activity Statement' or null>"
}

Rules:
- Use null when no column is a good match.
- ticker_col and qty_col are required; if either is missing return all nulls.
- Use the EXACT header string as it appears in the input.

USER: headers: ["Symbol","Position","Avg Price","Currency"]
samples:
  AAPL,100,150.00,USD
  MSFT,50,300.00,USD
  ...

The LLM never sees the entire file; it sees only the first ~5 data rows. Token cost is bounded and uniform regardless of portfolio size.

Error handling

Failure Response Ledger
LLM provider down 502 "couldn't parse — try again later" AICall status=failed
LLM returns non-JSON 400 "couldn't recognise as portfolio CSV" AICall status=ok, no template stored
Mapping missing required columns (ticker/qty) 400 same AICall status=ok, no template stored
Mapping references non-existent column 400 same AICall status=ok, no template stored
Mapping validates but row parse fails on numerics 400 same template NOT stored
Cache hit but row parse fails (format drifted under us) 400 with parse error

If a broker quietly changes their CSV shape such that a previously-good cached mapping starts producing parse failures, the user sees an error and the operator deletes the offending csv_format_templates row by hand. No automatic eviction, no automatic retry. The cache is a learning store, not a self-managing system.

Testing

tests/test_llm_csv_parser.py:

  • Fingerprint stability — case/whitespace/BOM variants of the same headers hash to the same fingerprint.
  • Cache hit path — pre-populate a CsvFormatTemplate row, mock call_llm to fail loudly, assert it is NOT called, assert positions come out correct, assert use_count is incremented.
  • Cache miss path — mock call_llm to return a valid mapping JSON, assert a row is inserted with the upload's actual first data row as sample_row and no user_id anywhere, assert positions come out correct.
  • LLM returns malformed JSON — raises LLMParseError, no template stored.
  • LLM maps to non-existent column — raises LLMParseError, no template stored.
  • LLM maps qty to a non-numeric column — raises LLMParseError on validation.
  • Stale cached mapping on parse failure — pre-populate a template whose mapping no longer matches the file content, assert a 400 is returned and the template is NOT deleted automatically (operator owns eviction).
  • Integration — POST a fabricated IBKR-shaped fixture to /api/portfolio/parse, assert ParsedPie round-trips, assert no second LLM call on a repeat upload.

Existing tests/test_csv_import.py must still pass — the T212 happy path is unchanged.

Verification

End-to-end manual check after deploy:

  1. Upload a T212 fixture → exists path stays unchanged (same dashboard load behaviour).
  2. Upload a fabricated IBKR CSV → first upload calls LLM, returns positions, template row created in DB.
  3. Re-upload the same IBKR CSV → second call has zero LLM cost (verify by counting ai_calls rows before/after), use_count increments to 2.
  4. Inspect csv_format_templates row: confirm headers matches the upload's headers, sample_row is the first real data row, no user_id column exists on the table.
  5. Upload random garbage (e.g. a screenshot renamed .csv) → 400 with clean error, no template stored, AICall row logged.
  6. Free-tier account attempts import → 402 (paid gating).

Open questions for the implementation plan

  • Whether to read sample rows with csv.reader and re-encode them as text for the LLM (safer for embedded commas/quotes), or pass the raw first-N-lines verbatim. Default: the safer reader path.
  • Whether to cap LLM-parsed portfolios at the same 1 MB limit as T212 (yes) and whether to add a separate cap on number-of-rows fed to the LLM as samples (yes, 5).
  • Whether to log the fingerprint to the request log on cache hit/miss for operability. Default: yes, at INFO level, with event_type="csv.format.cache_hit" / "csv.format.cache_miss".