Giorgio Gilestro 263ecc0d3b docs: spec for LLM-fallback CSV parser

Transparent fallback after parse_t212_csv: LLM extracts a column-mapping
(not the data), result is cached globally by header fingerprint, replay
is deterministic Python. Stored dummy contains headers + synthetic row
only — no user holdings ever persisted.

2026-05-27 11:15:42 +02:00


LLM-fallback CSV parser — Design Spec

Date: 2026-05-27
Status: Draft, pending implementation plan

Context

Today the only supported broker import is Trading 212. parse_t212_csv expects T212's exact column set (Slice, Owned quantity, etc.) and raises CSVImportError on anything else. Every non-T212 user hits a wall at onboarding.

Rather than write a hand-rolled parser per broker (IBKR, Vanguard, Fidelity, Schwab, eToro, Degiro, …) and chase format drift forever, we use an LLM as a transparent fallback. The LLM never extracts or transcribes holdings; it sees only the header row plus 3-5 sample rows and returns a JSON column-mapping. Our existing Python code does the row iteration.

The first time a broker format appears, the LLM produces a mapping. We fingerprint the format (sha256 of normalised headers) and cache the mapping in a new csv_format_templates table. Every subsequent upload of the same format, by any user, replays the cached mapping deterministically, with no LLM call.
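A minimal sketch of the fingerprint step, assuming the header row has already been split into a list of strings. The function name is illustrative, and dropping a byte-order mark is an assumption borrowed from the fingerprint-stability test listed under Testing; the spec itself only names lowercasing, whitespace stripping, and joining with `|`.

```python
import hashlib


def fingerprint(headers: list[str]) -> str:
    """Return the sha256 hex digest of the normalised header row."""
    normalised = [
        # Assumption: a BOM glued to a header cell is dropped before hashing.
        header.replace("\ufeff", "").strip().lower()
        for header in headers
    ]
    return hashlib.sha256("|".join(normalised).encode("utf-8")).hexdigest()


# Two cosmetic variants of the same export should share one cache key.
clean = fingerprint(["Symbol", "Position", "Avg Price"])
messy = fingerprint(["\ufeffsymbol ", " POSITION", "avg price"])
```

Because only the headers feed the hash, two users with entirely different holdings but the same broker export land on the same template row.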

The cache row stores headers and a synthetic placeholder ("dummy") row, never real user data. The mapping itself is a column-name dictionary, also free of holdings.

Portfolio import is already advertised as a paid-only feature; we make that explicit at the route level as part of this work.

Goals

Accept CSV exports from any broker, not just T212.
Pay the LLM cost only once per format, not once per user.
Never persist user holdings on the server (already a system-wide invariant).
Surface the same response shape to the browser regardless of which parser branch ran — no client changes beyond a copy tweak.

Non-goals

Per-broker UI customisation. The drop-zone stays generic.
A human admin queue for reviewing LLM-discovered formats. The operator can inspect rows directly in the DB if curious.
Promoting "trusted" formats to native parsers. That's a future evolution if a single broker dominates LLM-parsed traffic.
Multi-stage / verification LLM passes. One call per first-time format.

Architecture

POST /api/portfolio/parse  (paid-only)
├─ parse_t212_csv(raw)             ── happy path, unchanged
│   └─ CSVImportError ↴
│
├─ parse_with_llm(raw, session)
│   ├─ detect delimiter + preamble offset
│   ├─ fingerprint = sha256(normalised headers)
│   ├─ SELECT csv_format_templates WHERE fingerprint=?
│   │   ├─ HIT  → apply mapping (bump use_count/last_used_at after successful parse)
│   │   └─ MISS → openrouter.call_llm(headers + 3-5 sample rows)
│   │             → validate mapping
│   │             → INSERT csv_format_templates
│   │             → apply mapping
│   └─ returns ParsedPie  (same shape as T212 path)
│
└─ resolve_slice → upsert_tickers → inline Yahoo fetch → JSON response
   (existing pipeline, unchanged)
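The HIT/MISS branch in the diagram can be sketched as follows. This is an illustration only: it uses the standard-library `sqlite3` module and a trimmed table in place of the real async session and model, `extract_mapping` is a hypothetical callable standing in for the LLM call, and mapping validation is left out for brevity.

```python
import json
import sqlite3
from typing import Callable

# Trimmed stand-in for csv_format_templates (assumption: sqlite3, not the ORM).
SCHEMA = """
CREATE TABLE csv_format_templates (
    id INTEGER PRIMARY KEY,
    fingerprint VARCHAR(64) UNIQUE NOT NULL,
    headers TEXT NOT NULL,
    mapping TEXT NOT NULL,
    use_count INTEGER NOT NULL DEFAULT 1
)
"""


def get_or_create_mapping(
    conn: sqlite3.Connection,
    fingerprint: str,
    headers: list[str],
    extract_mapping: Callable[[list[str]], dict],
) -> tuple[dict, bool]:
    """Return (mapping, cache_hit) for a header fingerprint."""
    row = conn.execute(
        "SELECT mapping FROM csv_format_templates WHERE fingerprint = ?",
        (fingerprint,),
    ).fetchone()
    if row is not None:
        return json.loads(row[0]), True  # HIT: no LLM involved
    mapping = extract_mapping(headers)  # MISS: the only place an LLM is called
    conn.execute(
        "INSERT INTO csv_format_templates (fingerprint, headers, mapping) "
        "VALUES (?, ?, ?)",
        (fingerprint, json.dumps(headers), json.dumps(mapping)),
    )
    conn.commit()
    return mapping, False


def record_use(conn: sqlite3.Connection, fingerprint: str) -> None:
    """Bump use_count; meant to run only after a cache-hit parse succeeded."""
    conn.execute(
        "UPDATE csv_format_templates SET use_count = use_count + 1 "
        "WHERE fingerprint = ?",
        (fingerprint,),
    )
    conn.commit()
```

Passing the extractor in as a callable keeps the cache logic testable without any network access, which is also how the cache-hit test can assert that the LLM is never called.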

Why column-mapping, not full extraction

We pass the LLM only headers plus 3–5 sample rows, not the full CSV. The LLM returns column names, not transcribed numbers. Three benefits:

Safety: LLMs can hallucinate digits, but a hallucinated column name is caught mechanically, because mapping validation verifies that every named column exists in the actual header row.
Cost: the prompt is ~1 KB regardless of portfolio size.
Cacheability: the mapping IS the cache. Replay is deterministic Python, with no LLM in the loop on re-imports.
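The safety argument rests on a check that needs no model in the loop. A minimal sketch of such a validator, assuming the mapping is a plain dict keyed as in the data model and that cell values are strings; the `LLMParseError` here is a local stand-in, and rejecting non-finite numbers is an added assumption.

```python
import math
from typing import Optional

MAPPING_FIELDS = ("ticker_col", "qty_col", "name_col", "cost_col", "currency_col")
REQUIRED_FIELDS = ("ticker_col", "qty_col")


class LLMParseError(ValueError):
    """Local stand-in for the spec's error type."""


def _to_number(text: str) -> Optional[float]:
    """Parse a finite float, or return None when the text is not numeric."""
    try:
        value = float(text.strip())
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def validate_mapping(mapping: dict, headers: list[str], first_row: list[str]) -> None:
    """Raise LLMParseError unless the mapping fits the header and first data row."""
    for field in REQUIRED_FIELDS:
        if not mapping.get(field):
            raise LLMParseError(f"mapping has no {field}")
    for field in MAPPING_FIELDS:
        column = mapping.get(field)
        if column is not None and column not in headers:
            raise LLMParseError(f"{field} names unknown column {column!r}")
    row = dict(zip(headers, first_row))
    quantity = _to_number(row.get(mapping["qty_col"], ""))
    if quantity is None or quantity <= 0:
        raise LLMParseError("qty_col does not hold a positive number")
    cost_column = mapping.get("cost_col")
    if cost_column is not None and _to_number(row.get(cost_column, "")) is None:
        raise LLMParseError("cost_col does not hold a number")
```

A mapping that points `qty_col` at a text column such as `Currency` fails the numeric check, so a plausible-looking but wrong mapping never reaches the cache.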

Why global cache, not per-user

The column structure of an IBKR Activity Statement is a property of IBKR, not of any individual user. The stored "dummy" contains no PII (column headers are public; the sample row is synthetic). A global cache is therefore strictly better than a per-user one: the second IBKR user onboards without an LLM call, and the operator still gets a first_seen_user_id audit column for forensic traceability.

Data model

New table csv_format_templates:

Column	Type	Notes
`id`	int PK
`fingerprint`	VARCHAR(64) UNIQUE NOT NULL	sha256 hex of normalised header tuple
`headers`	JSON	List of strings — actual header row, no PII
`sample_dummy`	JSON	One synthetic placeholder row for human eyeball
`mapping`	JSON	`{ticker_col, qty_col, name_col, cost_col, currency_col}`
`preamble_rows`	INT NOT NULL DEFAULT 0	Non-data lines before the header row
`delimiter`	CHAR(1) NOT NULL DEFAULT ','
`broker_label`	VARCHAR(128)	LLM-identified label, e.g. "Interactive Brokers Activity Statement"
`first_seen_user_id`	INT NULL, FK users(id) ON DELETE SET NULL	Audit only
`first_seen_at`	DATETIME(tz) NOT NULL
`use_count`	INT NOT NULL DEFAULT 1	Bumped after each successful cache-hit parse
`last_used_at`	DATETIME(tz) NOT NULL
`llm_model`	VARCHAR(64)	Provenance of the initial extraction
`llm_cost_usd`	FLOAT	Same

Migration: alembic/versions/0021_csv_format_template.py (based on 0020).

No raw CSV bytes are ever stored. headers and sample_dummy are the only payloads. sample_dummy is synthesised post-extraction by replacing column values with placeholder strings ("TICKER", "100", "1.50") keyed to the mapping — the operator can eyeball the format shape without seeing any real holdings.
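One way the placeholder row could be synthesised is sketched below. The placeholders for ticker, quantity, and cost come from the paragraph above; the `NAME` and `CUR` placeholders and the empty string for unmapped columns are assumptions made for illustration.

```python
# Placeholder per mapping key; "NAME" and "CUR" are assumed values.
PLACEHOLDERS = {
    "ticker_col": "TICKER",
    "qty_col": "100",
    "cost_col": "1.50",
    "name_col": "NAME",
    "currency_col": "CUR",
}


def synthesise_dummy(headers: list[str], mapping: dict) -> dict:
    """Build a placeholder row keyed by header; no input data is read."""
    by_header = {
        mapping[key]: placeholder
        for key, placeholder in PLACEHOLDERS.items()
        if mapping.get(key)
    }
    # Columns the mapping does not use get an empty string (assumption).
    return {header: by_header.get(header, "") for header in headers}
```

The function takes only the headers and the mapping as input, so no uploaded value can reach `sample_dummy` through it.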

Components

`app/services/llm_csv_parser.py` — new

Public surface:

from sqlalchemy.ext.asyncio import AsyncSession  # assumed: SQLAlchemy async session

from app.services.csv_import import CSVImportError, ParsedPie


async def parse_with_llm(
    raw: bytes,
    session: AsyncSession,
) -> ParsedPie:
    """LLM-fallback CSV parser.

    Decodes raw bytes, detects delimiter and preamble offset, fingerprints
    the header row, hits the csv_format_templates cache. On miss, calls
    openrouter.call_llm with headers + 3-5 sample rows to extract a
    column-mapping, validates it, persists a new template, and applies the
    mapping. Returns the same ParsedPie shape as parse_t212_csv.
    """


class LLMParseError(CSVImportError):
    """Raised when the LLM call fails or returns an unusable mapping."""

Internal helpers (not exported):

_detect_dialect(raw: bytes) -> tuple[str, int] — returns (delimiter, preamble_rows). Uses Python's csv.Sniffer for delimiter, then walks rows until the first row whose tokens look like column headers (heuristic: all-strings, none parse as numbers).
_fingerprint(headers: list[str]) -> str — lowercases, strips whitespace, joins with |, returns sha256 hex.
_extract_mapping_via_llm(client, headers, samples) -> dict — builds the system prompt, calls openrouter.call_llm, parses the JSON envelope, raises LLMParseError on malformed output.
_validate_mapping(mapping, headers, first_row) -> None — every named column must exist in headers; qty_col's value on first_row must parse as a positive number; cost_col (if present) must parse as a number. Raises LLMParseError on failure.
_apply_mapping(rows, mapping) -> ParsedPie — iterates remaining rows, builds ParsedPosition instances, computes totals from qty * avg_cost when explicit totals aren't present.
_synthesise_dummy(headers, mapping) -> dict — produces the placeholder row for sample_dummy.
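A rough sketch of the dialect-detection helper described in the list above. Two details are assumptions rather than part of the spec: falling back to a comma when `csv.Sniffer` cannot decide (it can fail on files with preamble lines), and requiring at least two non-empty cells so that a one-cell title line is not mistaken for the header row.

```python
import csv
import io


def _is_number(token: str) -> bool:
    """True when the token parses as a float."""
    try:
        float(token)
    except ValueError:
        return False
    return True


def detect_dialect(raw: bytes) -> tuple[str, int]:
    """Return (delimiter, preamble_rows) for a CSV payload."""
    text = raw.decode("utf-8-sig")  # utf-8-sig drops a leading BOM if present
    try:
        delimiter = csv.Sniffer().sniff(text[:4096], delimiters=",;\t|").delimiter
    except csv.Error:
        delimiter = ","  # assumption: comma when the sniffer cannot decide
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for index, row in enumerate(reader):
        cells = [cell.strip() for cell in row]
        # Header heuristic: two or more non-empty cells, none of them numeric.
        if len(cells) >= 2 and all(cells) and not any(_is_number(c) for c in cells):
            return delimiter, index
    raise ValueError("no header row found")
```

Here `preamble_rows` counts rows as `csv.reader` yields them, blank lines included, so the same offset can be replayed on a cache hit by skipping that many reader rows.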

Reuses without modification:

app/services/openrouter.py::call_llm — provider fallback chain + AICall ledger logging
app/services/csv_import.py::ParsedPie, ParsedPosition, CSVImportError — same return type, same error hierarchy. LLMParseError inherits from CSVImportError so the route can catch both as one.

`app/routers/universe.py::parse_portfolio` — modified

Two small changes:

Add Depends(require_paid) to the route decorator. (Portfolio import has always been advertised as paid; this aligns the implementation.)
Wrap the existing parse_t212_csv call in a try/except that falls through to parse_with_llm on CSVImportError:

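try:
    pie = parse_t212_csv(raw)  # happy path, unchanged
except CSVImportError:
    # Not a T212 export: fall back to the LLM-backed parser.
    from app.services.llm_csv_parser import parse_with_llm, LLMParseError
    try:
        pie = await parse_with_llm(raw, session)
    except LLMParseError as e:
        # Unusable mapping: 400. The provider-down case (502, see Error
        # handling) is not shown here.
        raise HTTPException(status_code=400, detail=str(e)) from e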

Everything below this point in the function — resolve_slice loop, upsert_tickers, inline Yahoo fetch, response build — is unchanged. pie has the same shape regardless of branch.

`app/models.py` — new model

CsvFormatTemplate declared alongside the other tables. Columns as in the data model table above.

`app/templates/settings.html` — copy tweak

Section heading: "Import portfolio (Trading 212 CSV)" → "Import portfolio (CSV)"
Drop-zone label: "Drop a T212 pie CSV here" → "Drop your broker's portfolio CSV here"
Drop-zone hint: append " · T212, IBKR, and others auto-detected" after the size limit
The "Export your pie from T212" instructions paragraph stays as a help link — T212 is still the best-documented happy path — but its phrasing softens to "If you use Trading 212…"

LLM prompt shape

System prompt fixes the schema. User message contains headers + samples.

SYSTEM: You are an expert at recognising broker portfolio CSV formats.
You will be given the header row and 3-5 sample data rows from a CSV.
Identify which column contains each field. Return ONLY JSON, no prose.

Schema:
{
  "ticker_col": "<header name or null>",
  "qty_col":    "<header name or null>",
  "name_col":   "<header name or null>",
  "cost_col":   "<header name or null>",    // average price per share or unit cost
  "currency_col": "<header name or null>",
  "broker_label": "<short identifier like 'IBKR Activity Statement' or null>"
}

Rules:
- Use null when no column is a good match.
- ticker_col and qty_col are required; if either is missing return all nulls.
- Use the EXACT header string as it appears in the input.

USER: headers: ["Symbol","Position","Avg Price","Currency"]
samples:
  AAPL,100,150.00,USD
  MSFT,50,300.00,USD
  ...

The LLM never sees the entire file; it sees at most 5 sample rows, so token cost is bounded and does not grow with portfolio size.
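On the receiving side, the reply has to be treated as untrusted text. A minimal sketch of parsing the JSON envelope against the schema above, assuming the model's reply arrives as a plain string; the function name and the local `LLMParseError` are illustrative, and the strict "no surrounding prose" stance simply mirrors the prompt's "Return ONLY JSON" rule.

```python
import json

SCHEMA_KEYS = (
    "ticker_col",
    "qty_col",
    "name_col",
    "cost_col",
    "currency_col",
    "broker_label",
)


class LLMParseError(ValueError):
    """Local stand-in for the spec's error type."""


def parse_mapping_response(text: str) -> dict:
    """Turn the model's reply into a mapping dict, or raise LLMParseError."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise LLMParseError("LLM reply is not valid JSON") from exc
    if not isinstance(data, dict):
        raise LLMParseError("LLM reply is not a JSON object")
    mapping = {key: data.get(key) for key in SCHEMA_KEYS}  # unknown keys dropped
    for key, value in mapping.items():
        if value is not None and not isinstance(value, str):
            raise LLMParseError(f"{key} must be a header string or null")
    # The prompt asks for all nulls when ticker or quantity cannot be found.
    if mapping["ticker_col"] is None or mapping["qty_col"] is None:
        raise LLMParseError("not recognised as a portfolio CSV")
    return mapping
```

Checking that the named columns actually exist in the header row is a separate step, since this function never sees the headers.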

Error handling

Failure	Response	Ledger
LLM provider down	502 "couldn't parse — try again later"	AICall status=failed
LLM returns non-JSON	400 "couldn't recognise as portfolio CSV"	AICall status=ok, no template stored
Mapping missing required columns (ticker/qty)	400 same	AICall status=ok, no template stored
Mapping references non-existent column	400 same	AICall status=ok, no template stored
Mapping validates but row parse fails on numerics	400 same	template NOT stored
Cache hit but row parse fails (format drifted under us)	400 + evict the stale template in its own commit before raising	—

The "delete stale template on parse failure" rule is the only self-healing behaviour: if a broker quietly changes their export shape, the next failed re-import evicts the old mapping in a dedicated commit (so the eviction survives the request-failure rollback) and the LLM gets another shot on the subsequent upload. Without this, a once-good template would haunt the cache forever. We do not auto-retry the LLM in the same request — too much hidden cost on a single user action.
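The eviction rule can be sketched with the standard-library `sqlite3` module standing in for the real session. The function name, the dict-shaped rows, and the reduced (ticker, quantity) output are assumptions for illustration; the point is the ordering of rollback, delete, commit, and raise.

```python
import sqlite3


class LLMParseError(ValueError):
    """Local stand-in for the spec's error type."""


def apply_cached_mapping(
    conn: sqlite3.Connection,
    fingerprint: str,
    mapping: dict,
    rows: list[dict],
) -> list[tuple[str, float]]:
    """Apply a cached mapping; evict the template if the rows no longer fit."""
    try:
        return [
            (row[mapping["ticker_col"]], float(row[mapping["qty_col"]]))
            for row in rows
        ]
    except (KeyError, ValueError) as exc:
        conn.rollback()  # discard anything the failing request had pending
        conn.execute(
            "DELETE FROM csv_format_templates WHERE fingerprint = ?",
            (fingerprint,),
        )
        conn.commit()  # dedicated commit, so the eviction outlives the 400
        raise LLMParseError("cached format no longer matches this file") from exc
```

No LLM call happens inside the failure branch: the user sees a 400, and only the next upload of that format triggers a fresh extraction.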

Testing

tests/test_llm_csv_parser.py:

Fingerprint stability — case/whitespace/BOM variants of the same headers hash to the same fingerprint.
Cache hit path — pre-populate a CsvFormatTemplate row, mock call_llm to fail loudly, assert it is NOT called, assert positions come out correct.
Cache miss path — mock call_llm to return a valid mapping JSON, assert a row is inserted, assert positions come out correct, assert the synthesised sample_dummy contains placeholder strings only.
LLM returns malformed JSON — raises LLMParseError, no template stored.
LLM maps to non-existent column — raises LLMParseError, no template stored.
LLM maps qty to a non-numeric column — raises LLMParseError on validation.
Stale template self-heal — pre-populate a template that no longer matches the file, simulate row-parse failure, assert the row is deleted and a 400 returned.
Integration — POST a fabricated IBKR-shaped fixture to /api/portfolio/parse, assert ParsedPie round-trips, assert no second LLM call on a repeat upload.

Existing tests/test_csv_import.py must still pass — the T212 happy path is unchanged.

Verification

End-to-end manual check after deploy:

Upload a T212 fixture → existing path stays unchanged (same dashboard load behaviour).
Upload a fabricated IBKR CSV → first upload calls LLM, returns positions, template row created in DB.
Re-upload the same IBKR CSV → second call has zero LLM cost (verify by counting ai_calls rows before/after), use_count increments to 2.
Inspect csv_format_templates row: confirm headers matches the upload's headers, sample_dummy contains placeholder strings, no real holdings anywhere.
Upload random garbage (e.g. a screenshot renamed .csv) → 400 with clean error, no template stored, AICall row logged.
Free-tier account attempts import → 402 (paid gating).

Open questions for the implementation plan

Whether to read sample rows with csv.reader and re-encode them as text for the LLM (safer for embedded commas/quotes), or pass the raw first-N-lines verbatim. Default: the safer reader path.
Whether to cap LLM-parsed portfolios at the same 1 MB limit as T212 (yes) and whether to add a separate cap on number-of-rows fed to the LLM as samples (yes, 5).
Whether to log the fingerprint to the request log on cache hit/miss for operability. Default: yes, at INFO level, with event_type="csv.format.cache_hit" / "csv.format.cache_miss".
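For the first two questions, the "safer reader path" with a 5-row sample cap could look roughly like this. The function name and signature are hypothetical, and re-encoding the samples as comma-separated text regardless of the source delimiter is an assumption.

```python
import csv
import io
import itertools

MAX_SAMPLE_ROWS = 5  # cap on rows shown to the LLM


def header_and_samples(
    text: str,
    delimiter: str = ",",
    preamble_rows: int = 0,
) -> tuple[list[str], str]:
    """Return the header row and up to MAX_SAMPLE_ROWS re-encoded sample rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = list(itertools.islice(reader, preamble_rows, None))
    headers = rows[0]
    samples = [row for row in rows[1:] if row][:MAX_SAMPLE_ROWS]  # skip blanks
    buffer = io.StringIO()
    # csv.writer re-quotes cells containing commas or quotes, so the LLM sees
    # unambiguous rows even when the source used another delimiter.
    csv.writer(buffer, lineterminator="\n").writerows(samples)
    return headers, buffer.getvalue()
```

Compared with passing the first N raw lines verbatim, this keeps a company name such as "Berkshire Hathaway, Inc." in a single quoted cell instead of letting it shift every column to its right.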
