Transparent fallback after parse_t212_csv: LLM extracts a column-mapping (not the data), result is cached globally by header fingerprint, replay is deterministic Python. Stored dummy contains headers + synthetic row only — no user holdings ever persisted.
13 KiB
LLM-fallback CSV parser — Design Spec
Date: 2026-05-27 Status: Draft — pending implementation plan
Context
Today the only supported broker import is Trading 212. parse_t212_csv expects
T212's exact column set (Slice, Owned quantity, etc.) and raises
CSVImportError on anything else. Every non-T212 user hits a wall at
onboarding.
Rather than write a hand-rolled parser per broker (IBKR, Vanguard, Fidelity, Schwab, eToro, Degiro, …) — and chase format drift forever — we use an LLM as a transparent fallback. The LLM never sees holdings as data; it only looks at headers plus a handful of sample rows and returns a JSON column-mapping. Our existing Python code does the row iteration.
The first time a broker format appears, the LLM produces a mapping. We
fingerprint the format (sha256 of normalized headers) and cache the mapping
in a new csv_format_templates table. Every subsequent upload of the same
format — by any user — replays the cached mapping deterministically, with no
LLM call.
The cache row stores headers and a synthetic placeholder ("dummy") row, never real user data. The mapping itself is a column-name dictionary, also free of holdings.
Portfolio import is already advertised as a paid-only feature; we make that explicit at the route level as part of this work.
Goals
- Accept CSV exports from any broker, not just T212.
- Pay the LLM cost only once per format, not once per user.
- Never persist user holdings on the server (already a system-wide invariant).
- Surface the same response shape to the browser regardless of which parser branch ran — no client changes beyond a copy tweak.
Non-goals
- Per-broker UI customisation. The drop-zone stays generic.
- A human admin queue for reviewing LLM-discovered formats. Operator can inspect rows directly in the DB if curious.
- Promoting "trusted" formats to native parsers. That's a future evolution if a single broker dominates LLM-parsed traffic.
- Multi-stage / verification LLM passes. One call per first-time format.
Architecture
POST /api/portfolio/parse (paid-only)
├─ parse_t212_csv(raw) ── happy path, unchanged
│ └─ CSVImportError ↴
│
├─ parse_with_llm(raw, session)
│ ├─ detect delimiter + preamble offset
│ ├─ fingerprint = sha256(normalised headers)
│ ├─ SELECT csv_format_templates WHERE fingerprint=?
│ │ ├─ HIT → apply mapping (bump use_count/last_used_at after successful parse)
│ │ └─ MISS → openrouter.call_llm(headers + 3-5 sample rows)
│ │ → validate mapping
│ │ → INSERT csv_format_templates
│ │ → apply mapping
│ └─ returns ParsedPie (same shape as T212 path)
│
└─ resolve_slice → upsert_tickers → inline Yahoo fetch → JSON response
(existing pipeline, unchanged)
Why column-mapping, not full extraction
We pass the LLM only headers plus 3–5 sample rows, not the full CSV. The LLM returns column names, not transcribed numbers. Three benefits:
- Safety — LLMs hallucinate digits; they don't hallucinate column names that aren't there. Mapping validation can verify every named column exists in the actual header row.
- Cost — prompt is ~1 KB regardless of portfolio size.
- Cacheability — the mapping IS the cache. Replay is deterministic Python, no LLM in the loop on re-imports.
Why global cache, not per-user
The column structure of an IBKR Activity Statement is a property of IBKR, not
of any individual user. The stored "dummy" contains no PII (column headers
are public; the sample row is synthetic). So global cache is strictly better:
faster onboarding for the second IBKR user, and the operator still gets a
first_seen_user_id audit column for forensic traceability.
Data model
New table csv_format_templates:
| Column | Type | Notes |
|---|---|---|
id |
int PK | |
fingerprint |
VARCHAR(64) UNIQUE NOT NULL |
sha256 hex of normalised header tuple |
headers |
JSON | List of strings — actual header row, no PII |
sample_dummy |
JSON | One synthetic placeholder row for human eyeball |
mapping |
JSON | {ticker_col, qty_col, name_col, cost_col, currency_col} |
preamble_rows |
INT NOT NULL DEFAULT 0 | Non-data lines before the header row |
delimiter |
CHAR(1) NOT NULL DEFAULT ',' | |
broker_label |
VARCHAR(128) | LLM-identified label, e.g. "Interactive Brokers Activity Statement" |
first_seen_user_id |
INT NULL, FK users(id) ON DELETE SET NULL | Audit only |
first_seen_at |
DATETIME(tz) NOT NULL | |
use_count |
INT NOT NULL DEFAULT 1 | Bumped on cache hit |
last_used_at |
DATETIME(tz) NOT NULL | |
llm_model |
VARCHAR(64) | Provenance of the initial extraction |
llm_cost_usd |
FLOAT | Same |
Migration: alembic/versions/0021_csv_format_template.py (based on 0020).
No raw CSV bytes are ever stored. headers and sample_dummy are the only
payloads. sample_dummy is synthesised post-extraction by replacing column
values with placeholder strings ("TICKER", "100", "1.50") keyed to the
mapping — the operator can eyeball the format shape without seeing any real
holdings.
Components
app/services/llm_csv_parser.py — new
Public surface:
async def parse_with_llm(
raw: bytes,
session: AsyncSession,
) -> ParsedPie:
"""LLM-fallback CSV parser.
Decodes raw bytes, detects delimiter and preamble offset, fingerprints
the header row, hits the csv_format_templates cache. On miss, calls
openrouter.call_llm with headers + 3-5 sample rows to extract a
column-mapping, validates it, persists a new template, and applies the
mapping. Returns the same ParsedPie shape as parse_t212_csv.
"""
class LLMParseError(ValueError):
"""Raised when the LLM call fails or returns an unusable mapping."""
Internal helpers (not exported):
_detect_dialect(raw: bytes) -> tuple[str, int]— returns(delimiter, preamble_rows). Uses Python'scsv.Snifferfor delimiter, then walks rows until the first row whose tokens look like column headers (heuristic: all-strings, none parse as numbers)._fingerprint(headers: list[str]) -> str— lowercases, strips whitespace, joins with|, returns sha256 hex._extract_mapping_via_llm(client, headers, samples) -> dict— builds the system prompt, callsopenrouter.call_llm, parses the JSON envelope, raisesLLMParseErroron malformed output._validate_mapping(mapping, headers, first_row) -> None— every named column must exist inheaders;qty_col's value onfirst_rowmust parse as a positive number;cost_col(if present) must parse as a number. RaisesLLMParseErroron failure._apply_mapping(rows, mapping) -> ParsedPie— iterates remaining rows, buildsParsedPositioninstances, computes totals fromqty * avg_costwhen explicit totals aren't present._synthesise_dummy(headers, mapping) -> dict— produces the placeholder row forsample_dummy.
Reuses without modification:
app/services/openrouter.py::call_llm— provider fallback chain + AICall ledger loggingapp/services/csv_import.py::ParsedPie, ParsedPosition, CSVImportError— same return type, same error hierarchy.LLMParseErrorinherits fromCSVImportErrorso the route can catch both as one.
app/routers/universe.py::parse_portfolio — modified
Two small changes:
- Add
Depends(require_paid)to the route decorator. (Portfolio import has always been advertised as paid; this aligns the implementation.) - Wrap the existing
parse_t212_csvcall in a try/except that falls through toparse_with_llmonCSVImportError:
try:
pie = parse_t212_csv(raw)
except CSVImportError:
from app.services.llm_csv_parser import parse_with_llm, LLMParseError
try:
pie = await parse_with_llm(raw, session)
except LLMParseError as e:
raise HTTPException(status_code=400, detail=str(e))
Everything below this point in the function — resolve_slice loop, upsert_tickers, inline Yahoo fetch, response build — is unchanged. pie has the same shape regardless of branch.
app/models.py — new model
CsvFormatTemplate declared alongside the other tables. Columns as in the data model table above.
app/templates/settings.html — copy tweak
- Section heading: "Import portfolio (Trading 212 CSV)" → "Import portfolio (CSV)"
- Drop-zone label: "Drop a T212 pie CSV here" → "Drop your broker's portfolio CSV here"
- Drop-zone hint: append " · T212, IBKR, and others auto-detected" after the size limit
- The "Export your pie from T212" instructions paragraph stays as a help link — T212 is still the best-documented happy path — but its phrasing softens to "If you use Trading 212…"
LLM prompt shape
System prompt fixes the schema. User message contains headers + samples.
SYSTEM: You are an expert at recognising broker portfolio CSV formats.
You will be given the header row and 3-5 sample data rows from a CSV.
Identify which column contains each field. Return ONLY JSON, no prose.
Schema:
{
"ticker_col": "<header name or null>",
"qty_col": "<header name or null>",
"name_col": "<header name or null>",
"cost_col": "<header name or null>", // average price per share or unit cost
"currency_col": "<header name or null>",
"broker_label": "<short identifier like 'IBKR Activity Statement' or null>"
}
Rules:
- Use null when no column is a good match.
- ticker_col and qty_col are required; if either is missing return all nulls.
- Use the EXACT header string as it appears in the input.
USER: headers: ["Symbol","Position","Avg Price","Currency"]
samples:
AAPL,100,150.00,USD
MSFT,50,300.00,USD
...
The LLM never sees the entire file; it sees only the first ~5 data rows. Token cost is bounded and uniform regardless of portfolio size.
Error handling
| Failure | Response | Ledger |
|---|---|---|
| LLM provider down | 502 "couldn't parse — try again later" | AICall status=failed |
| LLM returns non-JSON | 400 "couldn't recognise as portfolio CSV" | AICall status=ok, no template stored |
| Mapping missing required columns (ticker/qty) | 400 same | AICall status=ok, no template stored |
| Mapping references non-existent column | 400 same | AICall status=ok, no template stored |
| Mapping validates but row parse fails on numerics | 400 same | template NOT stored |
| Cache hit but row parse fails (format drifted under us) | 400 + evict the stale template in its own commit before raising | — |
The "delete stale template on parse failure" rule is the only self-healing behaviour: if a broker quietly changes their export shape, the next failed re-import evicts the old mapping in a dedicated commit (so the eviction survives the request-failure rollback) and the LLM gets another shot on the subsequent upload. Without this, a once-good template would haunt the cache forever. We do not auto-retry the LLM in the same request — too much hidden cost on a single user action.
Testing
tests/test_llm_csv_parser.py:
- Fingerprint stability — case/whitespace/BOM variants of the same headers hash to the same fingerprint.
- Cache hit path — pre-populate a
CsvFormatTemplaterow, mockcall_llmto fail loudly, assert it is NOT called, assert positions come out correct. - Cache miss path — mock
call_llmto return a valid mapping JSON, assert a row is inserted, assert positions come out correct, assert the synthesisedsample_dummycontains placeholder strings only. - LLM returns malformed JSON — raises
LLMParseError, no template stored. - LLM maps to non-existent column — raises
LLMParseError, no template stored. - LLM maps qty to a non-numeric column — raises
LLMParseErroron validation. - Stale template self-heal — pre-populate a template that no longer matches the file, simulate row-parse failure, assert the row is deleted and a 400 returned.
- Integration — POST a fabricated IBKR-shaped fixture to
/api/portfolio/parse, assert ParsedPie round-trips, assert no second LLM call on a repeat upload.
Existing tests/test_csv_import.py must still pass — the T212 happy path is unchanged.
Verification
End-to-end manual check after deploy:
- Upload a T212 fixture → exists path stays unchanged (same dashboard load behaviour).
- Upload a fabricated IBKR CSV → first upload calls LLM, returns positions, template row created in DB.
- Re-upload the same IBKR CSV → second call has zero LLM cost (verify by counting
ai_callsrows before/after),use_countincrements to 2. - Inspect
csv_format_templatesrow: confirmheadersmatches the upload's headers,sample_dummycontains placeholder strings, no real holdings anywhere. - Upload random garbage (e.g. a screenshot renamed
.csv) → 400 with clean error, no template stored, AICall row logged. - Free-tier account attempts import → 402 (paid gating).
Open questions for the implementation plan
- Whether to read sample rows with
csv.readerand re-encode them as text for the LLM (safer for embedded commas/quotes), or pass the raw first-N-lines verbatim. Default: the safer reader path. - Whether to cap LLM-parsed portfolios at the same 1 MB limit as T212 (yes) and whether to add a separate cap on number-of-rows fed to the LLM as samples (yes, 5).
- Whether to log the fingerprint to the request log on cache hit/miss for operability. Default: yes, at INFO level, with
event_type="csv.format.cache_hit"/"csv.format.cache_miss".