Giorgio Gilestro 0254515989 docs: refine LLM-CSV spec — keep real sample row, drop user attribution

- Drop first_seen_user_id; sample is anonymous by construction
- Rename sample_dummy → sample_row, store the upload's first real data
  row verbatim (one row, no totals, no other positions, no link to a
  user). Narrow, deliberate exception to the "no holdings persisted"
  invariant — gives the operator material for hand-writing future
  native parsers.
- Drop the cache self-heal behaviour; operator owns eviction. Reinforce
  the non-goal of auto-promoting learned formats to code.

2026-05-27 11:21:43 +02:00

14 KiB

Raw Permalink Blame History

LLM-fallback CSV parser — Design Spec

Date: 2026-05-27 Status: Draft — pending implementation plan

Context

Today the only supported broker import is Trading 212. parse_t212_csv expects T212's exact column set (Slice, Owned quantity, etc.) and raises CSVImportError on anything else. Every non-T212 user hits a wall at onboarding.

Rather than write a hand-rolled parser per broker (IBKR, Vanguard, Fidelity, Schwab, eToro, Degiro, …) — and chase format drift forever — we use an LLM as a transparent fallback. The LLM never sees holdings as data; it only looks at headers plus a handful of sample rows and returns a JSON column-mapping. Our existing Python code does the row iteration.

The first time a broker format appears, the LLM produces a mapping. We fingerprint the format (sha256 of normalized headers) and cache the mapping in a new csv_format_templates table. Every subsequent upload of the same format — by any user — replays the cached mapping deterministically, with no LLM call.

The cache row stores the header row and a single anonymous sample data row (the first row from the originating upload, verbatim). No user identifier is recorded — the row is not linked back to whoever uploaded it. The purpose of the sample is to give the operator material to look at when designing future native parsers; this collection is passive learning only, the system never attempts to author or modify parser code automatically.

Portfolio import is already advertised as a paid-only feature; we make that explicit at the route level as part of this work.

Goals

Accept CSV exports from any broker, not just T212.
Pay the LLM cost only once per format, not once per user.
Never persist user holdings on the server (already a system-wide invariant).
Surface the same response shape to the browser regardless of which parser branch ran — no client changes beyond a copy tweak.

Non-goals

Per-broker UI customisation. The drop-zone stays generic.
A human admin queue for reviewing LLM-discovered formats. Operator can inspect rows directly in the DB if curious.
Auto-promoting learned formats to native parsers. The operator will hand-write any native parser by looking at the collected sample rows. The system never writes or modifies code.
Self-healing or auto-evicting stale cache entries. If a broker silently changes their export shape under us, the cached mapping will start producing parse errors; the operator deletes the row manually. We do not invalidate cache entries automatically.
Multi-stage / verification LLM passes. One call per first-time format.

Architecture

POST /api/portfolio/parse  (paid-only)
├─ parse_t212_csv(raw)             ── happy path, unchanged
│   └─ CSVImportError ↴
│
├─ parse_with_llm(raw, session)
│   ├─ detect delimiter + preamble offset
│   ├─ fingerprint = sha256(normalised headers)
│   ├─ SELECT csv_format_templates WHERE fingerprint=?
│   │   ├─ HIT  → apply mapping (bump use_count/last_used_at after successful parse)
│   │   └─ MISS → openrouter.call_llm(headers + 3-5 sample rows)
│   │             → validate mapping
│   │             → INSERT csv_format_templates
│   │             → apply mapping
│   └─ returns ParsedPie  (same shape as T212 path)
│
└─ resolve_slice → upsert_tickers → inline Yahoo fetch → JSON response
   (existing pipeline, unchanged)

Why column-mapping, not full extraction

We pass the LLM only headers plus 3–5 sample rows, not the full CSV. The LLM returns column names, not transcribed numbers. Three benefits:

Safety — LLMs hallucinate digits; they don't hallucinate column names that aren't there. Mapping validation can verify every named column exists in the actual header row.
Cost — prompt is ~1 KB regardless of portfolio size.
Cacheability — the mapping IS the cache. Replay is deterministic Python, no LLM in the loop on re-imports.

Why global cache, not per-user

The column structure of an IBKR Activity Statement is a property of IBKR, not of any individual user. The cache row contains no user identifier — the sample data row is stored verbatim but anonymously, with nothing linking it to the uploader. Global cache is strictly better: faster onboarding for the second IBKR user, and the collected samples form a small, useful corpus for hand-writing native parsers later.

Data model

New table csv_format_templates:

Column	Type	Notes
`id`	int PK
`fingerprint`	`VARCHAR(64) UNIQUE NOT NULL`	sha256 hex of normalised header tuple
`headers`	JSON	List of strings — actual header row from the upload
`sample_row`	JSON	First data row from the originating upload, verbatim. Not linked to any user.
`mapping`	JSON	`{ticker_col, qty_col, name_col, cost_col, currency_col}`
`preamble_rows`	INT NOT NULL DEFAULT 0	Non-data lines before the header row
`delimiter`	CHAR(1) NOT NULL DEFAULT ','
`broker_label`	VARCHAR(128)	LLM-identified label, e.g. "Interactive Brokers Activity Statement"
`first_seen_at`	DATETIME(tz) NOT NULL	When the format was first cached
`use_count`	INT NOT NULL DEFAULT 1	Bumped on each successful cache hit
`last_used_at`	DATETIME(tz) NOT NULL
`llm_model`	VARCHAR(64)	Provenance of the initial extraction
`llm_cost_usd`	FLOAT	Same

Migration: alembic/versions/0021_csv_format_template.py (based on 0020).

The full uploaded CSV is not stored — only the header row plus a single data row (sample_row). No user_id column exists on this table; the sample is anonymous by construction. This is a deliberate, narrow exception to the otherwise-strict "no holdings persisted" invariant: we keep one row per format so the operator has concrete material to look at when hand-writing a future native parser. One anonymous row carries no portfolio context (no totals, no other positions) and cannot be linked back to an account.

Components

`app/services/llm_csv_parser.py` — new

Public surface:

async def parse_with_llm(
    raw: bytes,
    session: AsyncSession,
) -> ParsedPie:
    """LLM-fallback CSV parser.

    Decodes raw bytes, detects delimiter and preamble offset, fingerprints
    the header row, hits the csv_format_templates cache. On miss, calls
    openrouter.call_llm with headers + 3-5 sample rows to extract a
    column-mapping, validates it, persists a new template, and applies the
    mapping. Returns the same ParsedPie shape as parse_t212_csv.
    """

class LLMParseError(ValueError):
    """Raised when the LLM call fails or returns an unusable mapping."""

Internal helpers (not exported):

_detect_dialect(raw: bytes) -> tuple[str, int] — returns (delimiter, preamble_rows). Uses Python's csv.Sniffer for delimiter, then walks rows until the first row whose tokens look like column headers (heuristic: all-strings, none parse as numbers).
_fingerprint(headers: list[str]) -> str — lowercases, strips whitespace, joins with |, returns sha256 hex.
_extract_mapping_via_llm(client, headers, samples) -> dict — builds the system prompt, calls openrouter.call_llm, parses the JSON envelope, raises LLMParseError on malformed output.
_validate_mapping(mapping, headers, first_row) -> None — every named column must exist in headers; qty_col's value on first_row must parse as a positive number; cost_col (if present) must parse as a number. Raises LLMParseError on failure.
_apply_mapping(rows, mapping) -> ParsedPie — iterates remaining rows, builds ParsedPosition instances, computes totals from qty * avg_cost when explicit totals aren't present.

Reuses without modification:

app/services/openrouter.py::call_llm — provider fallback chain + AICall ledger logging
app/services/csv_import.py::ParsedPie, ParsedPosition, CSVImportError — same return type, same error hierarchy. LLMParseError inherits from CSVImportError so the route can catch both as one.

`app/routers/universe.py::parse_portfolio` — modified

Two small changes:

Add Depends(require_paid) to the route decorator. (Portfolio import has always been advertised as paid; this aligns the implementation.)
Wrap the existing parse_t212_csv call in a try/except that falls through to parse_with_llm on CSVImportError:

try:
    pie = parse_t212_csv(raw)
except CSVImportError:
    from app.services.llm_csv_parser import parse_with_llm, LLMParseError
    try:
        pie = await parse_with_llm(raw, session)
    except LLMParseError as e:
        raise HTTPException(status_code=400, detail=str(e))

Everything below this point in the function — resolve_slice loop, upsert_tickers, inline Yahoo fetch, response build — is unchanged. pie has the same shape regardless of branch.

`app/models.py` — new model

CsvFormatTemplate declared alongside the other tables. Columns as in the data model table above.

`app/templates/settings.html` — copy tweak

Section heading: "Import portfolio (Trading 212 CSV)" → "Import portfolio (CSV)"
Drop-zone label: "Drop a T212 pie CSV here" → "Drop your broker's portfolio CSV here"
Drop-zone hint: append " · T212, IBKR, and others auto-detected" after the size limit
The "Export your pie from T212" instructions paragraph stays as a help link — T212 is still the best-documented happy path — but its phrasing softens to "If you use Trading 212…"

LLM prompt shape

System prompt fixes the schema. User message contains headers + samples.

SYSTEM: You are an expert at recognising broker portfolio CSV formats.
You will be given the header row and 3-5 sample data rows from a CSV.
Identify which column contains each field. Return ONLY JSON, no prose.

Schema:
{
  "ticker_col": "<header name or null>",
  "qty_col":    "<header name or null>",
  "name_col":   "<header name or null>",
  "cost_col":   "<header name or null>",    // average price per share or unit cost
  "currency_col": "<header name or null>",
  "broker_label": "<short identifier like 'IBKR Activity Statement' or null>"
}

Rules:
- Use null when no column is a good match.
- ticker_col and qty_col are required; if either is missing return all nulls.
- Use the EXACT header string as it appears in the input.

USER: headers: ["Symbol","Position","Avg Price","Currency"]
samples:
  AAPL,100,150.00,USD
  MSFT,50,300.00,USD
  ...

The LLM never sees the entire file; it sees only the first ~5 data rows. Token cost is bounded and uniform regardless of portfolio size.

Error handling

Failure	Response	Ledger
LLM provider down	502 "couldn't parse — try again later"	AICall status=failed
LLM returns non-JSON	400 "couldn't recognise as portfolio CSV"	AICall status=ok, no template stored
Mapping missing required columns (ticker/qty)	400 same	AICall status=ok, no template stored
Mapping references non-existent column	400 same	AICall status=ok, no template stored
Mapping validates but row parse fails on numerics	400 same	template NOT stored
Cache hit but row parse fails (format drifted under us)	400 with parse error	—

If a broker quietly changes their CSV shape such that a previously-good cached mapping starts producing parse failures, the user sees an error and the operator deletes the offending csv_format_templates row by hand. No automatic eviction, no automatic retry. The cache is a learning store, not a self-managing system.

Testing

tests/test_llm_csv_parser.py:

Fingerprint stability — case/whitespace/BOM variants of the same headers hash to the same fingerprint.
Cache hit path — pre-populate a CsvFormatTemplate row, mock call_llm to fail loudly, assert it is NOT called, assert positions come out correct, assert use_count is incremented.
Cache miss path — mock call_llm to return a valid mapping JSON, assert a row is inserted with the upload's actual first data row as sample_row and no user_id anywhere, assert positions come out correct.
LLM returns malformed JSON — raises LLMParseError, no template stored.
LLM maps to non-existent column — raises LLMParseError, no template stored.
LLM maps qty to a non-numeric column — raises LLMParseError on validation.
Stale cached mapping on parse failure — pre-populate a template whose mapping no longer matches the file content, assert a 400 is returned and the template is NOT deleted automatically (operator owns eviction).
Integration — POST a fabricated IBKR-shaped fixture to /api/portfolio/parse, assert ParsedPie round-trips, assert no second LLM call on a repeat upload.

Existing tests/test_csv_import.py must still pass — the T212 happy path is unchanged.

Verification

End-to-end manual check after deploy:

Upload a T212 fixture → exists path stays unchanged (same dashboard load behaviour).
Upload a fabricated IBKR CSV → first upload calls LLM, returns positions, template row created in DB.
Re-upload the same IBKR CSV → second call has zero LLM cost (verify by counting ai_calls rows before/after), use_count increments to 2.
Inspect csv_format_templates row: confirm headers matches the upload's headers, sample_row is the first real data row, no user_id column exists on the table.
Upload random garbage (e.g. a screenshot renamed .csv) → 400 with clean error, no template stored, AICall row logged.
Free-tier account attempts import → 402 (paid gating).

Open questions for the implementation plan

Whether to read sample rows with csv.reader and re-encode them as text for the LLM (safer for embedded commas/quotes), or pass the raw first-N-lines verbatim. Default: the safer reader path.
Whether to cap LLM-parsed portfolios at the same 1 MB limit as T212 (yes) and whether to add a separate cap on number-of-rows fed to the LLM as samples (yes, 5).
Whether to log the fingerprint to the request log on cache hit/miss for operability. Default: yes, at INFO level, with event_type="csv.format.cache_hit" / "csv.format.cache_miss".

14 KiB Raw Permalink Blame History Unescape Escape