docs: spec for LLM-fallback CSV parser
Transparent fallback after parse_t212_csv: LLM extracts a column-mapping (not the data), result is cached globally by header fingerprint, replay is deterministic Python. Stored dummy contains headers + synthetic row only — no user holdings ever persisted.
parent 1be0c5a436
commit 263ecc0d3b
1 changed file with 272 additions and 0 deletions

@@ -0,0 +1,272 @@

# LLM-fallback CSV parser — Design Spec

**Date:** 2026-05-27
**Status:** Draft — pending implementation plan

## Context

Today the only supported broker import is Trading 212. `parse_t212_csv` expects
T212's exact column set (`Slice`, `Owned quantity`, etc.) and raises
`CSVImportError` on anything else. Every non-T212 user hits a wall at
onboarding.

Rather than write a hand-rolled parser per broker (IBKR, Vanguard, Fidelity,
Schwab, eToro, Degiro, …) and chase format drift forever, we use an LLM as
a transparent fallback. The LLM is never asked to transcribe holdings; it only
looks at **headers plus a handful of sample rows** and returns a JSON
column-mapping. Our existing Python code does the row iteration.

The first time a broker format appears, the LLM produces a mapping. We
fingerprint the format (sha256 of normalised headers) and cache the mapping
in a new `csv_format_templates` table. Every subsequent upload of the same
format, by any user, replays the cached mapping deterministically, with no
LLM call.

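As a minimal sketch of the fingerprinting step, assuming the normalisation is exactly "strip a BOM, trim whitespace, lowercase, join with `|`" (the BOM handling is inferred from the fingerprint-stability test case, and the function name is illustrative rather than the final helper):

```python
import hashlib


def fingerprint(headers: list[str]) -> str:
    """Return the sha256 hex digest of a normalised header row."""
    # str.strip() does not remove U+FEFF, so the BOM is dropped explicitly.
    normalised = [h.replace("\ufeff", "").strip().lower() for h in headers]
    return hashlib.sha256("|".join(normalised).encode("utf-8")).hexdigest()
```

Two uploads whose headers differ only in case, padding, or a leading BOM then resolve to the same cache row, while any renamed or reordered column produces a new fingerprint and therefore a cache miss.
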
The cache row stores headers and a synthetic placeholder ("dummy") row, never
real user data. The mapping itself is a column-name dictionary, also free of
holdings.

Portfolio import is already advertised as a paid-only feature; we make that
explicit at the route level as part of this work.

## Goals

- Accept CSV exports from any broker, not just T212.
- Pay the LLM cost only once per **format**, not once per user.
- Never persist user holdings on the server (already a system-wide invariant).
- Surface the same response shape to the browser regardless of which parser
  branch ran — no client changes beyond a copy tweak.

## Non-goals

- Per-broker UI customisation. The drop-zone stays generic.
- A human admin queue for reviewing LLM-discovered formats. Operator can
  inspect rows directly in the DB if curious.
- Promoting "trusted" formats to native parsers. That's a future evolution
  if a single broker dominates LLM-parsed traffic.
- Multi-stage / verification LLM passes. One call per first-time format.

## Architecture

```
POST /api/portfolio/parse   (paid-only)
  ├─ parse_t212_csv(raw)            ── happy path, unchanged
  │     └─ CSVImportError ↴
  │
  ├─ parse_with_llm(raw, session)
  │     ├─ detect delimiter + preamble offset
  │     ├─ fingerprint = sha256(normalised headers)
  │     ├─ SELECT csv_format_templates WHERE fingerprint=?
  │     │     ├─ HIT  → apply mapping (bump use_count/last_used_at after successful parse)
  │     │     └─ MISS → openrouter.call_llm(headers + 3-5 sample rows)
  │     │              → validate mapping
  │     │              → INSERT csv_format_templates
  │     │              → apply mapping
  │     └─ returns ParsedPie (same shape as T212 path)
  │
  └─ resolve_slice → upsert_tickers → inline Yahoo fetch → JSON response
        (existing pipeline, unchanged)
```

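The "apply mapping" step shared by the HIT and MISS branches could look roughly like the sketch below. `Position` is a hypothetical stand-in for `ParsedPosition`, whose real fields are not listed in this spec, and rows are assumed to arrive as dictionaries keyed by header name (for example from `csv.DictReader`).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Position:
    """Hypothetical stand-in for ParsedPosition; the real fields may differ."""
    ticker: str
    quantity: float
    avg_cost: Optional[float]


def apply_mapping(rows: list[dict], mapping: dict) -> list[Position]:
    """Turn raw CSV rows into positions using only the cached column names."""
    cost_col = mapping.get("cost_col")
    positions = []
    for row in rows:
        ticker = row[mapping["ticker_col"]].strip()
        if not ticker:
            continue  # skip blank or footer rows
        quantity = float(row[mapping["qty_col"]])
        # Cost is optional: absent column or empty cell both yield None.
        cost = float(row[cost_col]) if cost_col and row.get(cost_col) else None
        positions.append(Position(ticker, quantity, cost))
    return positions
```

No LLM is involved here: given the same mapping and the same rows, the result is always the same, which is what makes the cached replay deterministic. Locale-specific number formats (thousands separators, decimal commas) are not handled in this sketch.
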
### Why column-mapping, not full extraction

We pass the LLM only **headers plus 3–5 sample rows**, not the full CSV. The
LLM returns column names, not transcribed numbers. Three benefits:

1. **Safety** — LLMs can hallucinate digits when transcribing numbers, and
   such errors are hard to detect. A hallucinated column name is trivial to
   catch: mapping validation verifies that every named column exists in the
   actual header row.
2. **Cost** — prompt is ~1 KB regardless of portfolio size.
3. **Cacheability** — the mapping IS the cache. Replay is deterministic Python,
   no LLM in the loop on re-imports.

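The validation described above can be sketched as follows. This is an illustration built from the rules stated for `_validate_mapping` (named columns must exist, quantity must be a positive number, cost must be numeric if mapped); `MappingError` is a placeholder name for `LLMParseError`.

```python
class MappingError(ValueError):
    """Placeholder for LLMParseError in this sketch."""


COLUMN_KEYS = ("ticker_col", "qty_col", "name_col", "cost_col", "currency_col")


def validate_mapping(mapping: dict, headers: list[str], first_row: list[str]) -> None:
    """Raise MappingError unless the mapping is usable against this file."""
    for key in ("ticker_col", "qty_col"):
        if not mapping.get(key):
            raise MappingError(f"{key} is required")
    for key in COLUMN_KEYS:
        column = mapping.get(key)
        if column is not None and column not in headers:
            raise MappingError(f"{key} names a column that is not in the header row")
    row = dict(zip(headers, first_row))
    try:
        quantity = float(row[mapping["qty_col"]])
    except (KeyError, ValueError) as exc:
        raise MappingError("quantity column is not numeric") from exc
    if quantity <= 0:
        raise MappingError("quantity must be positive")
    cost_col = mapping.get("cost_col")
    if cost_col is not None:
        try:
            float(row[cost_col])
        except (KeyError, ValueError) as exc:
            raise MappingError("cost column is not numeric") from exc
```

Every check is a cheap lookup or a single `float()` call, so a bad mapping is rejected before any template is stored.
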
### Why global cache, not per-user

The column structure of an IBKR Activity Statement is a property of IBKR, not
of any individual user. The stored "dummy" contains no PII (column headers
are public; the sample row is synthetic). So a global cache is strictly better:
faster onboarding for the second IBKR user, and the operator still gets a
`first_seen_user_id` audit column for forensic traceability.

## Data model

New table `csv_format_templates`:

| Column | Type | Notes |
|---|---|---|
| `id` | int PK | |
| `fingerprint` | `VARCHAR(64) UNIQUE NOT NULL` | sha256 hex of normalised header tuple |
| `headers` | JSON | List of strings — actual header row, no PII |
| `sample_dummy` | JSON | One synthetic placeholder row for human eyeball |
| `mapping` | JSON | `{ticker_col, qty_col, name_col, cost_col, currency_col}` |
| `preamble_rows` | INT NOT NULL DEFAULT 0 | Non-data lines before the header row |
| `delimiter` | CHAR(1) NOT NULL DEFAULT ',' | |
| `broker_label` | VARCHAR(128) | LLM-identified label, e.g. "Interactive Brokers Activity Statement" |
| `first_seen_user_id` | INT NULL, FK users(id) ON DELETE SET NULL | Audit only |
| `first_seen_at` | DATETIME(tz) NOT NULL | |
| `use_count` | INT NOT NULL DEFAULT 1 | Bumped on cache hit |
| `last_used_at` | DATETIME(tz) NOT NULL | |
| `llm_model` | VARCHAR(64) | Provenance of the initial extraction |
| `llm_cost_usd` | FLOAT | Same |

Migration: `alembic/versions/0021_csv_format_template.py` (based on `0020`).

No raw CSV bytes are ever stored. `headers` and `sample_dummy` are the only
payloads. `sample_dummy` is synthesised post-extraction by replacing column
values with placeholder strings (`"TICKER"`, `"100"`, `"1.50"`) keyed to the
mapping — the operator can eyeball the format shape without seeing any real
holdings.

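One way the placeholder row might be produced is sketched below. The `"TICKER"`, `"100"` and `"1.50"` placeholders come from the paragraph above; the `"NAME"` and `"CUR"` placeholders and the empty string for unmapped columns are assumptions made for the example.

```python
# Placeholder per mapped role. "NAME" and "CUR" are illustrative choices.
PLACEHOLDERS = {
    "ticker_col": "TICKER",
    "qty_col": "100",
    "cost_col": "1.50",
    "name_col": "NAME",
    "currency_col": "CUR",
}


def synthesise_dummy(headers: list[str], mapping: dict) -> dict[str, str]:
    """Build a synthetic row: one placeholder per mapped column, '' elsewhere."""
    role_by_column = {
        column: role
        for role, column in mapping.items()
        if role in PLACEHOLDERS and column in headers
    }
    return {
        header: PLACEHOLDERS[role_by_column[header]] if header in role_by_column else ""
        for header in headers
    }
```

Because the function never reads a data row, nothing from the uploaded file other than the header names can end up in `sample_dummy`.
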
## Components

### `app/services/llm_csv_parser.py` — new

Public surface:

```python
from sqlalchemy.ext.asyncio import AsyncSession  # assumed source of AsyncSession

from app.services.csv_import import CSVImportError, ParsedPie


async def parse_with_llm(
    raw: bytes,
    session: AsyncSession,
) -> ParsedPie:
    """LLM-fallback CSV parser.

    Decodes raw bytes, detects delimiter and preamble offset, fingerprints
    the header row, hits the csv_format_templates cache. On miss, calls
    openrouter.call_llm with headers + 3-5 sample rows to extract a
    column-mapping, validates it, persists a new template, and applies the
    mapping. Returns the same ParsedPie shape as parse_t212_csv.
    """


class LLMParseError(CSVImportError):
    """Raised when the LLM call fails or returns an unusable mapping."""
```

Internal helpers (not exported):

- `_detect_dialect(raw: bytes) -> tuple[str, int]` — returns `(delimiter, preamble_rows)`. Uses Python's `csv.Sniffer` for delimiter, then walks rows until the first row whose tokens look like column headers (heuristic: all-strings, none parse as numbers).
- `_fingerprint(headers: list[str]) -> str` — lowercases, strips whitespace, joins with `|`, returns sha256 hex.
- `_extract_mapping_via_llm(client, headers, samples) -> dict` — builds the system prompt, calls `openrouter.call_llm`, parses the JSON envelope, raises `LLMParseError` on malformed output.
- `_validate_mapping(mapping, headers, first_row) -> None` — every named column must exist in `headers`; `qty_col`'s value on `first_row` must parse as a positive number; `cost_col` (if present) must parse as a number. Raises `LLMParseError` on failure.
- `_apply_mapping(rows, mapping) -> ParsedPie` — iterates remaining rows, builds `ParsedPosition` instances, computes totals from `qty * avg_cost` when explicit totals aren't present.
- `_synthesise_dummy(headers, mapping) -> dict` — produces the placeholder row for `sample_dummy`.

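As an illustration of the `_detect_dialect` heuristic, here is a sketch under a few assumptions: the input is UTF-8 (optionally with a BOM), the candidate delimiters are `,` `;` tab and `|`, and a header row has at least two non-empty cells. `csv.Sniffer` can fail to settle on a delimiter when preamble lines precede the header, so the sketch falls back to the most frequent candidate character; that fallback is an addition for the example, not something the spec prescribes.

```python
import csv
import io

CANDIDATE_DELIMITERS = ",;\t|"


def _is_number(token: str) -> bool:
    try:
        float(token.replace(",", ""))
        return True
    except ValueError:
        return False


def detect_dialect(raw: bytes) -> tuple[str, int]:
    """Return (delimiter, preamble_rows) for a broker CSV export."""
    text = raw.decode("utf-8-sig", errors="replace")
    try:
        delimiter = csv.Sniffer().sniff(text[:4096], delimiters=CANDIDATE_DELIMITERS).delimiter
    except csv.Error:
        # Sniffer gave up: pick the candidate that appears most often.
        delimiter = max(CANDIDATE_DELIMITERS, key=text.count)
    for index, row in enumerate(csv.reader(io.StringIO(text), delimiter=delimiter)):
        cells = [cell.strip() for cell in row]
        # Header heuristic: several non-empty cells, none of which is a number.
        if len(cells) >= 2 and all(cells) and not any(_is_number(c) for c in cells):
            return delimiter, index
    raise ValueError("no header row found")
```

The returned `preamble_rows` counts rows as `csv.reader` yields them, so blank lines in the preamble are included in the offset.
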
Reuses without modification:

- `app/services/openrouter.py::call_llm` — provider fallback chain + AICall ledger logging
- `app/services/csv_import.py::ParsedPie, ParsedPosition, CSVImportError` — same return type, same error hierarchy. `LLMParseError` inherits from `CSVImportError` so the route can catch both as one.

### `app/routers/universe.py::parse_portfolio` — modified

Two small changes:

1. Add `Depends(require_paid)` to the route decorator's `dependencies`. (Portfolio import has always been advertised as paid; this aligns the implementation.)
2. Wrap the existing `parse_t212_csv` call in a try/except that falls through to `parse_with_llm` on `CSVImportError`:

```python
try:
    pie = parse_t212_csv(raw)
except CSVImportError:
    # Imported lazily so the T212 happy path never loads the LLM parser.
    from app.services.llm_csv_parser import parse_with_llm, LLMParseError
    try:
        pie = await parse_with_llm(raw, session)
    except LLMParseError as e:
        # The error-handling table maps provider outages to 502, not 400;
        # how the route tells the two apart is left to the implementation plan.
        raise HTTPException(status_code=400, detail=str(e)) from e
```

Everything below this point in the function — resolve_slice loop, upsert_tickers, inline Yahoo fetch, response build — is unchanged. `pie` has the same shape regardless of branch.

### `app/models.py` — new model

`CsvFormatTemplate` declared alongside the other tables. Columns as in the data model table above.

### `app/templates/settings.html` — copy tweak

- Section heading: "Import portfolio (Trading 212 CSV)" → "Import portfolio (CSV)"
- Drop-zone label: "Drop a T212 pie CSV here" → "Drop your broker's portfolio CSV here"
- Drop-zone hint: append " · T212, IBKR, and others auto-detected" after the size limit
- The "Export your pie from T212" instructions paragraph stays as a help link — T212 is still the best-documented happy path — but its phrasing softens to "If you use Trading 212…"

## LLM prompt shape

System prompt fixes the schema. User message contains headers + samples.

```
SYSTEM: You are an expert at recognising broker portfolio CSV formats.
You will be given the header row and 3-5 sample data rows from a CSV.
Identify which column contains each field. Return ONLY JSON, no prose.

Schema:
{
  "ticker_col": "<header name or null>",
  "qty_col": "<header name or null>",
  "name_col": "<header name or null>",
  "cost_col": "<header name or null>",  // average price per share or unit cost
  "currency_col": "<header name or null>",
  "broker_label": "<short identifier like 'IBKR Activity Statement' or null>"
}

Rules:
- Use null when no column is a good match.
- ticker_col and qty_col are required; if either is missing return all nulls.
- Use the EXACT header string as it appears in the input.

USER: headers: ["Symbol","Position","Avg Price","Currency"]
samples:
AAPL,100,150.00,USD
MSFT,50,300.00,USD
...
```

The LLM never sees the entire file; it sees at most the first 5 data rows.
Token cost is therefore bounded regardless of portfolio size.

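On the receiving side, the reply has to be checked against that schema before anything else happens. The following is a sketch of that step only, assuming the model's reply arrives as a plain string; `MappingParseError` stands in for `LLMParseError`, and the function name is illustrative.

```python
import json
from typing import Optional

SCHEMA_KEYS = ("ticker_col", "qty_col", "name_col", "cost_col", "currency_col", "broker_label")
REQUIRED_KEYS = ("ticker_col", "qty_col")


class MappingParseError(ValueError):
    """Placeholder for LLMParseError in this sketch."""


def parse_mapping_reply(reply: str) -> dict[str, Optional[str]]:
    """Decode the LLM reply and enforce the schema from the system prompt."""
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise MappingParseError("LLM reply is not valid JSON") from exc
    if not isinstance(payload, dict):
        raise MappingParseError("LLM reply is not a JSON object")
    # Keep only the schema keys; anything extra the model added is ignored.
    mapping = {key: payload.get(key) for key in SCHEMA_KEYS}
    for key, value in mapping.items():
        if value is not None and not isinstance(value, str):
            raise MappingParseError(f"{key} must be a string or null")
    # Per the prompt rules, a null required column means "not recognised".
    if any(mapping[key] is None for key in REQUIRED_KEYS):
        raise MappingParseError("couldn't recognise as portfolio CSV")
    return mapping
```

Checking that the named columns actually exist in the header row is a separate concern and belongs to mapping validation.
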
## Error handling

| Failure | Response | Ledger |
|---|---|---|
| LLM provider down | 502 "couldn't parse — try again later" | AICall status=failed |
| LLM returns non-JSON | 400 "couldn't recognise as portfolio CSV" | AICall status=ok, no template stored |
| Mapping missing required columns (ticker/qty) | 400 same | AICall status=ok, no template stored |
| Mapping references non-existent column | 400 same | AICall status=ok, no template stored |
| Mapping validates but row parse fails on numerics | 400 same | template NOT stored |
| Cache hit but row parse fails (format drifted under us) | 400 + evict the stale template in its own commit before raising | — |

The "delete stale template on parse failure" rule is the only self-healing
behaviour: if a broker quietly changes their export shape, the next failed
re-import evicts the old mapping in a dedicated commit (so the eviction
survives the request-failure rollback) and the LLM gets another shot on the
subsequent upload. Without this, a once-good template would haunt the cache
forever. We do **not** auto-retry the LLM in the same request — too much
hidden cost on a single user action.

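The "evict in its own commit" idea can be illustrated with the standard-library `sqlite3` module. This is only a sketch of the transaction ordering: the real code would go through the async session, and the function names here are hypothetical.

```python
import sqlite3
from typing import Callable, TypeVar

T = TypeVar("T")


def evict_stale_template(conn: sqlite3.Connection, fingerprint: str) -> None:
    """Delete the cached template and commit immediately."""
    conn.execute("DELETE FROM csv_format_templates WHERE fingerprint = ?", (fingerprint,))
    conn.commit()


def replay_cached_template(
    conn: sqlite3.Connection, fingerprint: str, parse_rows: Callable[[], T]
) -> T:
    """Run the cached-mapping parse; on failure evict the template, then re-raise."""
    try:
        return parse_rows()
    except ValueError:
        conn.rollback()  # discard any request-scoped writes first
        evict_stale_template(conn, fingerprint)  # committed on its own
        raise  # the caller still sees the failure and returns a 400
```

Because the delete is committed before the exception propagates, a later rollback in the request's error path cannot bring the stale row back, and the next upload of that format takes the cache-miss branch.
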
## Testing

`tests/test_llm_csv_parser.py`:

- **Fingerprint stability** — case/whitespace/BOM variants of the same headers hash to the same fingerprint.
- **Cache hit path** — pre-populate a `CsvFormatTemplate` row, mock `call_llm` to fail loudly, assert it is NOT called, assert positions come out correct.
- **Cache miss path** — mock `call_llm` to return a valid mapping JSON, assert a row is inserted, assert positions come out correct, assert the synthesised `sample_dummy` contains placeholder strings only.
- **LLM returns malformed JSON** — raises `LLMParseError`, no template stored.
- **LLM maps to non-existent column** — raises `LLMParseError`, no template stored.
- **LLM maps qty to a non-numeric column** — raises `LLMParseError` on validation.
- **Stale template self-heal** — pre-populate a template that no longer matches the file, simulate row-parse failure, assert the row is deleted and a 400 returned.
- **Integration** — POST a fabricated IBKR-shaped fixture to `/api/portfolio/parse`, assert ParsedPie round-trips, assert no second LLM call on a repeat upload.

Existing `tests/test_csv_import.py` must still pass — the T212 happy path is unchanged.

## Verification

End-to-end manual check after deploy:

1. Upload a T212 fixture → existing path stays unchanged (same dashboard load behaviour).
2. Upload a fabricated IBKR CSV → first upload calls LLM, returns positions, template row created in DB.
3. Re-upload the same IBKR CSV → second call has zero LLM cost (verify by counting `ai_calls` rows before/after), `use_count` increments to 2.
4. Inspect `csv_format_templates` row: confirm `headers` matches the upload's headers, `sample_dummy` contains placeholder strings, no real holdings anywhere.
5. Upload random garbage (e.g. a screenshot renamed `.csv`) → 400 with clean error, no template stored, AICall row logged.
6. Free-tier account attempts import → 402 (paid gating).

## Open questions for the implementation plan

- Whether to read sample rows with `csv.reader` and re-encode them as text for the LLM (safer for embedded commas/quotes), or pass the raw first-N-lines verbatim. Default: the safer reader path.
- Whether to cap LLM-parsed portfolios at the same 1 MB limit as T212 (yes) and whether to add a separate cap on number-of-rows fed to the LLM as samples (yes, 5).
- Whether to log the fingerprint to the request log on cache hit/miss for operability. Default: yes, at INFO level, with `event_type="csv.format.cache_hit"` / `"csv.format.cache_miss"`.

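For the first question, the "safer reader path" might look like the sketch below: parse with `csv.reader` using the detected delimiter, keep at most five data rows, and write them back out as comma-separated text so embedded delimiters end up properly quoted. The function name and the exact message layout are assumptions modelled on the prompt example in this spec.

```python
import csv
import io
import json

MAX_SAMPLE_ROWS = 5  # cap proposed in the open questions


def build_user_message(text: str, delimiter: str, preamble_rows: int) -> str:
    """Render the header row and up to five re-encoded sample rows for the LLM."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))[preamble_rows:]
    headers, samples = rows[0], rows[1 : 1 + MAX_SAMPLE_ROWS]
    buffer = io.StringIO()
    # csv.writer quotes any cell containing a comma, quote, or newline.
    csv.writer(buffer, lineterminator="\n").writerows(samples)
    header_json = json.dumps(headers, separators=(",", ":"))
    return f"headers: {header_json}\nsamples:\n{buffer.getvalue()}"
```

Passing the raw first lines verbatim would be shorter to implement, but a name such as `Apple, Inc.` in a semicolon-delimited export would then be ambiguous once the model reads it as comma-separated text.
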