docs: refine LLM-CSV spec — keep real sample row, drop user attribution

- Drop first_seen_user_id; sample is anonymous by construction
- Rename sample_dummy → sample_row, store the upload's first real data
  row verbatim (one row, no totals, no other positions, no link to a
  user). Narrow, deliberate exception to the "no holdings persisted"
  invariant — gives the operator material for hand-writing future
  native parsers.
- Drop the cache self-heal behaviour; operator owns eviction. Reinforce
  the non-goal of auto-promoting learned formats to code.
This commit is contained in:
Giorgio Gilestro 2026-05-27 11:21:43 +02:00
parent 263ecc0d3b
commit 0254515989

View file

@ -22,9 +22,12 @@ in a new `csv_format_templates` table. Every subsequent upload of the same
format — by any user — replays the cached mapping deterministically, with no
LLM call.
The cache row stores headers and a synthetic placeholder ("dummy") row, never
real user data. The mapping itself is a column-name dictionary, also free of
holdings.
The cache row stores the header row and a single anonymous sample data row
(the first row from the originating upload, verbatim). No user identifier is
recorded — the row is not linked back to whoever uploaded it. The purpose of
the sample is to give the operator material to look at when designing future
native parsers; this collection is **passive learning only**, the system
never attempts to author or modify parser code automatically.
Portfolio import is already advertised as a paid-only feature; we make that
explicit at the route level as part of this work.
@ -42,8 +45,13 @@ explicit at the route level as part of this work.
- Per-broker UI customisation. The drop-zone stays generic.
- A human admin queue for reviewing LLM-discovered formats. Operator can
inspect rows directly in the DB if curious.
- Promoting "trusted" formats to native parsers. That's a future evolution
if a single broker dominates LLM-parsed traffic.
- **Auto-promoting learned formats to native parsers.** The operator will
hand-write any native parser by looking at the collected sample rows. The
system never writes or modifies code.
- Self-healing or auto-evicting stale cache entries. If a broker silently
changes their export shape under us, the cached mapping will start
producing parse errors; the operator deletes the row manually. We do not
invalidate cache entries automatically.
- Multi-stage / verification LLM passes. One call per first-time format.
## Architecture
@ -83,10 +91,11 @@ LLM returns column names, not transcribed numbers. Three benefits:
### Why global cache, not per-user
The column structure of an IBKR Activity Statement is a property of IBKR, not
of any individual user. The stored "dummy" contains no PII (column headers
are public; the sample row is synthetic). So global cache is strictly better:
faster onboarding for the second IBKR user, and the operator still gets a
`first_seen_user_id` audit column for forensic traceability.
of any individual user. The cache row contains no user identifier — the
sample data row is stored verbatim but anonymously, with nothing linking it
to the uploader. Global cache is strictly better: faster onboarding for the
second IBKR user, and the collected samples form a small, useful corpus for
hand-writing native parsers later.
## Data model
@ -96,26 +105,27 @@ New table `csv_format_templates`:
|---|---|---|
| `id` | int PK | |
| `fingerprint` | `VARCHAR(64) UNIQUE NOT NULL` | sha256 hex of normalised header tuple |
| `headers` | JSON | List of strings — actual header row, no PII |
| `sample_dummy` | JSON | One synthetic placeholder row for human eyeball |
| `headers` | JSON | List of strings — actual header row from the upload |
| `sample_row` | JSON | First data row from the originating upload, verbatim. Not linked to any user. |
| `mapping` | JSON | `{ticker_col, qty_col, name_col, cost_col, currency_col}` |
| `preamble_rows` | INT NOT NULL DEFAULT 0 | Non-data lines before the header row |
| `delimiter` | CHAR(1) NOT NULL DEFAULT ',' | |
| `broker_label` | VARCHAR(128) | LLM-identified label, e.g. "Interactive Brokers Activity Statement" |
| `first_seen_user_id` | INT NULL, FK users(id) ON DELETE SET NULL | Audit only |
| `first_seen_at` | DATETIME(tz) NOT NULL | |
| `use_count` | INT NOT NULL DEFAULT 1 | Bumped on cache hit |
| `first_seen_at` | DATETIME(tz) NOT NULL | When the format was first cached |
| `use_count` | INT NOT NULL DEFAULT 1 | Bumped on each successful cache hit |
| `last_used_at` | DATETIME(tz) NOT NULL | |
| `llm_model` | VARCHAR(64) | Provenance of the initial extraction |
| `llm_cost_usd` | FLOAT | Same |
Migration: `alembic/versions/0021_csv_format_template.py` (based on `0020`).
No raw CSV bytes are ever stored. `headers` and `sample_dummy` are the only
payloads. `sample_dummy` is synthesised post-extraction by replacing column
values with placeholder strings (`"TICKER"`, `"100"`, `"1.50"`) keyed to the
mapping — the operator can eyeball the format shape without seeing any real
holdings.
The full uploaded CSV is **not** stored — only the header row plus a single
data row (`sample_row`). No `user_id` column exists on this table; the sample
is anonymous by construction. This is a deliberate, narrow exception to the
otherwise-strict "no holdings persisted" invariant: we keep one row per
format so the operator has concrete material to look at when hand-writing a
future native parser. One anonymous row carries no portfolio context (no
totals, no other positions) and cannot be linked back to an account.
## Components
@ -148,7 +158,6 @@ Internal helpers (not exported):
- `_extract_mapping_via_llm(client, headers, samples) -> dict` — builds the system prompt, calls `openrouter.call_llm`, parses the JSON envelope, raises `LLMParseError` on malformed output.
- `_validate_mapping(mapping, headers, first_row) -> None` — every named column must exist in `headers`; `qty_col`'s value on `first_row` must parse as a positive number; `cost_col` (if present) must parse as a number. Raises `LLMParseError` on failure.
- `_apply_mapping(rows, mapping) -> ParsedPie` — iterates remaining rows, builds `ParsedPosition` instances, computes totals from `qty * avg_cost` when explicit totals aren't present.
- `_synthesise_dummy(headers, mapping) -> dict` — produces the placeholder row for `sample_dummy`.
Reuses without modification:
@ -229,27 +238,25 @@ Token cost is bounded and uniform regardless of portfolio size.
| Mapping missing required columns (ticker/qty) | 400 same | AICall status=ok, no template stored |
| Mapping references non-existent column | 400 same | AICall status=ok, no template stored |
| Mapping validates but row parse fails on numerics | 400 same | template NOT stored |
| Cache hit but row parse fails (format drifted under us) | 400 + evict the stale template in its own commit before raising | — |
| Cache hit but row parse fails (format drifted under us) | 400 with parse error | — |
The "delete stale template on parse failure" rule is the only self-healing
behaviour: if a broker quietly changes their export shape, the next failed
re-import evicts the old mapping in a dedicated commit (so the eviction
survives the request-failure rollback) and the LLM gets another shot on the
subsequent upload. Without this, a once-good template would haunt the cache
forever. We do **not** auto-retry the LLM in the same request — too much
hidden cost on a single user action.
If a broker quietly changes their CSV shape such that a previously-good
cached mapping starts producing parse failures, the user sees an error and
the operator deletes the offending `csv_format_templates` row by hand. No
automatic eviction, no automatic retry. The cache is a learning store, not
a self-managing system.
## Testing
`tests/test_llm_csv_parser.py`:
- **Fingerprint stability** — case/whitespace/BOM variants of the same headers hash to the same fingerprint.
- **Cache hit path** — pre-populate a `CsvFormatTemplate` row, mock `call_llm` to fail loudly, assert it is NOT called, assert positions come out correct.
- **Cache miss path** — mock `call_llm` to return a valid mapping JSON, assert a row is inserted, assert positions come out correct, assert the synthesised `sample_dummy` contains placeholder strings only.
- **Cache hit path** — pre-populate a `CsvFormatTemplate` row, mock `call_llm` to fail loudly, assert it is NOT called, assert positions come out correct, assert `use_count` is incremented.
- **Cache miss path** — mock `call_llm` to return a valid mapping JSON, assert a row is inserted with the upload's actual first data row as `sample_row` and no user_id anywhere, assert positions come out correct.
- **LLM returns malformed JSON** — raises `LLMParseError`, no template stored.
- **LLM maps to non-existent column** — raises `LLMParseError`, no template stored.
- **LLM maps qty to a non-numeric column** — raises `LLMParseError` on validation.
- **Stale template self-heal** — pre-populate a template that no longer matches the file, simulate row-parse failure, assert the row is deleted and a 400 returned.
- **Stale cached mapping on parse failure** — pre-populate a template whose mapping no longer matches the file content, assert a 400 is returned and the template is NOT deleted automatically (operator owns eviction).
- **Integration** — POST a fabricated IBKR-shaped fixture to `/api/portfolio/parse`, assert ParsedPie round-trips, assert no second LLM call on a repeat upload.
Existing `tests/test_csv_import.py` must still pass — the T212 happy path is unchanged.
@ -261,7 +268,7 @@ End-to-end manual check after deploy:
1. Upload a T212 fixture → exists path stays unchanged (same dashboard load behaviour).
2. Upload a fabricated IBKR CSV → first upload calls LLM, returns positions, template row created in DB.
3. Re-upload the same IBKR CSV → second call has zero LLM cost (verify by counting `ai_calls` rows before/after), `use_count` increments to 2.
4. Inspect `csv_format_templates` row: confirm `headers` matches the upload's headers, `sample_dummy` contains placeholder strings, no real holdings anywhere.
4. Inspect `csv_format_templates` row: confirm `headers` matches the upload's headers, `sample_row` is the first real data row, no `user_id` column exists on the table.
5. Upload random garbage (e.g. a screenshot renamed `.csv`) → 400 with clean error, no template stored, AICall row logged.
6. Free-tier account attempts import → 402 (paid gating).