docs: refine LLM-CSV spec — keep real sample row, drop user attribution

- Drop first_seen_user_id; sample is anonymous by construction - Rename sample_dummy → sample_row, store the upload's first real data row verbatim (one row, no totals, no other positions, no link to a user). Narrow, deliberate exception to the "no holdings persisted" invariant — gives the operator material for hand-writing future native parsers. - Drop the cache self-heal behaviour; operator owns eviction. Reinforce the non-goal of auto-promoting learned formats to code.
2026-05-27 11:21:43 +02:00 · 2026-05-27 11:21:43 +02:00 · 0254515989
commit 0254515989
parent 263ecc0d3b
1 changed files with 39 additions and 32 deletions
--- a/docs/superpowers/specs/2026-05-27-llm-csv-fallback-parser-design.md
+++ b/docs/superpowers/specs/2026-05-27-llm-csv-fallback-parser-design.md
@ -22,9 +22,12 @@ in a new `csv_format_templates` table. Every subsequent upload of the same
 format — by any user — replays the cached mapping deterministically, with no
 LLM call.

-The cache row stores headers and a synthetic placeholder ("dummy") row, never
-real user data. The mapping itself is a column-name dictionary, also free of
-holdings.
+The cache row stores the header row and a single anonymous sample data row
+(the first row from the originating upload, verbatim). No user identifier is
+recorded — the row is not linked back to whoever uploaded it. The purpose of
+the sample is to give the operator material to look at when designing future
+native parsers; this collection is **passive learning only**, the system
+never attempts to author or modify parser code automatically.

 Portfolio import is already advertised as a paid-only feature; we make that
 explicit at the route level as part of this work.
@ -42,8 +45,13 @@ explicit at the route level as part of this work.
 - Per-broker UI customisation. The drop-zone stays generic.
 - A human admin queue for reviewing LLM-discovered formats. Operator can
  inspect rows directly in the DB if curious.
- Promoting "trusted" formats to native parsers. That's a future evolution
-  if a single broker dominates LLM-parsed traffic.
+- **Auto-promoting learned formats to native parsers.** The operator will
+  hand-write any native parser by looking at the collected sample rows. The
+  system never writes or modifies code.
+- Self-healing or auto-evicting stale cache entries. If a broker silently
+  changes their export shape under us, the cached mapping will start
+  producing parse errors; the operator deletes the row manually. We do not
+  invalidate cache entries automatically.
 - Multi-stage / verification LLM passes. One call per first-time format.

 ## Architecture
@ -83,10 +91,11 @@ LLM returns column names, not transcribed numbers. Three benefits:
 ### Why global cache, not per-user

 The column structure of an IBKR Activity Statement is a property of IBKR, not
-of any individual user. The stored "dummy" contains no PII (column headers
-are public; the sample row is synthetic). So global cache is strictly better:
-faster onboarding for the second IBKR user, and the operator still gets a
-`first_seen_user_id` audit column for forensic traceability.
+of any individual user. The cache row contains no user identifier — the
+sample data row is stored verbatim but anonymously, with nothing linking it
+to the uploader. Global cache is strictly better: faster onboarding for the
+second IBKR user, and the collected samples form a small, useful corpus for
+hand-writing native parsers later.

 ## Data model

@ -96,26 +105,27 @@ New table `csv_format_templates`:
 |---|---|---|
 | `id` | int PK | |
 | `fingerprint` | `VARCHAR(64) UNIQUE NOT NULL` | sha256 hex of normalised header tuple |
-| `headers` | JSON | List of strings — actual header row, no PII |
-| `sample_dummy` | JSON | One synthetic placeholder row for human eyeball |
+| `headers` | JSON | List of strings — actual header row from the upload |
+| `sample_row` | JSON | First data row from the originating upload, verbatim. Not linked to any user. |
 | `mapping` | JSON | `{ticker_col, qty_col, name_col, cost_col, currency_col}` |
 | `preamble_rows` | INT NOT NULL DEFAULT 0 | Non-data lines before the header row |
 | `delimiter` | CHAR(1) NOT NULL DEFAULT ',' | |
 | `broker_label` | VARCHAR(128) | LLM-identified label, e.g. "Interactive Brokers Activity Statement" |
-| `first_seen_user_id` | INT NULL, FK users(id) ON DELETE SET NULL | Audit only |
-| `first_seen_at` | DATETIME(tz) NOT NULL | |
-| `use_count` | INT NOT NULL DEFAULT 1 | Bumped on cache hit |
+| `first_seen_at` | DATETIME(tz) NOT NULL | When the format was first cached |
+| `use_count` | INT NOT NULL DEFAULT 1 | Bumped on each successful cache hit |
 | `last_used_at` | DATETIME(tz) NOT NULL | |
 | `llm_model` | VARCHAR(64) | Provenance of the initial extraction |
 | `llm_cost_usd` | FLOAT | Same |

 Migration: `alembic/versions/0021_csv_format_template.py` (based on `0020`).

-No raw CSV bytes are ever stored. `headers` and `sample_dummy` are the only
-payloads. `sample_dummy` is synthesised post-extraction by replacing column
-values with placeholder strings (`"TICKER"`, `"100"`, `"1.50"`) keyed to the
-mapping — the operator can eyeball the format shape without seeing any real
-holdings.
+The full uploaded CSV is **not** stored — only the header row plus a single
+data row (`sample_row`). No `user_id` column exists on this table; the sample
+is anonymous by construction. This is a deliberate, narrow exception to the
+otherwise-strict "no holdings persisted" invariant: we keep one row per
+format so the operator has concrete material to look at when hand-writing a
+future native parser. One anonymous row carries no portfolio context (no
+totals, no other positions) and cannot be linked back to an account.

 ## Components

@ -148,7 +158,6 @@ Internal helpers (not exported):
 - `_extract_mapping_via_llm(client, headers, samples) -> dict` — builds the system prompt, calls `openrouter.call_llm`, parses the JSON envelope, raises `LLMParseError` on malformed output.
 - `_validate_mapping(mapping, headers, first_row) -> None` — every named column must exist in `headers`; `qty_col`'s value on `first_row` must parse as a positive number; `cost_col` (if present) must parse as a number. Raises `LLMParseError` on failure.
 - `_apply_mapping(rows, mapping) -> ParsedPie` — iterates remaining rows, builds `ParsedPosition` instances, computes totals from `qty * avg_cost` when explicit totals aren't present.
- `_synthesise_dummy(headers, mapping) -> dict` — produces the placeholder row for `sample_dummy`.

 Reuses without modification:

@ -229,27 +238,25 @@ Token cost is bounded and uniform regardless of portfolio size.
 | Mapping missing required columns (ticker/qty) | 400 same | AICall status=ok, no template stored |
 | Mapping references non-existent column | 400 same | AICall status=ok, no template stored |
 | Mapping validates but row parse fails on numerics | 400 same | template NOT stored |
-| Cache hit but row parse fails (format drifted under us) | 400 + evict the stale template in its own commit before raising | — |
+| Cache hit but row parse fails (format drifted under us) | 400 with parse error | — |

-The "delete stale template on parse failure" rule is the only self-healing
-behaviour: if a broker quietly changes their export shape, the next failed
-re-import evicts the old mapping in a dedicated commit (so the eviction
-survives the request-failure rollback) and the LLM gets another shot on the
-subsequent upload. Without this, a once-good template would haunt the cache
-forever. We do **not** auto-retry the LLM in the same request — too much
-hidden cost on a single user action.
+If a broker quietly changes their CSV shape such that a previously-good
+cached mapping starts producing parse failures, the user sees an error and
+the operator deletes the offending `csv_format_templates` row by hand. No
+automatic eviction, no automatic retry. The cache is a learning store, not
+a self-managing system.

 ## Testing

 `tests/test_llm_csv_parser.py`:

 - **Fingerprint stability** — case/whitespace/BOM variants of the same headers hash to the same fingerprint.
- **Cache hit path** — pre-populate a `CsvFormatTemplate` row, mock `call_llm` to fail loudly, assert it is NOT called, assert positions come out correct.
- **Cache miss path** — mock `call_llm` to return a valid mapping JSON, assert a row is inserted, assert positions come out correct, assert the synthesised `sample_dummy` contains placeholder strings only.
+- **Cache hit path** — pre-populate a `CsvFormatTemplate` row, mock `call_llm` to fail loudly, assert it is NOT called, assert positions come out correct, assert `use_count` is incremented.
+- **Cache miss path** — mock `call_llm` to return a valid mapping JSON, assert a row is inserted with the upload's actual first data row as `sample_row` and no user_id anywhere, assert positions come out correct.
 - **LLM returns malformed JSON** — raises `LLMParseError`, no template stored.
 - **LLM maps to non-existent column** — raises `LLMParseError`, no template stored.
 - **LLM maps qty to a non-numeric column** — raises `LLMParseError` on validation.
- **Stale template self-heal** — pre-populate a template that no longer matches the file, simulate row-parse failure, assert the row is deleted and a 400 returned.
+- **Stale cached mapping on parse failure** — pre-populate a template whose mapping no longer matches the file content, assert a 400 is returned and the template is NOT deleted automatically (operator owns eviction).
 - **Integration** — POST a fabricated IBKR-shaped fixture to `/api/portfolio/parse`, assert ParsedPie round-trips, assert no second LLM call on a repeat upload.

 Existing `tests/test_csv_import.py` must still pass — the T212 happy path is unchanged.
@ -261,7 +268,7 @@ End-to-end manual check after deploy:
 1. Upload a T212 fixture → exists path stays unchanged (same dashboard load behaviour).
 2. Upload a fabricated IBKR CSV → first upload calls LLM, returns positions, template row created in DB.
 3. Re-upload the same IBKR CSV → second call has zero LLM cost (verify by counting `ai_calls` rows before/after), `use_count` increments to 2.
-4. Inspect `csv_format_templates` row: confirm `headers` matches the upload's headers, `sample_dummy` contains placeholder strings, no real holdings anywhere.
+4. Inspect `csv_format_templates` row: confirm `headers` matches the upload's headers, `sample_row` is the first real data row, no `user_id` column exists on the table.
 5. Upload random garbage (e.g. a screenshot renamed `.csv`) → 400 with clean error, no template stored, AICall row logged.
 6. Free-tier account attempts import → 402 (paid gating).