From 02545159896381a26423b7374a65e8705cffbdf1 Mon Sep 17 00:00:00 2001 From: Giorgio Gilestro Date: Wed, 27 May 2026 11:21:43 +0200 Subject: [PATCH] =?UTF-8?q?docs:=20refine=20LLM-CSV=20spec=20=E2=80=94=20k?= =?UTF-8?q?eep=20real=20sample=20row,=20drop=20user=20attribution?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Drop first_seen_user_id; sample is anonymous by construction - Rename sample_dummy → sample_row, store the upload's first real data row verbatim (one row, no totals, no other positions, no link to a user). Narrow, deliberate exception to the "no holdings persisted" invariant — gives the operator material for hand-writing future native parsers. - Drop the cache self-heal behaviour; operator owns eviction. Reinforce the non-goal of auto-promoting learned formats to code. --- ...26-05-27-llm-csv-fallback-parser-design.md | 71 ++++++++++--------- 1 file changed, 39 insertions(+), 32 deletions(-) diff --git a/docs/superpowers/specs/2026-05-27-llm-csv-fallback-parser-design.md b/docs/superpowers/specs/2026-05-27-llm-csv-fallback-parser-design.md index 97532f5..c946c8f 100644 --- a/docs/superpowers/specs/2026-05-27-llm-csv-fallback-parser-design.md +++ b/docs/superpowers/specs/2026-05-27-llm-csv-fallback-parser-design.md @@ -22,9 +22,12 @@ in a new `csv_format_templates` table. Every subsequent upload of the same format — by any user — replays the cached mapping deterministically, with no LLM call. -The cache row stores headers and a synthetic placeholder ("dummy") row, never -real user data. The mapping itself is a column-name dictionary, also free of -holdings. +The cache row stores the header row and a single anonymous sample data row +(the first row from the originating upload, verbatim). No user identifier is +recorded — the row is not linked back to whoever uploaded it. The purpose of +the sample is to give the operator material to look at when designing future +native parsers; this collection is **passive learning only**, the system +never attempts to author or modify parser code automatically. Portfolio import is already advertised as a paid-only feature; we make that explicit at the route level as part of this work. @@ -42,8 +45,13 @@ explicit at the route level as part of this work. - Per-broker UI customisation. The drop-zone stays generic. - A human admin queue for reviewing LLM-discovered formats. Operator can inspect rows directly in the DB if curious. -- Promoting "trusted" formats to native parsers. That's a future evolution - if a single broker dominates LLM-parsed traffic. +- **Auto-promoting learned formats to native parsers.** The operator will + hand-write any native parser by looking at the collected sample rows. The + system never writes or modifies code. +- Self-healing or auto-evicting stale cache entries. If a broker silently + changes their export shape under us, the cached mapping will start + producing parse errors; the operator deletes the row manually. We do not + invalidate cache entries automatically. - Multi-stage / verification LLM passes. One call per first-time format. ## Architecture @@ -83,10 +91,11 @@ LLM returns column names, not transcribed numbers. Three benefits: ### Why global cache, not per-user The column structure of an IBKR Activity Statement is a property of IBKR, not -of any individual user. The stored "dummy" contains no PII (column headers -are public; the sample row is synthetic). So global cache is strictly better: -faster onboarding for the second IBKR user, and the operator still gets a -`first_seen_user_id` audit column for forensic traceability. +of any individual user. The cache row contains no user identifier — the +sample data row is stored verbatim but anonymously, with nothing linking it +to the uploader. Global cache is strictly better: faster onboarding for the +second IBKR user, and the collected samples form a small, useful corpus for +hand-writing native parsers later. ## Data model @@ -96,26 +105,27 @@ New table `csv_format_templates`: |---|---|---| | `id` | int PK | | | `fingerprint` | `VARCHAR(64) UNIQUE NOT NULL` | sha256 hex of normalised header tuple | -| `headers` | JSON | List of strings — actual header row, no PII | -| `sample_dummy` | JSON | One synthetic placeholder row for human eyeball | +| `headers` | JSON | List of strings — actual header row from the upload | +| `sample_row` | JSON | First data row from the originating upload, verbatim. Not linked to any user. | | `mapping` | JSON | `{ticker_col, qty_col, name_col, cost_col, currency_col}` | | `preamble_rows` | INT NOT NULL DEFAULT 0 | Non-data lines before the header row | | `delimiter` | CHAR(1) NOT NULL DEFAULT ',' | | | `broker_label` | VARCHAR(128) | LLM-identified label, e.g. "Interactive Brokers Activity Statement" | -| `first_seen_user_id` | INT NULL, FK users(id) ON DELETE SET NULL | Audit only | -| `first_seen_at` | DATETIME(tz) NOT NULL | | -| `use_count` | INT NOT NULL DEFAULT 1 | Bumped on cache hit | +| `first_seen_at` | DATETIME(tz) NOT NULL | When the format was first cached | +| `use_count` | INT NOT NULL DEFAULT 1 | Bumped on each successful cache hit | | `last_used_at` | DATETIME(tz) NOT NULL | | | `llm_model` | VARCHAR(64) | Provenance of the initial extraction | | `llm_cost_usd` | FLOAT | Same | Migration: `alembic/versions/0021_csv_format_template.py` (based on `0020`). -No raw CSV bytes are ever stored. `headers` and `sample_dummy` are the only -payloads. `sample_dummy` is synthesised post-extraction by replacing column -values with placeholder strings (`"TICKER"`, `"100"`, `"1.50"`) keyed to the -mapping — the operator can eyeball the format shape without seeing any real -holdings. +The full uploaded CSV is **not** stored — only the header row plus a single +data row (`sample_row`). No `user_id` column exists on this table; the sample +is anonymous by construction. This is a deliberate, narrow exception to the +otherwise-strict "no holdings persisted" invariant: we keep one row per +format so the operator has concrete material to look at when hand-writing a +future native parser. One anonymous row carries no portfolio context (no +totals, no other positions) and cannot be linked back to an account. ## Components @@ -148,7 +158,6 @@ Internal helpers (not exported): - `_extract_mapping_via_llm(client, headers, samples) -> dict` — builds the system prompt, calls `openrouter.call_llm`, parses the JSON envelope, raises `LLMParseError` on malformed output. - `_validate_mapping(mapping, headers, first_row) -> None` — every named column must exist in `headers`; `qty_col`'s value on `first_row` must parse as a positive number; `cost_col` (if present) must parse as a number. Raises `LLMParseError` on failure. - `_apply_mapping(rows, mapping) -> ParsedPie` — iterates remaining rows, builds `ParsedPosition` instances, computes totals from `qty * avg_cost` when explicit totals aren't present. -- `_synthesise_dummy(headers, mapping) -> dict` — produces the placeholder row for `sample_dummy`. Reuses without modification: @@ -229,27 +238,25 @@ Token cost is bounded and uniform regardless of portfolio size. | Mapping missing required columns (ticker/qty) | 400 same | AICall status=ok, no template stored | | Mapping references non-existent column | 400 same | AICall status=ok, no template stored | | Mapping validates but row parse fails on numerics | 400 same | template NOT stored | -| Cache hit but row parse fails (format drifted under us) | 400 + evict the stale template in its own commit before raising | — | +| Cache hit but row parse fails (format drifted under us) | 400 with parse error | — | -The "delete stale template on parse failure" rule is the only self-healing -behaviour: if a broker quietly changes their export shape, the next failed -re-import evicts the old mapping in a dedicated commit (so the eviction -survives the request-failure rollback) and the LLM gets another shot on the -subsequent upload. Without this, a once-good template would haunt the cache -forever. We do **not** auto-retry the LLM in the same request — too much -hidden cost on a single user action. +If a broker quietly changes their CSV shape such that a previously-good +cached mapping starts producing parse failures, the user sees an error and +the operator deletes the offending `csv_format_templates` row by hand. No +automatic eviction, no automatic retry. The cache is a learning store, not +a self-managing system. ## Testing `tests/test_llm_csv_parser.py`: - **Fingerprint stability** — case/whitespace/BOM variants of the same headers hash to the same fingerprint. -- **Cache hit path** — pre-populate a `CsvFormatTemplate` row, mock `call_llm` to fail loudly, assert it is NOT called, assert positions come out correct. -- **Cache miss path** — mock `call_llm` to return a valid mapping JSON, assert a row is inserted, assert positions come out correct, assert the synthesised `sample_dummy` contains placeholder strings only. +- **Cache hit path** — pre-populate a `CsvFormatTemplate` row, mock `call_llm` to fail loudly, assert it is NOT called, assert positions come out correct, assert `use_count` is incremented. +- **Cache miss path** — mock `call_llm` to return a valid mapping JSON, assert a row is inserted with the upload's actual first data row as `sample_row` and no user_id anywhere, assert positions come out correct. - **LLM returns malformed JSON** — raises `LLMParseError`, no template stored. - **LLM maps to non-existent column** — raises `LLMParseError`, no template stored. - **LLM maps qty to a non-numeric column** — raises `LLMParseError` on validation. -- **Stale template self-heal** — pre-populate a template that no longer matches the file, simulate row-parse failure, assert the row is deleted and a 400 returned. +- **Stale cached mapping on parse failure** — pre-populate a template whose mapping no longer matches the file content, assert a 400 is returned and the template is NOT deleted automatically (operator owns eviction). - **Integration** — POST a fabricated IBKR-shaped fixture to `/api/portfolio/parse`, assert ParsedPie round-trips, assert no second LLM call on a repeat upload. Existing `tests/test_csv_import.py` must still pass — the T212 happy path is unchanged. @@ -261,7 +268,7 @@ End-to-end manual check after deploy: 1. Upload a T212 fixture → exists path stays unchanged (same dashboard load behaviour). 2. Upload a fabricated IBKR CSV → first upload calls LLM, returns positions, template row created in DB. 3. Re-upload the same IBKR CSV → second call has zero LLM cost (verify by counting `ai_calls` rows before/after), `use_count` increments to 2. -4. Inspect `csv_format_templates` row: confirm `headers` matches the upload's headers, `sample_dummy` contains placeholder strings, no real holdings anywhere. +4. Inspect `csv_format_templates` row: confirm `headers` matches the upload's headers, `sample_row` is the first real data row, no `user_id` column exists on the table. 5. Upload random garbage (e.g. a screenshot renamed `.csv`) → 400 with clean error, no template stored, AICall row logged. 6. Free-tier account attempts import → 402 (paid gating).