docs: refine LLM-CSV spec — keep real sample row, drop user attribution

- Drop first_seen_user_id; sample is anonymous by construction
- Rename sample_dummy → sample_row, store the upload's first real data row
  verbatim (one row, no totals, no other positions, no link to a user).
  Narrow, deliberate exception to the "no holdings persisted" invariant —
  gives the operator material for hand-writing future native parsers.
- Drop the cache self-heal behaviour; operator owns eviction. Reinforce the
  non-goal of auto-promoting learned formats to code.

parent 263ecc0d3b
commit 0254515989
1 changed file with 39 additions and 32 deletions

@@ -22,9 +22,12 @@ in a new `csv_format_templates` table. Every subsequent upload of the same
 format — by any user — replays the cached mapping deterministically, with no
 LLM call.
 
-The cache row stores headers and a synthetic placeholder ("dummy") row, never
-real user data. The mapping itself is a column-name dictionary, also free of
-holdings.
+The cache row stores the header row and a single anonymous sample data row
+(the first row from the originating upload, verbatim). No user identifier is
+recorded — the row is not linked back to whoever uploaded it. The purpose of
+the sample is to give the operator material to look at when designing future
+native parsers; this collection is **passive learning only**, the system
+never attempts to author or modify parser code automatically.
 
 Portfolio import is already advertised as a paid-only feature; we make that
 explicit at the route level as part of this work.
@@ -42,8 +45,13 @@ explicit at the route level as part of this work.
 - Per-broker UI customisation. The drop-zone stays generic.
 - A human admin queue for reviewing LLM-discovered formats. Operator can
 inspect rows directly in the DB if curious.
-- Promoting "trusted" formats to native parsers. That's a future evolution
-if a single broker dominates LLM-parsed traffic.
+- **Auto-promoting learned formats to native parsers.** The operator will
+hand-write any native parser by looking at the collected sample rows. The
+system never writes or modifies code.
+- Self-healing or auto-evicting stale cache entries. If a broker silently
+changes their export shape under us, the cached mapping will start
+producing parse errors; the operator deletes the row manually. We do not
+invalidate cache entries automatically.
 - Multi-stage / verification LLM passes. One call per first-time format.
 
 ## Architecture
@@ -83,10 +91,11 @@ LLM returns column names, not transcribed numbers. Three benefits:
 ### Why global cache, not per-user
 
 The column structure of an IBKR Activity Statement is a property of IBKR, not
-of any individual user. The stored "dummy" contains no PII (column headers
-are public; the sample row is synthetic). So global cache is strictly better:
-faster onboarding for the second IBKR user, and the operator still gets a
-`first_seen_user_id` audit column for forensic traceability.
+of any individual user. The cache row contains no user identifier — the
+sample data row is stored verbatim but anonymously, with nothing linking it
+to the uploader. Global cache is strictly better: faster onboarding for the
+second IBKR user, and the collected samples form a small, useful corpus for
+hand-writing native parsers later.
 
 ## Data model
 
@@ -96,26 +105,27 @@ New table `csv_format_templates`:
 |---|---|---|
 | `id` | int PK | |
 | `fingerprint` | `VARCHAR(64) UNIQUE NOT NULL` | sha256 hex of normalised header tuple |
-| `headers` | JSON | List of strings — actual header row, no PII |
-| `sample_dummy` | JSON | One synthetic placeholder row for human eyeball |
+| `headers` | JSON | List of strings — actual header row from the upload |
+| `sample_row` | JSON | First data row from the originating upload, verbatim. Not linked to any user. |
 | `mapping` | JSON | `{ticker_col, qty_col, name_col, cost_col, currency_col}` |
 | `preamble_rows` | INT NOT NULL DEFAULT 0 | Non-data lines before the header row |
 | `delimiter` | CHAR(1) NOT NULL DEFAULT ',' | |
 | `broker_label` | VARCHAR(128) | LLM-identified label, e.g. "Interactive Brokers Activity Statement" |
-| `first_seen_user_id` | INT NULL, FK users(id) ON DELETE SET NULL | Audit only |
-| `first_seen_at` | DATETIME(tz) NOT NULL | |
-| `use_count` | INT NOT NULL DEFAULT 1 | Bumped on cache hit |
+| `first_seen_at` | DATETIME(tz) NOT NULL | When the format was first cached |
+| `use_count` | INT NOT NULL DEFAULT 1 | Bumped on each successful cache hit |
 | `last_used_at` | DATETIME(tz) NOT NULL | |
 | `llm_model` | VARCHAR(64) | Provenance of the initial extraction |
 | `llm_cost_usd` | FLOAT | Same |
 
 Migration: `alembic/versions/0021_csv_format_template.py` (based on `0020`).
 
-No raw CSV bytes are ever stored. `headers` and `sample_dummy` are the only
-payloads. `sample_dummy` is synthesised post-extraction by replacing column
-values with placeholder strings (`"TICKER"`, `"100"`, `"1.50"`) keyed to the
-mapping — the operator can eyeball the format shape without seeing any real
-holdings.
+The full uploaded CSV is **not** stored — only the header row plus a single
+data row (`sample_row`). No `user_id` column exists on this table; the sample
+is anonymous by construction. This is a deliberate, narrow exception to the
+otherwise-strict "no holdings persisted" invariant: we keep one row per
+format so the operator has concrete material to look at when hand-writing a
+future native parser. One anonymous row carries no portfolio context (no
+totals, no other positions) and cannot be linked back to an account.
 
 ## Components
 
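
The data-model hunk above says that `fingerprint` is the "sha256 hex of normalised header tuple" and that only the header row plus one verbatim data row are kept. A minimal Python sketch of that, under stated assumptions: the function names `header_fingerprint` and `build_template_payload` are hypothetical, the normalisation (drop the BOM, trim whitespace, lowercase) is inferred from the fingerprint-stability test description rather than confirmed by the spec, and storing `sample_row` as a plain list is one possible JSON shape.

```python
import csv
import hashlib
import io


def header_fingerprint(headers: list[str]) -> str:
    """sha256 hex of the normalised header tuple (normalisation rules assumed)."""
    normalised = tuple(h.replace("\ufeff", "").strip().lower() for h in headers)
    # The unit separator keeps ("a,b",) distinct from ("a", "b").
    return hashlib.sha256("\x1f".join(normalised).encode("utf-8")).hexdigest()


def build_template_payload(text: str, preamble_rows: int = 0, delimiter: str = ",") -> dict:
    """Return only what the table would store: fingerprint, headers, one sample row."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    headers = rows[preamble_rows]
    data_rows = rows[preamble_rows + 1:]
    if not data_rows:
        raise ValueError("upload has a header row but no data rows")
    return {
        "fingerprint": header_fingerprint(headers),
        "headers": headers,
        # First data row verbatim; the payload deliberately has no user identifier.
        "sample_row": data_rows[0],
    }
```

Nothing beyond the first data row leaves this function, which is the property the "no `user_id`, one row only" wording relies on.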
@@ -148,7 +158,6 @@ Internal helpers (not exported):
 - `_extract_mapping_via_llm(client, headers, samples) -> dict` — builds the system prompt, calls `openrouter.call_llm`, parses the JSON envelope, raises `LLMParseError` on malformed output.
 - `_validate_mapping(mapping, headers, first_row) -> None` — every named column must exist in `headers`; `qty_col`'s value on `first_row` must parse as a positive number; `cost_col` (if present) must parse as a number. Raises `LLMParseError` on failure.
 - `_apply_mapping(rows, mapping) -> ParsedPie` — iterates remaining rows, builds `ParsedPosition` instances, computes totals from `qty * avg_cost` when explicit totals aren't present.
-- `_synthesise_dummy(headers, mapping) -> dict` — produces the placeholder row for `sample_dummy`.
 
 Reuses without modification:
 
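
The `_validate_mapping` rules listed in the hunk above can be sketched as follows. This is an illustration, not the project's code: `LLMParseError` is redefined locally as a stand-in, `first_row` is assumed to be a dict keyed by header name, the number cleaning in `_parse_number` is an assumption, and the required keys (`ticker_col`, `qty_col`) are taken from the "missing required columns (ticker/qty)" row of the error table.

```python
class LLMParseError(ValueError):
    """Local stand-in for the project's exception of the same name."""


def _parse_number(raw: str) -> float:
    # Assumed cleaning: trim whitespace and drop thousands separators.
    return float(raw.strip().replace(",", ""))


def validate_mapping(mapping: dict, headers: list[str], first_row: dict) -> None:
    for key in ("ticker_col", "qty_col"):
        if not mapping.get(key):
            raise LLMParseError(f"mapping is missing required {key}")
    # Every named column must exist in the header row.
    for key, column in mapping.items():
        if column is not None and column not in headers:
            raise LLMParseError(f"{key} names unknown column {column!r}")
    try:
        qty = _parse_number(first_row[mapping["qty_col"]])
    except (KeyError, ValueError) as exc:
        raise LLMParseError("qty_col does not hold a number on the first row") from exc
    if not qty > 0:  # also rejects NaN
        raise LLMParseError("qty_col must be positive on the first row")
    cost_col = mapping.get("cost_col")
    if cost_col is not None:
        try:
            _parse_number(first_row[cost_col])
        except (KeyError, ValueError) as exc:
            raise LLMParseError("cost_col does not hold a number") from exc
```

Checking the mapping against one real row before anything is cached is what keeps a plausible-looking but wrong LLM answer out of `csv_format_templates`.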
@@ -229,27 +238,25 @@ Token cost is bounded and uniform regardless of portfolio size.
 | Mapping missing required columns (ticker/qty) | 400 same | AICall status=ok, no template stored |
 | Mapping references non-existent column | 400 same | AICall status=ok, no template stored |
 | Mapping validates but row parse fails on numerics | 400 same | template NOT stored |
-| Cache hit but row parse fails (format drifted under us) | 400 + evict the stale template in its own commit before raising | — |
+| Cache hit but row parse fails (format drifted under us) | 400 with parse error | — |
 
-The "delete stale template on parse failure" rule is the only self-healing
-behaviour: if a broker quietly changes their export shape, the next failed
-re-import evicts the old mapping in a dedicated commit (so the eviction
-survives the request-failure rollback) and the LLM gets another shot on the
-subsequent upload. Without this, a once-good template would haunt the cache
-forever. We do **not** auto-retry the LLM in the same request — too much
-hidden cost on a single user action.
+If a broker quietly changes their CSV shape such that a previously-good
+cached mapping starts producing parse failures, the user sees an error and
+the operator deletes the offending `csv_format_templates` row by hand. No
+automatic eviction, no automatic retry. The cache is a learning store, not
+a self-managing system.
 
 ## Testing
 
 `tests/test_llm_csv_parser.py`:
 
 - **Fingerprint stability** — case/whitespace/BOM variants of the same headers hash to the same fingerprint.
-- **Cache hit path** — pre-populate a `CsvFormatTemplate` row, mock `call_llm` to fail loudly, assert it is NOT called, assert positions come out correct.
-- **Cache miss path** — mock `call_llm` to return a valid mapping JSON, assert a row is inserted, assert positions come out correct, assert the synthesised `sample_dummy` contains placeholder strings only.
+- **Cache hit path** — pre-populate a `CsvFormatTemplate` row, mock `call_llm` to fail loudly, assert it is NOT called, assert positions come out correct, assert `use_count` is incremented.
+- **Cache miss path** — mock `call_llm` to return a valid mapping JSON, assert a row is inserted with the upload's actual first data row as `sample_row` and no user_id anywhere, assert positions come out correct.
 - **LLM returns malformed JSON** — raises `LLMParseError`, no template stored.
 - **LLM maps to non-existent column** — raises `LLMParseError`, no template stored.
 - **LLM maps qty to a non-numeric column** — raises `LLMParseError` on validation.
-- **Stale template self-heal** — pre-populate a template that no longer matches the file, simulate row-parse failure, assert the row is deleted and a 400 returned.
+- **Stale cached mapping on parse failure** — pre-populate a template whose mapping no longer matches the file content, assert a 400 is returned and the template is NOT deleted automatically (operator owns eviction).
 - **Integration** — POST a fabricated IBKR-shaped fixture to `/api/portfolio/parse`, assert ParsedPie round-trips, assert no second LLM call on a repeat upload.
 
 Existing `tests/test_csv_import.py` must still pass — the T212 happy path is unchanged.
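
The revised drift rule in the hunk above (error out, leave the template alone, no retry) can be illustrated with a small in-memory sketch. The names `Template`, `RowParseError` and `parse_with_cached_template` are hypothetical, a plain dict stands in for the `csv_format_templates` table, and mapping `RowParseError` to an HTTP 400 is left to the route.

```python
from dataclasses import dataclass


class RowParseError(ValueError):
    """Hypothetical error that the route would turn into a 400 response."""


@dataclass
class Template:
    mapping: dict
    use_count: int = 1


def parse_with_cached_template(cache: dict, fingerprint: str, rows: list[dict]) -> list[tuple]:
    template = cache[fingerprint]
    positions = []
    for row in rows:
        try:
            ticker = row[template.mapping["ticker_col"]]
            qty = float(row[template.mapping["qty_col"]])
        except (KeyError, ValueError) as exc:
            # Format drifted: surface the error and leave the template in place.
            # No eviction and no LLM retry; the operator deletes the row by hand.
            raise RowParseError(f"cached mapping no longer fits this file: {exc}") from exc
        positions.append((ticker, qty))
    template.use_count += 1  # bumped only after a fully successful parse
    return positions
```

Because the failure path never mutates the cache, there is no separate commit to protect from the request rollback, which is the complexity the removed self-heal paragraph had to carry.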
@@ -261,7 +268,7 @@ End-to-end manual check after deploy:
 1. Upload a T212 fixture → exists path stays unchanged (same dashboard load behaviour).
 2. Upload a fabricated IBKR CSV → first upload calls LLM, returns positions, template row created in DB.
 3. Re-upload the same IBKR CSV → second call has zero LLM cost (verify by counting `ai_calls` rows before/after), `use_count` increments to 2.
-4. Inspect `csv_format_templates` row: confirm `headers` matches the upload's headers, `sample_dummy` contains placeholder strings, no real holdings anywhere.
+4. Inspect `csv_format_templates` row: confirm `headers` matches the upload's headers, `sample_row` is the first real data row, no `user_id` column exists on the table.
 5. Upload random garbage (e.g. a screenshot renamed `.csv`) → 400 with clean error, no template stored, AICall row logged.
 6. Free-tier account attempts import → 402 (paid gating).
 
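
Step 3 of the manual check (repeat upload costs no LLM call, `use_count` goes to 2) has an automated counterpart in the "cache hit path" test described earlier. A rough sketch of that pattern with `unittest.mock`, assuming a toy `resolve_mapping` function in place of the real parser and a dict in place of the table; both are hypothetical stand-ins.

```python
from unittest.mock import Mock


def resolve_mapping(fingerprint: str, cache: dict, call_llm) -> dict:
    """Toy stand-in for the parser's lookup: the LLM runs only on a cache miss."""
    entry = cache.get(fingerprint)
    if entry is None:
        entry = {"mapping": call_llm(fingerprint), "use_count": 1}
        cache[fingerprint] = entry
    else:
        entry["use_count"] += 1
    return entry["mapping"]


def check_repeat_upload_skips_llm() -> None:
    cache: dict = {}
    llm = Mock(return_value={"ticker_col": "Symbol", "qty_col": "Quantity"})
    first = resolve_mapping("ibkr-fp", cache, llm)
    # From here on any LLM call is a bug, so make the mock fail loudly.
    llm.side_effect = AssertionError("LLM must not be called on a cache hit")
    second = resolve_mapping("ibkr-fp", cache, llm)
    assert first == second
    llm.assert_called_once_with("ibkr-fp")
    assert cache["ibkr-fp"]["use_count"] == 2
```

Making the mock raise on the second call gives a clearer failure than a call-count assertion alone, since the traceback points at the exact code path that reached the LLM.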