read.markets

Author	SHA1	Message	Date
Giorgio Gilestro	de3a9bfa66	review: teach reviewer about the rational-vs-irrational structural device Live logs (analyze.lang_resolved final=it → reviewer_rejected: "Self-questioning and working-through language ('Razionalmente... Irrazionalmente')") showed Haiku confusing the dashboard's mandated "Rational vs Irrational" contrast framework for scratchpad self-questioning. The portfolio-analysis system prompt explicitly requires every paragraph to contrast a rational reading (fundamentals, policy, valuation) with an irrational one (positioning, narrative momentum) — that's structural prose, not the author thinking on the page. Add an explicit "this is a structural device" carve-out to the reviewer prompt, naming the framework and its IT / ES / FR / DE translations so the rejection doesn't reappear when the prompt output lands in non-English. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 16:20:00 +02:00
Giorgio Gilestro	f9534f7ad6	review: gate strategic-log, portfolio, chat, and digest on reviewer Extends the reviewer agent — previously only protecting indicator summaries — to every AI-generated surface that reaches a user. The reviewer's prompt already rejects scratchpad, truncation, meta-commentary, and (since `a6e476b`) financial advice; wiring it in turns those rules from prompt-level "asks" into structural gates. Four call sites updated: - ai_log_job.run() : after each tone/analysis variant is generated, pass through review_read. On reject, log the reason and skip the StrategicLog insert; the API's existing "latest StrategicLog" lookup falls back to the previous clean log. - services/portfolio_analysis.analyse() : on reject, raise a clean RuntimeError that the /api/analyze router already maps to HTTP 502 with a retry-able message. Portfolio analysis isn't cached server- side, so the user retries; the reviewer's verdict reason goes into the AICall ledger as the leaked-status row's error column. - routers/chat.chat() : on reject, instead of returning the raw assistant content we return a short refusal explaining the limit and inviting a rephrase. Adds ~1-2 s of latency per turn (one extra LLM call to Haiku) — the only user-facing latency tax. - jobs/email_digest_job._generate_variants() : on reject, the variant is dropped for the cycle. Recipients on the rejected tone get no digest email this run, which is better than delivering inbox copy that drifts into advice (emails are unrecallable once sent). In every case the AICall ledger row records the reviewer cost so month_spend stays accurate across all paths. The reviewer system prompt is slightly generalised to cover both the indicator-summary case and the longer-form log/digest/chat case: - removes "short interpretive read" framing - softens the "any question" rule so genuine rhetorical structure in a long-form log doesn't trigger a reject tests/conftest.py grows an autouse fixture that stubs review_read to clean=True in every consumer module. Tests that mock the generator shouldn't have to also mock the safety gate behind it; tests that specifically want the reject branch can override with their own monkeypatch. test_output_review.py is unaffected — it imports review_read directly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 14:40:04 +02:00
Giorgio Gilestro	a6e476b851	review: reject financial advice in indicator-summary reads Adds a new UNCLEAN criterion to the reviewer agent's system prompt: direct recommendation language (buy/sell/hold/accumulate/trim/rotate), allocation guidance (overweight/underweight, "X% in bonds"), price targets, and personalised framing ("you should", "investors should") all trigger a reject. The operator is not licensed to give investment advice; this is editorial commentary on public data. The generator's system prompt already forbids buy/sell language, but a prompt-only constraint is not an enforcement layer. The reviewer agent — already in the pipeline for chain-of-thought / truncation / meta-commentary — is the right place to enforce the regulatory boundary structurally: rows that drift into advice get dropped, and the API falls back to the previous compliant row. Descriptive / interpretive language about market state remains explicitly allowed ("valuations are stretched", "real yields are restrictive"). The criterion is state vs action: states publish, actions don't. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 14:26:37 +02:00
Giorgio Gilestro	385c5fdc60	review: strip markdown code-fences from JSON verdicts Haiku 4.5 occasionally wraps its JSON response in a markdown code fence even with response_format={"type":"json_object"} enforced: ```json {"clean": true, "reason": "polished read"} ``` Live testing the new reviewer caught this — every verdict was being dropped as "reviewer returned non-JSON". Strip a single leading trailing fence before json.loads. Defensive for any model that does the same (Claude variants commonly fence JSON even when told not to). Adds a unit test covering fenced output.	2026-05-29 13:27:37 +02:00
Giorgio Gilestro	788563a81f	ai: route reviewer through OpenRouter + Claude Haiku 4.5 The DeepSeek-V4-flash reviewer was unreliable in production: it pads its JSON verdicts with internal chain-of-thought even when the prompt forbids it, so the verdict gets truncated at any reasonable max_tokens cap and the parser drops it as malformed (a false-negative verdict that would purge clean rows). A live run on 50 rows reproduced the failure on 8 of 12 rejections, even at 800 tokens. Fix: pin the reviewer call to OpenRouter with anthropic/claude-haiku-4.5. Haiku answers structured-output classification tersely (no scratchpad preamble), which means a 300-token cap is comfortably above the ~30-token JSON verdict. Cost is roughly the same (~$0.0001-$0.0003 per review) and the latency tax is smaller. To enable the pinned-provider call without disrupting other callers, call_llm grows an optional `provider` parameter: when set, only that provider is used (no fallback chain). All existing call sites default to provider=None and keep the chain behaviour. REVIEWER_MODEL is read from settings via getattr-with-fallback so an env override can swap models without code changes — useful if we want to A/B test against e.g. gemini-2.5-flash later. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 13:21:26 +02:00
Giorgio Gilestro	8b9d3c9c3e	ai: bump reviewer max_tokens 300 → 800 Live re-check on 50 recent IndicatorSummary rows after the previous 120 → 300 bump still produced 4 'reviewer returned non-JSON' verdicts out of 12 rejections. DeepSeek-V4-flash sometimes prefixes its JSON output with a short stretch of thinking even though response_format is enforced, which truncates the JSON at the back end of the 300-token cap. 800 tokens is comfortably above any realistic verdict + preamble at ~$0.00022/call (DeepSeek output rates). Negligible cost given the hourly call volume. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 13:16:57 +02:00
Giorgio Gilestro	0550063316	ai: bump reviewer max_tokens 120 → 300 A live sanity-check on 50 recent IndicatorSummary rows found 6 of 10 reviewer rejections were the reviewer hitting its own max_tokens cap mid-verdict ('{"clean": false, "reason": "Truncated sent…'). The parser then dropped the candidate as malformed JSON, producing a false-negative verdict that would have purged legitimately clean rows. 300 tokens is well above the ~30-token verdict the prompt asks for; the extra headroom removes the artefact at ~$0.00015 per call. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 13:15:42 +02:00
Giorgio Gilestro	45fa31bb2b	ai: structured-output + reviewer agent for indicator summaries Replaces the regex-based clean_summary / looks_like_leakage pipeline that produced the 2026-05-29 valuation-read leak. Two layers of defence in depth: 1. JSON-mode generation. The per-group and aggregate summary system prompts now require the model to emit a single object {"read": "..."}; response_format={"type":"json_object"} is passed through to the provider so the API enforces well-formed JSON. Prose outside the field is physically impossible. The "read" field is the only schema slot, so the model has nowhere to spill scratchpad into the envelope. 2. Reviewer agent. services/output_review.review_read() makes a second small LLM call that judges whether the candidate "read" string is publishable. It catches the residual failure mode — scratchpad INSIDE the field ("Let's see…", multi-question parentheticals, meta-commentary) — and returns a JSON verdict {"clean": bool, "reason": str}. Any failure (provider error, parse error, missing field) returns clean=false (fail-safe). Cost ~$0.0001/check; latency ~1-2 s in the hourly job, no user-facing latency. The old regex scaffolding (_LEAK_PATTERNS, clean_summary, looks_like_leakage, _TRAILING_QUOTE) is deleted entirely. It produced false positives (chopped legitimate "The indicators are…" leaders) and false negatives (never matched the chain-of-thought patterns the model actually emits). The reviewer agent is strictly better on both. On reviewer/parse rejection: don't persist a new IndicatorSummary; the API's existing fallback to the previous good row continues to serve the panel. Failures are logged as ind_summary.json_invalid / ind_summary.reviewer_rejected so we can measure the rejection rate. Reviewer cost is added to the row's recorded cost_usd so the monthly budget cap covers the full pipeline. Adds tests/test_output_review.py: 11 cases covering _extract_read (JSON envelope handling — invalid JSON, missing field, wrong types, empty values) and review_read (clean / unclean verdicts plus three fail-safe paths for malformed reviewer responses). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 13:10:52 +02:00

8 commits