read.markets/app/services/output_review.py
Giorgio Gilestro 0060166d32 review: per-surface rider, loosen for portfolio commentary
Reviewer was rejecting legitimate IT portfolio analyses, citing
descriptive risk language as actionable advice:

  reason: "Allocation guidance throughout: 'concentrazione gestibile',
  'non eliminabile', 'bassa esposizione', 'va monitorato'. Treats
  portfolio construction as actionable."

These phrases describe portfolio state (manageable concentration,
non-eliminable risk, low exposure, warrants monitoring) without
directing the user to take action. They are exactly the kind of
prose a portfolio commentary surface is supposed to produce. The
reviewer's generic "no financial advice" rule is too broad here.

Add a `surface` parameter to review_read() with a per-surface rider
mechanism (_SURFACE_RIDERS). The "portfolio" rider:

- Lists DESCRIPTIVE phrasings that are EXPLICITLY permitted:
  attribute naming ("high concentration", "currency exposure"),
  thesis invalidation conditions, impersonal observations about a
  position's sensitivity.
- Tightens the reject list to EXPLICIT calls to action: imperative
  verbs aimed at the reader, "you should", "consider X-ing",
  specific allocation prescriptions, price-target predictions.

portfolio_analysis.analyse() now passes surface="portfolio". All
other reviewer call sites (indicator summary, log, chat, digest)
default to surface=None and keep the generic rules.

tests/conftest.py's autouse review_read stub picks up **_kw so
adding new keyword arguments to review_read doesn't keep breaking
the locale-integration tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 16:44:27 +02:00

222 lines
10 KiB
Python

"""Second-pass reviewer agent for AI-generated reads.
The per-group and aggregate indicator summaries are generated in JSON
mode and the publishable text comes out of a single "read" field, but a
misbehaving model can still slip chain-of-thought INSIDE the field
("Let's see…", "X? Actually Y?", multi-question parentheticals). This
module makes a small second LLM call that judges the candidate read as
clean / unclean. Cost is ~$0.0001 per check; latency ~1-2 s in the
hourly job. No user-facing latency.
The reviewer is deliberately a tiny, JSON-shaped classifier — same
JSON-mode mechanism as the generator, so the verdict can't be lost in
prose. If parsing fails or the call errors, the row is rejected
(fail-safe: the previously cached good summary stays visible).
"""
from __future__ import annotations
import json
from dataclasses import dataclass
import httpx
from app.config import get_settings
from app.logging import get_logger
from app.services.openrouter import call_llm
log = get_logger("output_review")
# The reviewer runs through OpenRouter against a small, non-thinking
# model. DeepSeek-V4-flash (our generator default) emits internal
# chain-of-thought before its JSON output even when the prompt forbids
# it, which truncates the JSON at any reasonable max_tokens cap and
# breaks the parser. Anthropic's Haiku family answers structured-output
# tasks tersely and deterministically — no chain-of-thought tax. Cost
# is ~$0.0001-$0.0003 per review depending on candidate length.
DEFAULT_REVIEWER_MODEL = "anthropic/claude-haiku-4.5"
_SYSTEM_PROMPT = """\
You are a strict editor for a financial-markets dashboard. The author
was asked to produce editorial commentary on public market data for
human readers. You receive the proposed text — it may be a one-line
read, a multi-paragraph daily log, a portfolio analysis, a chat
reply, or an email digest — and decide if it is publishable as-is.
Mark CLEAN only if the text reads like finished editorial commentary
a reader could see on a public dashboard without confusion.
Editorial framework you should KNOW about (don't flag these):
This dashboard's voice deliberately contrasts a "rational" read
(fundamentals, policy regime, valuation) with an "irrational" read
(positioning, narrative momentum, flows) and names the gap between
them. Section labels like "Rational:" / "Irrational:" (or "Bull /
Bear", or any explicit "X vs Y" contrast) are STRUCTURAL DEVICES,
not the author thinking on the page. Treat them as finished prose.
The Italian / Spanish / French / German equivalents
("Razionalmente / Irrazionalmente", "Racionalmente / Irracionalmente",
"Rationnellement / Irrationnellement", "Rational / Irrational") are
the same device translated and equally fine.
Mark UNCLEAN if the text contains ANY of:
- Chain-of-thought / scratchpad markers — the author thinking on the
page rather than presenting finished commentary. Phrases like
"Let me", "Let's see", "we need to", "actually" (correcting itself),
"wait", "hmm", "or rather", "I should". Rhetorical questions used
as structure are fine; questions that the author then answers in
front of the reader (self-questioning) are not.
- Self-questioning parentheticals: "Q1 2026? Actually Q4 2025?",
"is it X or Y?", any place where the author appears to be working
out the answer in front of the reader. The "rational vs irrational"
contrast above is NOT self-questioning — the author is presenting
both reads as parallel takes, not asking which one is correct.
- Meta-commentary about the task, output format, word limits, or
instructions — e.g. "as required by the constraints", "the prompt
asks", "let me address each".
- Partial / truncated content. Starts mid-word, mid-number, mid-clause,
ends mid-thought.
- Visible internal numbers without clear meaning ("change 1y +5.9%?"),
raw column names ("as_of 2026-01-01"), or any debug-like fragments.
- FINANCIAL ADVICE or any phrasing that recommends an action the
reader should take. This service is editorial commentary on public
data, not investment advice; the operator is not licensed to give
it. Reject any of:
* Buy/sell/hold/accumulate/trim/exit/enter/rotate language.
* Allocation guidance ("overweight", "underweight",
"X% in bonds", "increase exposure to").
* Price targets or specific level predictions ("will reach $X",
"target Y", "expect Z by year-end").
* Personalised framing ("you should", "investors should",
"consider buying", "we recommend").
DESCRIPTIVE / INTERPRETIVE language about market state is fine —
"valuations are stretched", "real yields are restrictive", "rates
and credit disagree". The test: does the text describe a STATE, or
does it suggest an ACTION? States are fine; actions are not.
- Anything else other than the finished, publishable commentary.
Return ONLY a JSON object with this exact shape:
{"clean": true | false, "reason": "<≤20 words, plain text>"}
No preamble, no markdown fences, no other fields.
"""
# Surface-specific rider appended to the system prompt when the caller
# passes a known `surface` to review_read(). Lets us relax or tighten
# rules per editorial context without rewriting the whole prompt.
_SURFACE_RIDERS = {
"portfolio": """\
# Surface: portfolio commentary
This text describes a real investor's holdings. DESCRIPTIVE risk
language is the whole point of this surface and must NOT be flagged
as financial advice. The following ARE fine:
- Naming portfolio attributes: "high concentration", "single-name
exposure", "currency risk is unhedged", "FX exposure", "elevated
risk", "stretched valuations", "concentration is manageable", "low
diversification".
- Stating what would invalidate the posture: "this view fails if
rates retrace", "the thesis depends on X holding".
- Impersonal observation about a position's behaviour or sensitivity:
"the position warrants monitoring", "carries vulnerability to a
policy shock", "is sensitive to rate moves".
ONLY flag EXPLICIT calls to action where a verb or directive is
aimed at the reader:
- Imperative verbs in the second person: "buy X", "sell Y",
"trim Z", "hedge", "rotate into".
- "You should", "investors should", "consider X-ing", "we recommend".
- Specific allocation prescriptions: "go 20% bonds", "overweight
tech", "underweight defensives".
- Price-target predictions: "will reach $X by year-end".
""",
}
@dataclass(frozen=True)
class Verdict:
clean: bool
reason: str
cost_usd: float | None # cost of the review call itself, for the ledger
async def review_read(
client: httpx.AsyncClient,
candidate: str,
surface: str | None = None,
) -> Verdict:
"""Ask the LLM whether `candidate` is a publishable read.
Returns Verdict(clean, reason, cost). Any error — provider failure,
JSON parse failure, missing field, wrong type — yields a CONSERVATIVE
verdict (clean=False) so the caller drops the candidate. The
previously cached good summary stays visible on the dashboard.
`surface` selects a surface-specific rider that's appended to the
base system prompt — see _SURFACE_RIDERS. Currently only the
"portfolio" surface uses one (descriptive risk language is the
whole point there and shouldn't be flagged as advice). Unknown
or None surfaces fall back to the generic rules."""
if not candidate or not candidate.strip():
return Verdict(clean=False, reason="empty candidate", cost_usd=0.0)
system_prompt = _SYSTEM_PROMPT
if surface and surface in _SURFACE_RIDERS:
system_prompt = system_prompt + _SURFACE_RIDERS[surface]
messages = [
{"role": "system", "content": system_prompt},
# Sent as a fenced user turn so the model can't confuse the
# candidate with instructions, even if the candidate happens to
# contain prompt-like prose.
{"role": "user", "content": f"Candidate read:\n```\n{candidate}\n```"},
]
settings = get_settings()
reviewer_model = getattr(settings, "REVIEWER_MODEL", None) or DEFAULT_REVIEWER_MODEL
try:
result = await call_llm(
client, messages,
# Pin to OpenRouter so a non-DeepSeek model like Haiku is
# actually reachable; the default provider chain would try
# DeepSeek native first and 404 on the Anthropic model name.
provider="openrouter",
model=reviewer_model,
# 300 tokens is well above the ~30-token JSON verdict.
# Haiku doesn't pad with hidden reasoning the way DeepSeek
# does, so we don't need the 800-token headroom required to
# absorb the generator's chain-of-thought.
max_tokens=300,
response_format={"type": "json_object"},
)
except Exception as e:
log.warning("review.call_failed", error=str(e)[:200])
return Verdict(clean=False, reason=f"reviewer error: {str(e)[:80]}",
cost_usd=None)
# Haiku (and several other models) occasionally wrap their JSON
# output in a markdown code fence even with response_format set —
# ```json\n{...}\n``` — so strip a single leading/trailing fence
# before parsing. We do this defensively for any model; it's a
# no-op for callers that already emit bare JSON.
raw = result.content.strip()
if raw.startswith("```"):
first_nl = raw.find("\n")
if first_nl != -1:
raw = raw[first_nl + 1:]
if raw.rstrip().endswith("```"):
raw = raw.rstrip()[:-3].rstrip()
raw = raw.strip()
try:
parsed = json.loads(raw)
except json.JSONDecodeError:
log.warning("review.parse_failed", preview=result.content[:200])
return Verdict(clean=False, reason="reviewer returned non-JSON",
cost_usd=result.cost_usd)
clean = parsed.get("clean")
reason = parsed.get("reason") or ""
if not isinstance(clean, bool):
return Verdict(clean=False, reason="reviewer omitted bool 'clean'",
cost_usd=result.cost_usd)
return Verdict(clean=clean, reason=str(reason)[:200], cost_usd=result.cost_usd)