Reviewer was rejecting legitimate IT portfolio analyses, citing
descriptive risk language as actionable advice:
reason: "Allocation guidance throughout: 'concentrazione gestibile',
'non eliminabile', 'bassa esposizione', 'va monitorato'. Treats
portfolio construction as actionable."
These phrases describe portfolio state (manageable concentration,
non-eliminable risk, low exposure, warrants monitoring) without
directing the user to take action. They are exactly the kind of
prose a portfolio commentary surface is supposed to produce. The
reviewer's generic "no financial advice" rule is too broad here.
Add a `surface` parameter to review_read() with a per-surface rider
mechanism (_SURFACE_RIDERS). The "portfolio" rider:
- Lists DESCRIPTIVE phrasings that are EXPLICITLY permitted:
attribute naming ("high concentration", "currency exposure"),
thesis invalidation conditions, impersonal observations about a
position's sensitivity.
- Tightens the reject list to EXPLICIT calls to action: imperative
verbs aimed at the reader, "you should", "consider X-ing",
specific allocation prescriptions, price-target predictions.
portfolio_analysis.analyse() now passes surface="portfolio". All
other reviewer call sites (indicator summary, log, chat, digest)
default to surface=None and keep the generic rules.
tests/conftest.py's autouse review_read stub picks up **_kw so
adding new keyword arguments to review_read doesn't keep breaking
the locale-integration tests.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
222 lines
10 KiB
Python
222 lines
10 KiB
Python
"""Second-pass reviewer agent for AI-generated reads.
|
|
|
|
The per-group and aggregate indicator summaries are generated in JSON
|
|
mode and the publishable text comes out of a single "read" field, but a
|
|
misbehaving model can still slip chain-of-thought INSIDE the field
|
|
("Let's see…", "X? Actually Y?", multi-question parentheticals). This
|
|
module makes a small second LLM call that judges the candidate read as
|
|
clean / unclean. Cost is ~$0.0001 per check; latency ~1-2 s in the
|
|
hourly job. No user-facing latency.
|
|
|
|
The reviewer is deliberately a tiny, JSON-shaped classifier — same
|
|
JSON-mode mechanism as the generator, so the verdict can't be lost in
|
|
prose. If parsing fails or the call errors, the row is rejected
|
|
(fail-safe: the previously cached good summary stays visible).
|
|
"""
|
|
from __future__ import annotations
|
|
|
|
import json
|
|
from dataclasses import dataclass
|
|
|
|
import httpx
|
|
|
|
from app.config import get_settings
|
|
from app.logging import get_logger
|
|
from app.services.openrouter import call_llm
|
|
|
|
log = get_logger("output_review")
|
|
|
|
|
|
# The reviewer runs through OpenRouter against a small, non-thinking
|
|
# model. DeepSeek-V4-flash (our generator default) emits internal
|
|
# chain-of-thought before its JSON output even when the prompt forbids
|
|
# it, which truncates the JSON at any reasonable max_tokens cap and
|
|
# breaks the parser. Anthropic's Haiku family answers structured-output
|
|
# tasks tersely and deterministically — no chain-of-thought tax. Cost
|
|
# is ~$0.0001-$0.0003 per review depending on candidate length.
|
|
DEFAULT_REVIEWER_MODEL = "anthropic/claude-haiku-4.5"
|
|
|
|
|
|
_SYSTEM_PROMPT = """\
|
|
You are a strict editor for a financial-markets dashboard. The author
|
|
was asked to produce editorial commentary on public market data for
|
|
human readers. You receive the proposed text — it may be a one-line
|
|
read, a multi-paragraph daily log, a portfolio analysis, a chat
|
|
reply, or an email digest — and decide if it is publishable as-is.
|
|
|
|
Mark CLEAN only if the text reads like finished editorial commentary
|
|
a reader could see on a public dashboard without confusion.
|
|
|
|
Editorial framework you should KNOW about (don't flag these):
|
|
This dashboard's voice deliberately contrasts a "rational" read
|
|
(fundamentals, policy regime, valuation) with an "irrational" read
|
|
(positioning, narrative momentum, flows) and names the gap between
|
|
them. Section labels like "Rational:" / "Irrational:" (or "Bull /
|
|
Bear", or any explicit "X vs Y" contrast) are STRUCTURAL DEVICES,
|
|
not the author thinking on the page. Treat them as finished prose.
|
|
The Italian / Spanish / French / German equivalents
|
|
("Razionalmente / Irrazionalmente", "Racionalmente / Irracionalmente",
|
|
"Rationnellement / Irrationnellement", "Rational / Irrational") are
|
|
the same device translated and equally fine.
|
|
|
|
Mark UNCLEAN if the text contains ANY of:
|
|
- Chain-of-thought / scratchpad markers — the author thinking on the
|
|
page rather than presenting finished commentary. Phrases like
|
|
"Let me", "Let's see", "we need to", "actually" (correcting itself),
|
|
"wait", "hmm", "or rather", "I should". Rhetorical questions used
|
|
as structure are fine; questions that the author then answers in
|
|
front of the reader (self-questioning) are not.
|
|
- Self-questioning parentheticals: "Q1 2026? Actually Q4 2025?",
|
|
"is it X or Y?", any place where the author appears to be working
|
|
out the answer in front of the reader. The "rational vs irrational"
|
|
contrast above is NOT self-questioning — the author is presenting
|
|
both reads as parallel takes, not asking which one is correct.
|
|
- Meta-commentary about the task, output format, word limits, or
|
|
instructions — e.g. "as required by the constraints", "the prompt
|
|
asks", "let me address each".
|
|
- Partial / truncated content. Starts mid-word, mid-number, mid-clause,
|
|
ends mid-thought.
|
|
- Visible internal numbers without clear meaning ("change 1y +5.9%?"),
|
|
raw column names ("as_of 2026-01-01"), or any debug-like fragments.
|
|
- FINANCIAL ADVICE or any phrasing that recommends an action the
|
|
reader should take. This service is editorial commentary on public
|
|
data, not investment advice; the operator is not licensed to give
|
|
it. Reject any of:
|
|
* Buy/sell/hold/accumulate/trim/exit/enter/rotate language.
|
|
* Allocation guidance ("overweight", "underweight",
|
|
"X% in bonds", "increase exposure to").
|
|
* Price targets or specific level predictions ("will reach $X",
|
|
"target Y", "expect Z by year-end").
|
|
* Personalised framing ("you should", "investors should",
|
|
"consider buying", "we recommend").
|
|
DESCRIPTIVE / INTERPRETIVE language about market state is fine —
|
|
"valuations are stretched", "real yields are restrictive", "rates
|
|
and credit disagree". The test: does the text describe a STATE, or
|
|
does it suggest an ACTION? States are fine; actions are not.
|
|
- Anything else other than the finished, publishable commentary.
|
|
|
|
Return ONLY a JSON object with this exact shape:
|
|
{"clean": true | false, "reason": "<≤20 words, plain text>"}
|
|
No preamble, no markdown fences, no other fields.
|
|
"""
|
|
|
|
|
|
# Surface-specific rider appended to the system prompt when the caller
|
|
# passes a known `surface` to review_read(). Lets us relax or tighten
|
|
# rules per editorial context without rewriting the whole prompt.
|
|
_SURFACE_RIDERS = {
|
|
"portfolio": """\
|
|
|
|
# Surface: portfolio commentary
|
|
This text describes a real investor's holdings. DESCRIPTIVE risk
|
|
language is the whole point of this surface and must NOT be flagged
|
|
as financial advice. The following ARE fine:
|
|
- Naming portfolio attributes: "high concentration", "single-name
|
|
exposure", "currency risk is unhedged", "FX exposure", "elevated
|
|
risk", "stretched valuations", "concentration is manageable", "low
|
|
diversification".
|
|
- Stating what would invalidate the posture: "this view fails if
|
|
rates retrace", "the thesis depends on X holding".
|
|
- Impersonal observation about a position's behaviour or sensitivity:
|
|
"the position warrants monitoring", "carries vulnerability to a
|
|
policy shock", "is sensitive to rate moves".
|
|
|
|
ONLY flag EXPLICIT calls to action where a verb or directive is
|
|
aimed at the reader:
|
|
- Imperative verbs in the second person: "buy X", "sell Y",
|
|
"trim Z", "hedge", "rotate into".
|
|
- "You should", "investors should", "consider X-ing", "we recommend".
|
|
- Specific allocation prescriptions: "go 20% bonds", "overweight
|
|
tech", "underweight defensives".
|
|
- Price-target predictions: "will reach $X by year-end".
|
|
""",
|
|
}
|
|
|
|
|
|
@dataclass(frozen=True)
|
|
class Verdict:
|
|
clean: bool
|
|
reason: str
|
|
cost_usd: float | None # cost of the review call itself, for the ledger
|
|
|
|
|
|
async def review_read(
|
|
client: httpx.AsyncClient,
|
|
candidate: str,
|
|
surface: str | None = None,
|
|
) -> Verdict:
|
|
"""Ask the LLM whether `candidate` is a publishable read.
|
|
|
|
Returns Verdict(clean, reason, cost). Any error — provider failure,
|
|
JSON parse failure, missing field, wrong type — yields a CONSERVATIVE
|
|
verdict (clean=False) so the caller drops the candidate. The
|
|
previously cached good summary stays visible on the dashboard.
|
|
|
|
`surface` selects a surface-specific rider that's appended to the
|
|
base system prompt — see _SURFACE_RIDERS. Currently only the
|
|
"portfolio" surface uses one (descriptive risk language is the
|
|
whole point there and shouldn't be flagged as advice). Unknown
|
|
or None surfaces fall back to the generic rules."""
|
|
if not candidate or not candidate.strip():
|
|
return Verdict(clean=False, reason="empty candidate", cost_usd=0.0)
|
|
|
|
system_prompt = _SYSTEM_PROMPT
|
|
if surface and surface in _SURFACE_RIDERS:
|
|
system_prompt = system_prompt + _SURFACE_RIDERS[surface]
|
|
|
|
messages = [
|
|
{"role": "system", "content": system_prompt},
|
|
# Sent as a fenced user turn so the model can't confuse the
|
|
# candidate with instructions, even if the candidate happens to
|
|
# contain prompt-like prose.
|
|
{"role": "user", "content": f"Candidate read:\n```\n{candidate}\n```"},
|
|
]
|
|
settings = get_settings()
|
|
reviewer_model = getattr(settings, "REVIEWER_MODEL", None) or DEFAULT_REVIEWER_MODEL
|
|
try:
|
|
result = await call_llm(
|
|
client, messages,
|
|
# Pin to OpenRouter so a non-DeepSeek model like Haiku is
|
|
# actually reachable; the default provider chain would try
|
|
# DeepSeek native first and 404 on the Anthropic model name.
|
|
provider="openrouter",
|
|
model=reviewer_model,
|
|
# 300 tokens is well above the ~30-token JSON verdict.
|
|
# Haiku doesn't pad with hidden reasoning the way DeepSeek
|
|
# does, so we don't need the 800-token headroom required to
|
|
# absorb the generator's chain-of-thought.
|
|
max_tokens=300,
|
|
response_format={"type": "json_object"},
|
|
)
|
|
except Exception as e:
|
|
log.warning("review.call_failed", error=str(e)[:200])
|
|
return Verdict(clean=False, reason=f"reviewer error: {str(e)[:80]}",
|
|
cost_usd=None)
|
|
|
|
# Haiku (and several other models) occasionally wrap their JSON
|
|
# output in a markdown code fence even with response_format set —
|
|
# ```json\n{...}\n``` — so strip a single leading/trailing fence
|
|
# before parsing. We do this defensively for any model; it's a
|
|
# no-op for callers that already emit bare JSON.
|
|
raw = result.content.strip()
|
|
if raw.startswith("```"):
|
|
first_nl = raw.find("\n")
|
|
if first_nl != -1:
|
|
raw = raw[first_nl + 1:]
|
|
if raw.rstrip().endswith("```"):
|
|
raw = raw.rstrip()[:-3].rstrip()
|
|
raw = raw.strip()
|
|
|
|
try:
|
|
parsed = json.loads(raw)
|
|
except json.JSONDecodeError:
|
|
log.warning("review.parse_failed", preview=result.content[:200])
|
|
return Verdict(clean=False, reason="reviewer returned non-JSON",
|
|
cost_usd=result.cost_usd)
|
|
|
|
clean = parsed.get("clean")
|
|
reason = parsed.get("reason") or ""
|
|
if not isinstance(clean, bool):
|
|
return Verdict(clean=False, reason="reviewer omitted bool 'clean'",
|
|
cost_usd=result.cost_usd)
|
|
return Verdict(clean=clean, reason=str(reason)[:200], cost_usd=result.cost_usd)
|