news: auto-tag headlines + market-aware cadence + filter UI

- Move news_job from hourly to 3x/hour (cron 10,30,50), with a
  CadencePolicy gate that throttles to active hours (07-21 UTC weekdays
  at 20 min), off-hours (3 h), weekends (6 h). Keeps the daytime feed
  fresh without spamming RSS sources overnight.
- Tag each headline on ingestion via DeepSeek (BATCH_SIZE=25,
  max_tokens=4000, json.JSONDecoder().raw_decode + per-row regex
  recovery for resilient parsing). Vocabulary: 16 tags including new
  EU / USA / AI / Conflict. NULL tags are picked up automatically on the
  next news_job run, so back-tagging is implicit rather than a separate
  migration step.
- Tag UI: pill bar above the feed with off → include → exclude cycle on
  click; shift-click jumps straight to exclude. State persists in
  localStorage and is injected into /api/news requests via
  htmx:configRequest. Per-row chips sit to the right of the headline
  (new 5-column grid: age | source | title | tags | UTC) so vertical
  density stays high.
- Strategic log header bug: model was hallucinating
  "(Updated 21:30 UTC)" in future tense. Bumped PROMPT_VERSION 6→7,
  added explicit ban on time-of-day clauses, and supply the actual
  current UTC time in the user prompt so the model has no need to
  invent one.

Migration 0012 adds headlines.tags (JSON, nullable). Tests cover
vocabulary integrity, validation/normalisation, and the JSON-recovery
parser (17 tests).
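
The "raw_decode + per-row regex recovery" parsing mentioned in the message can be illustrated with a minimal sketch. The real parser lives in app/services/news_tagging and is not shown in this diff, so everything here is an assumption: the function name parse_tag_response, the row shape {"id": ..., "tags": [...]}, and the regex are hypothetical. The idea is to try a strict decode of the first JSON array in the reply, and if that fails (for example because the reply was cut off at max_tokens), salvage whichever rows are individually complete.

```python
import json
import re

# Hypothetical row shape, assumed for illustration: {"id": 12, "tags": ["AI", "USA"]}
_ROW_RE = re.compile(
    r'\{\s*"id"\s*:\s*(\d+)\s*,\s*"tags"\s*:\s*(\[[^\]]*\])\s*\}'
)


def parse_tag_response(text: str) -> dict[int, list[str]]:
    """Return {headline_id: tags} from a model reply that may be messy."""
    # Pass 1: decode the first JSON array, ignoring any prose around it.
    start = text.find("[")
    if start != -1:
        try:
            rows, _end = json.JSONDecoder().raw_decode(text, start)
            if isinstance(rows, list):
                return {
                    int(row["id"]): row["tags"]
                    for row in rows
                    if isinstance(row, dict)
                    and "id" in row
                    and isinstance(row.get("tags"), list)
                }
        except (ValueError, TypeError):
            pass  # malformed or truncated: fall through to per-row recovery

    # Pass 2: keep every row that is complete on its own.
    recovered: dict[int, list[str]] = {}
    for match in _ROW_RE.finditer(text):
        try:
            recovered[int(match.group(1))] = json.loads(match.group(2))
        except ValueError:
            continue  # skip a row whose tag list is itself malformed
    return recovered
```

With this shape, a truncated reply loses only its incomplete final row; the untagged headline keeps tags = NULL and would be retried on the next run, which matches the implicit back-tagging described above.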
2026-05-21 23:25:03 +01:00
commit 2013bfa8cc
parent 6e7f57c6b2
15 changed files with 745 additions and 25 deletions
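
For readers without app/services/cadence.py to hand, here is a rough sketch of how a gate like NEWS_POLICY.should_run might behave. Only the (should_run, reason) return shape and the 07-21 UTC / 3 h / 6 h figures come from this commit; the class body, field names, the optional now parameter, and the choice to treat hour 21 as already off-hours are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CadencePolicy:
    """Hypothetical market-aware throttle, not the project's actual class."""

    active_start_hour: int = 7   # inclusive, UTC
    active_end_hour: int = 21    # exclusive, UTC (assumed boundary)
    off_hours_gap: timedelta = timedelta(hours=3)
    weekend_gap: timedelta = timedelta(hours=6)

    def should_run(
        self, last_success: datetime | None, now: datetime | None = None
    ) -> tuple[bool, str]:
        now = now or datetime.now(timezone.utc)
        if last_success is None:
            return True, "no previous success"
        if last_success.tzinfo is None:
            # Assume naive DB timestamps are stored in UTC.
            last_success = last_success.replace(tzinfo=timezone.utc)

        if now.weekday() >= 5:  # Saturday or Sunday
            gap, window = self.weekend_gap, "weekend"
        elif self.active_start_hour <= now.hour < self.active_end_hour:
            # The cron spacing itself provides the 20-minute gap.
            return True, "active window"
        else:
            gap, window = self.off_hours_gap, "off-hours"

        elapsed = now - last_success
        if elapsed >= gap:
            return True, f"{window}: gap elapsed"
        return False, f"{window}: {elapsed} since last success, need {gap}"
```

Passing now explicitly keeps the policy a pure function of its inputs, which makes the window boundaries easy to unit-test without freezing the clock.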
--- a/app/jobs/news_job.py
+++ b/app/jobs/news_job.py
@@ -1,21 +1,36 @@
-"""Hourly news ingestion. Reads enabled feeds from the DB (not TOML — DB has
-the authoritative enabled/failure state). Per-ticker Yahoo news pulled for
-each symbol in the default portfolio group ('pie')."""
+"""News ingestion + AI tagging.
+
+Cron fires every 20 minutes. NEWS_POLICY gates the actual work:
+- Active window (07-21 UTC weekdays): always run (20-min gap)
+- Off-hours weekday: skip until 3h since last success
+- Weekend: skip until 6h since last success
+
+Each run does (a) fresh fetch of all enabled feeds + per-ticker Yahoo
+news, (b) bulk INSERT IGNORE into headlines, (c) batch-tags any rows
+still NULL via news_tagging. Untagged rows survive run failures and are
+retried automatically next cycle.
+"""
 from __future__ import annotations

 import asyncio

 import httpx
-from sqlalchemy import desc, select
+from sqlalchemy import desc, func, select, update
 from sqlalchemy.dialects.mysql import insert as mysql_insert

 from app.db import utcnow
 from app.jobs._helpers import job_lifecycle, log
-from app.models import Feed, Headline, InstrumentMap, TickerUniverse
+from app.models import Feed, Headline, InstrumentMap, JobRun, TickerUniverse
+from app.services.cadence import NEWS_POLICY
 from app.services.news import dedupe, fetch_feed, fetch_yahoo_news
+from app.services.news_tagging import ToTag, tag_titles


 AUTO_DISABLE_AT = 5
+# Cap on how many untagged headlines a single run will tag. Stops a
+# backlog from blowing the cost ledger if the tagger has been failing
+# for a while.
+TAG_PER_RUN_LIMIT = 200


 async def _process_feed(client: httpx.AsyncClient, feed: Feed) -> tuple[Feed, list]:
@@ -38,6 +53,21 @@ async def run() -> None:
        if run.status == "skipped":
            return

+        # Market-aware cadence: skip this fire if too soon (off-hours /
+        # weekend). Active window still runs every 20 min.
+        last_success = (await session.execute(
+            select(func.max(JobRun.finished_at)).where(
+                JobRun.name == "news_job",
+                JobRun.status == "success",
+            )
+        )).scalar()
+        should_run, reason = NEWS_POLICY.should_run(last_success)
+        if not should_run:
+            log.info("news_job.cadence_skip", reason=reason)
+            run.status = "skipped"
+            run.error = reason
+            return
+
        feeds = (
            await session.execute(select(Feed).where(Feed.enabled == True))
        ).scalars().all()
@@ -91,8 +121,35 @@ async def run() -> None:
            await session.execute(stmt)

        await session.commit()
+
+        # Tag any headlines still NULL — fresh inserts from this run plus
+        # any that failed to tag on previous runs. Bounded by
+        # TAG_PER_RUN_LIMIT so a long outage doesn't blow the cost ledger.
+        untagged_rows = (await session.execute(
+            select(Headline.id, Headline.title)
+            .where(Headline.tags.is_(None))
+            .order_by(desc(Headline.published_at))
+            .limit(TAG_PER_RUN_LIMIT)
+        )).all()
+        tagged_count = 0
+        if untagged_rows:
+            items = [ToTag(id=int(r.id), title=r.title) for r in untagged_rows]
+            tags_by_id = await tag_titles(items)
+            for hid, tags in tags_by_id.items():
+                await session.execute(
+                    update(Headline)
+                    .where(Headline.id == hid)
+                    .values(tags=tags)
+                )
+            tagged_count = len(tags_by_id)
+            await session.commit()
+
        run.items_written = len(headlines)
-        log.info("news_job.done", fetched=len(all_headlines), kept=len(headlines))
+        log.info(
+            "news_job.done",
+            fetched=len(all_headlines), kept=len(headlines),
+            untagged_seen=len(untagged_rows), tagged=tagged_count,
+        )


 if __name__ == "__main__":