cited / about

Engineering discipline made visible

A daily AI-engineering news briefing where every summary has a citation, and topics that find no credible source are refused rather than hallucinated. The point isn’t another AI digest — it’s the engineering discipline visible in the code.

how a briefing is made

Every day a Temporal workflow runs five focused web-search calls — one per topic — in priority-rank order. The first topic gets the field; each subsequent topic is told which URLs are already taken so it routes around them. Each call is forced through a JSON-schema-shaped tool so the model can’t return free-form prose: it must produce one short summary sentence with one citation URL per item, or an explicit refusal.
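A minimal sketch of what that schema constraint amounts to in practice; the type and function names here are illustrative, not the project's actual code:

```typescript
// Hypothetical shape of the tool's JSON-schema output: the model must
// either return items (one sentence plus one URL each) or refuse outright.
type BriefingItem = { summary: string; url: string };
type TopicResult =
  | { refused: false; items: BriefingItem[] }
  | { refused: true; items: [] };

// Structural check applied to whatever the model returns, so free-form
// prose can never slip through into the pipeline.
function parseTopicResult(raw: unknown): TopicResult {
  const r = raw as { refused?: boolean; items?: unknown[] };
  if (r?.refused === true) return { refused: true, items: [] };
  if (!Array.isArray(r?.items)) throw new Error("schema violation: items missing");
  const items = r.items.map((it) => {
    const { summary, url } = it as BriefingItem;
    if (typeof summary !== "string" || summary.length === 0) {
      throw new Error("schema violation: summary must be a short sentence");
    }
    new URL(url); // throws if url is not a valid absolute URL
    return { summary, url };
  });
  return { refused: false, items };
}
```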

Two providers run on every topic in parallel — Anthropic’s Claude (canonical) and OpenAI’s GPT (comparative). Only the canonical set is published. The comparative set persists for the methodology post that comes after several weeks of running.
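The fan-out could look roughly like this; the provider calls are stubbed and every name here is an assumption:

```typescript
// Sketch of the dual-provider fan-out. searchTopic is a stub standing in
// for the real Anthropic / OpenAI web-search calls.
type ProviderResult = {
  provider: "anthropic" | "openai";
  items: { summary: string; url: string }[];
};

async function searchTopic(
  provider: "anthropic" | "openai",
  topic: string,
  takenUrls: string[],
): Promise<ProviderResult> {
  // The real implementation would invoke the provider's web-search tool here,
  // passing takenUrls so the topic routes around already-claimed sources.
  return { provider, items: [] };
}

async function runTopic(topic: string, takenUrls: string[]) {
  // Both providers run in parallel; only the canonical result is published,
  // the comparative one is persisted for the later methodology post.
  const [canonical, comparative] = await Promise.all([
    searchTopic("anthropic", topic, takenUrls),
    searchTopic("openai", topic, takenUrls),
  ]);
  return { publish: canonical, persistOnly: comparative };
}
```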

why each pattern is here

  • Eval gate before publish. Five blocking checks run between the LLM call and the database write — citation completeness, dead-link probe, recency, URL uniqueness, source diversity. A run that fails any of them gets runs.status='failed_eval' and no briefing row. The worst version of this project is one that demonstrates, on the author’s own domain, the very failure mode the companion article warns about. That cannot ship.
  • Refusal as a first-class outcome. If the LLM finds no credible primary source for a topic in the window, it returns zero items with refused: true. The slot is empty on the page — graceful degrade — and the run is logged as refused, not failed. Five real items beats five invented ones.
  • Three idempotency layers. A re-trigger of the same date is a no-op: Temporal’s REJECT_DUPLICATE policy on the workflow ID catches the parent; runs.idempotency_key UNIQUE catches individual children; briefings (topic, date, provider) UNIQUE catches the publish step. Cron retries can never double-publish.
  • Cost tracking surfaced in the UI. Every LLM call writes to cost_logs; the daily total appears in the footer of the briefing. A weekly budget kill-switch aborts new runs before they exceed it. Cost is a feature, not a hidden ops concern.
  • Pre-flight evals run in CI. A small fixture set in evals/fixtures/ exercises every blocking check against known-good and known-bad inputs. npm test guards the check logic itself before any deploy.
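A few of the blocking checks can be sketched as pure functions. The dead-link probe is omitted because it needs network access, and the field names and thresholds below are assumptions, not the project's actual values:

```typescript
// Illustrative versions of four of the five blocking checks.
type Item = { summary: string; url: string; publishedAt: string };

const checks: Record<string, (items: Item[]) => boolean> = {
  // Citation completeness: every item carries a parseable absolute URL.
  citations: (items) =>
    items.every((i) => { try { new URL(i.url); return true; } catch { return false; } }),
  // URL uniqueness: no two items cite the same page.
  uniqueness: (items) => new Set(items.map((i) => i.url)).size === items.length,
  // Source diversity: items span more than one hostname (assumed threshold).
  diversity: (items) =>
    items.length <= 1 || new Set(items.map((i) => new URL(i.url).hostname)).size > 1,
  // Recency: everything published within the last 7 days (assumed window).
  recency: (items) =>
    items.every((i) => Date.now() - Date.parse(i.publishedAt) < 7 * 86_400_000),
};

// A run passes only if every blocking check passes; otherwise it would be
// marked runs.status='failed_eval' and no briefing row would be written.
function evalGate(items: Item[]): { pass: boolean; failed: string[] } {
  const failed = Object.entries(checks)
    .filter(([, fn]) => !fn(items))
    .map(([name]) => name);
  return { pass: failed.length === 0, failed };
}
```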
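The three idempotency layers all reduce to the same idea: a deterministic key plus a uniqueness check. A toy sketch, with an in-memory Set standing in for Temporal's workflow-ID registry and the two UNIQUE indexes; the key formats are assumptions:

```typescript
// Deterministic keys for each layer (formats are illustrative).
const workflowId = (date: string) => `briefing-${date}`;           // Temporal REJECT_DUPLICATE
const runKey = (date: string, topic: string, provider: string) =>
  `${date}:${topic}:${provider}`;                                  // runs.idempotency_key UNIQUE
const publishKey = (topic: string, date: string, provider: string) =>
  `${topic}|${date}|${provider}`;                                  // briefings (topic, date, provider) UNIQUE

// Stand-in for a UNIQUE index: insertion succeeds exactly once per key,
// so a cron retry with the same inputs is a no-op at every layer.
function uniqueInsert(index: Set<string>, key: string): boolean {
  if (index.has(key)) return false; // duplicate: rejected, nothing double-published
  index.add(key);
  return true;
}
```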

the methodology post

After four to six weeks of daily cited runs, the comparative dataset (Anthropic vs. OpenAI on the same prompts and the same time windows) becomes the source data for a long-form post about what each web-search API actually gets right and wrong. That post lives at https://www.jeffreyquan.com when ready.