Breakout Brands Timeline

How the autonomous pipeline produced this

Brand selection on this site was performed by an autonomous pipeline running in a single Claude Code session, not by hand-curation. The pipeline ran for ~85 minutes, made 396 model calls, and cost $98.21. This page documents what it did and how, so the methodology is auditable.

The architecture turns out to closely match DRIL (Distributed Research with Intelligent LLMs), the protocol proposed by Afonso, Galiani, Gálvez & Sosa in NBER w35188 (May 2026) for using LLM agents to produce auditable research datasets. The pipeline here was built before reading their paper; the comparison section below maps our terms to theirs and is honest about the gaps.


The work model

Everything is a task. Tasks live in a JSON queue. A driver pops the next runnable task, dispatches it to a worker, writes results back, and repeats until the queue is empty or a budget cap is hit.
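The driver loop described above can be sketched in a few lines. This is a Python illustration, not the pipeline's own bash/jq code. The names `run_queue` and `fake_worker`, the `{"ok", "cost_usd"}` result shape, and the reading of "runnable" as `state == "pending"` are assumptions made for the example. Retries are left out.

```python
from typing import Callable


def run_queue(queue: list, dispatch: Callable[[dict], dict], budget_usd: float) -> float:
    """Process pending tasks until none remain or the budget cap is reached."""
    total_cost = 0.0
    while total_cost < budget_usd:
        # Assumption: a task is "runnable" when its state is "pending".
        task = next((t for t in queue if t["state"] == "pending"), None)
        if task is None:
            break  # queue drained
        task["state"] = "in_progress"
        result = dispatch(task)  # hypothetical worker: returns {"ok": bool, "cost_usd": float}
        # Write results back onto the task record.
        task["state"] = "succeeded" if result["ok"] else "failed"
        task["cost_usd"] = result["cost_usd"]
        total_cost += result["cost_usd"]
    return total_cost


def fake_worker(task: dict) -> dict:
    """Stand-in for a claude -p subprocess; always succeeds at a fixed cost."""
    return {"ok": True, "cost_usd": 0.25}


queue = [{"id": i, "type": "enrich_brand", "state": "pending"} for i in range(3)]
spent = run_queue(queue, fake_worker, budget_usd=0.5)
print(spent, [t["state"] for t in queue])
```

With a cap of 0.5 and a fixed cost of 0.25 per call, the third task is left pending, which is the "budget cap is hit" exit rather than the "queue is empty" one.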

Task types:

| Type | Worker | Outputs |
| --- | --- | --- |
| discover_webfetch | claude -p + WebFetch | One source's brand list |
| discover_browser | claude -p + chrome-devtools MCP (headless, isolated) | Same, for JS/Cloudflare-gated sources |
| merge_lists | pure jq | Deduplicated brands.json |
| enrich_brand | claude -p + WebSearch/WebFetch | Founding year, ownership, extension verdict |
| verify_extension | claude -p (Opus, sharper prompt) | Second-opinion on low-confidence verdicts |
| query | claude -p (no tools) | Final markdown answer |
| finalize | jq + bash | Run summary |

Each task carries state, attempts, max_attempts, cost_usd, and result_path fields. The state machine: pending → in_progress → succeeded, with failed tasks re-queued (incrementing attempts) up to max_attempts, then escalated or marked manual_review.
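One way to read that state machine is as a single transition function applied after each attempt. The sketch below is illustrative Python, not the pipeline's code. The function name `settle` is hypothetical, and it assumes that `attempts` is incremented on failure and that an exhausted task goes to `manual_review` (the text allows escalation as the alternative).

```python
def settle(task: dict, ok: bool) -> dict:
    """Apply one attempt's outcome to a task that is currently in_progress."""
    if task["state"] != "in_progress":
        raise ValueError("only in_progress tasks can be settled")
    if ok:
        task["state"] = "succeeded"
        return task
    task["attempts"] += 1
    if task["attempts"] < task["max_attempts"]:
        task["state"] = "pending"        # re-queued for another attempt
    else:
        task["state"] = "manual_review"  # assumption: could instead be escalated
    return task


# A task that fails every time, with max_attempts = 3.
task = {"id": "example", "state": "pending", "attempts": 0, "max_attempts": 3}
history = []
while task["state"] == "pending":
    task["state"] = "in_progress"
    settle(task, ok=False)
    history.append(task["state"])
print(history)
```

The task is re-queued twice and leaves the queue on the third failure.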


Discovery: WebFetch vs. headless browser

Sources have an extraction_method hint: webfetch, browser, or auto.

The browser worker invokes claude -p with chrome-devtools-mcp running in --isolated --headless mode (each subprocess gets a temp user-data-dir so multiple workers can run in parallel). It tries three extraction strategies in order:

  1. DOM/SVG text: query all <text> nodes inside the chart container.
  2. Network capture: inspect XHR/fetch responses during page load. Charts often pull JSON from a data endpoint.
  3. Screenshot + vision: page.screenshot() of the chart, hand to a vision-capable claude -p call to read brand names.

If WebFetch returns completeness: partial or sample, the task auto-escalates to discover_browser for the same source. If the browser worker fails three times, the source is marked manual_review: true and surfaced on the Sources page.
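The fallback order and the two escalation rules can be sketched as three small functions. This is illustrative Python rather than the pipeline's code. The function names are hypothetical, the strategy callables are stand-ins for the real extractors, and treating any completeness value other than partial or sample as "done" is an assumption.

```python
from typing import Callable, Iterable, Tuple


def after_webfetch(completeness: str) -> str:
    """Escalate to the browser worker when WebFetch got only part of the list."""
    return "discover_browser" if completeness in {"partial", "sample"} else "done"


def first_nonempty(strategies: Iterable[Tuple[str, Callable[[], list]]]) -> Tuple[str, list]:
    """Try extraction strategies in order; return the first that yields brands."""
    for name, run in strategies:
        brands = run()
        if brands:
            return name, brands
    return "none", []


def after_browser_failure(source: dict, failures: int, limit: int = 3) -> str:
    """Re-queue the browser task until the failure limit, then flag the source."""
    if failures >= limit:
        source["manual_review"] = True
        return "manual_review"
    return "discover_browser"


# Stand-in extractors: the DOM pass finds nothing, the network pass succeeds.
used, brands = first_nonempty([
    ("dom_svg_text", lambda: []),
    ("network_capture", lambda: ["Brand A", "Brand B"]),
    ("screenshot_vision", lambda: ["never reached"]),
])
print(used, brands)
```

Because the strategies are evaluated lazily, the screenshot pass (the most expensive one, since it needs a vision call) runs only when the two cheaper passes return nothing.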


Enrichment: per-brand research

For each unique brand, an enrich_brand task runs claude -p with WebSearch and WebFetch tools. The model is asked to fill in fields including the brand's founding year, ownership, and extension verdict (is_extension).

Each output field also carries a confidence rating (high / medium / low). By design, low confidence on is_extension auto-enqueues a verify_extension task that uses a different model (Opus 4.7 instead of Haiku) and a sharper prompt listing known edge cases (Athletic Greens, Ghost, Liquid Death-style category creators). If the verifier disagrees with the enricher, the brand is marked manual_review: true; it stays in the database but is flagged. As the DRIL comparison below notes, this auto-enqueue path is specified in DESIGN.md but was not implemented for this run, so enrichment here used Haiku without escalating to Opus.
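The intended routing is small enough to state as code. The sketch below is illustrative Python under stated assumptions: the function names are hypothetical, and the brand record is assumed to hold a boolean `is_extension` next to an `is_extension_confidence` string.

```python
def needs_verification(brand: dict) -> bool:
    """A low-confidence extension verdict should trigger a verify_extension task."""
    return brand.get("is_extension_confidence") == "low"


def apply_verdict(brand: dict, verifier_is_extension: bool) -> dict:
    """Flag the brand when the verifier disagrees; the enricher's value is kept."""
    if verifier_is_extension != brand["is_extension"]:
        brand["manual_review"] = True
    return brand


brand = {"name": "Example Brand", "is_extension": False, "is_extension_confidence": "low"}
if needs_verification(brand):
    # In the real design this value would come from the second model's answer.
    apply_verdict(brand, verifier_is_extension=True)
print(brand)
```

Note that a disagreement does not overwrite the original verdict. It only sets the flag, which matches the rule that the brand stays in the database.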


Parallelism

Default MAX_WORKERS=5, env-overridable. Implementation: xargs -P $MAX_WORKERS reading task IDs from the queue. Each worker is independent; they touch different rows in brands.json. Writes go through a flock-guarded merge helper to avoid lost updates.
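A lock-guarded merge of this kind might look like the following. It is a Python sketch of the same idea using `fcntl.flock` (Unix only), not the pipeline's shell helper. The function name `merge_brand`, the `.lock` and `.tmp` file names, and the assumption that brands.json is an object keyed by brand ID are all made up for the example.

```python
import fcntl
import json
import os


def merge_brand(path: str, brand_id: str, fields: dict) -> None:
    """Read-modify-write one brand's row in brands.json under an exclusive lock."""
    lock_path = path + ".lock"  # hypothetical lock-file name
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks while another worker holds the lock
        try:
            data = {}
            if os.path.exists(path):
                with open(path) as f:
                    data = json.load(f)
            # Assumption: the file is a JSON object keyed by brand ID.
            data.setdefault(brand_id, {}).update(fields)
            tmp_path = path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(data, f, indent=2)
            os.replace(tmp_path, path)  # atomic rename: readers never see a partial file
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

The lock has to cover the read as well as the write. If two workers each read the file, change a different row, and write it back, the second write silently discards the first worker's row. That is the lost update the helper exists to prevent.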


Cost and budget

Every claude -p call returns its cost in the JSON wrapper. The dispatcher accumulates total cost in run-state.json. Before dispatching any task, it checks total_cost < BUDGET_USD. Hitting budget is a graceful halt — the final report is written with whatever's done, and the run can be resumed later.
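The accumulate-then-check step can be sketched as below. This is illustrative Python, not the dispatcher itself. The function names are hypothetical, and the key that holds the cost in the JSON wrapper is passed in as a parameter because its exact name (`total_cost_usd` here) is an assumption.

```python
import json


def record_cost(run_state: dict, wrapper_json: str, cost_key: str = "total_cost_usd") -> dict:
    """Add one call's reported cost to the running total kept in run-state."""
    # Assumption: the wrapper is a JSON object carrying the cost under cost_key.
    cost = float(json.loads(wrapper_json).get(cost_key, 0.0))
    run_state["total_cost"] = round(run_state.get("total_cost", 0.0) + cost, 6)
    run_state["calls"] = run_state.get("calls", 0) + 1
    return run_state


def may_dispatch(run_state: dict, budget_usd: float) -> bool:
    """The pre-dispatch check: total_cost < BUDGET_USD."""
    return run_state.get("total_cost", 0.0) < budget_usd


state = {}
record_cost(state, '{"result": "...", "total_cost_usd": 0.23}')
print(state, may_dispatch(state, budget_usd=100.0))
```

Because the total lives in a persisted state file rather than in process memory, a resumed run starts from the previous total and the same check keeps working.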

This run:

Average enrichment cost per brand: ~$0.23. Average discovery cost per source: ~$0.65 (WebFetch) / ~$1.50 (browser).


Termination and audit

The loop exits when (pending_tasks == 0 AND escalated_tasks == 0) OR budget_exhausted. Then finalize.sh runs, which:

  1. Ensures the query task for the headline question has run.
  2. Counts: sources covered, sources skipped, brands found, brands enriched, brands flagged manual_review.
  3. Writes a complete-run report with numerical summary, the answer to the question, the manual-review section, the skipped-sources section, total cost, runtime, and call count.
  4. Posts a one-line summary via push notification.
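The exit condition and the counting step can be sketched as follows. This is illustrative Python, not finalize.sh. The function names are hypothetical, and so are the `skipped` flag on sources and the `enriched` flag on brands; only `manual_review` is named in the text.

```python
def should_exit(tasks: list, budget_exhausted: bool) -> bool:
    """(pending_tasks == 0 AND escalated_tasks == 0) OR budget_exhausted."""
    pending = sum(1 for t in tasks if t["state"] == "pending")
    escalated = sum(1 for t in tasks if t["state"] == "escalated")
    return (pending == 0 and escalated == 0) or budget_exhausted


def summarize(sources: list, brands: list) -> dict:
    """Counts for the run report; 'skipped' and 'enriched' are assumed flags."""
    return {
        "sources_covered": sum(1 for s in sources if not s.get("skipped")),
        "sources_skipped": sum(1 for s in sources if s.get("skipped")),
        "brands_found": len(brands),
        "brands_enriched": sum(1 for b in brands if b.get("enriched")),
        "brands_manual_review": sum(1 for b in brands if b.get("manual_review")),
    }


tasks = [{"state": "succeeded"}, {"state": "escalated"}]
print(should_exit(tasks, budget_exhausted=False))  # an escalated task keeps the loop alive
print(should_exit(tasks, budget_exhausted=True))   # the budget cap overrides that
```

Escalated tasks count toward the exit test because an escalation, such as a WebFetch source handed to the browser worker, still represents outstanding work.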

Every brand in the database has a sources[] array citing the URLs the model used to fill each field. Every model call has a JSONL log entry in logs/. The complete run-state.json, queue.json, and brands.json are preserved.


Comparison to DRIL

DRIL is a protocol for LLM-mediated research data collection: a design stage produces a frozen instrument (codebook, mapped unit space, evidence policy, citation contract, data-quality mechanisms), an implementation stage executes that instrument with logged agent calls, and a verification stage audits the output. The paper applies it to harmonized cross-country fiscal data (260 model calls, 8 countries, $21). Our pipeline applies the same architecture to "what was the last giant CPG hit," and we landed on most of the same building blocks independently.

The mapping (their term → our term)

| DRIL building block | Our equivalent | Notes |
| --- | --- | --- |
| Research instrument (codebook) | criteria.md + per-worker prompts in pipeline/prompts/ | Ours is split across files; theirs is a single frozen artifact. |
| Mapped unit space | data/sources.json (lists) → data/brands.json (brands), two-level | They use country × year; we use list × brand. Theirs is pre-enumerated (ISO codes); ours emerges from the discovery phase. |
| Evidence policy / source hierarchy | tier field on sources (tier-1 Bain/Numerator/Circana/NielsenIQ; tier-2 Pear/Food Institute/Food Dive/Fast Company) | We rank list-source authority but don't tag enrichment URLs (Forbes, Bloomberg, Wikipedia) with a role in a hierarchy. |
| Data-quality mechanisms | confidence ∈ {high, medium, low} per field; manual_review: true flag | We have the uncertainty taxonomy. We do not have explicit data-gap records with negative-search documentation. |
| Citation contract | sources[] per brand, binding URL + access date + publisher to specific fields | We have URL + access date. We're missing verbatim quotes and precise locators (the parts that defend against hallucination without re-fetching). |
| Frozen protocol | data/run-state.json + per-call logs in logs/ | We log every claude -p call as JSON. We don't snapshot the instrument version alongside outputs. |
| Two-stage architecture (design / implementation) | Conflated: we hand-wrote the design | They use a design-agent to generate the instrument from a natural-language objective. We are the design agent. |
| Verification stage | verify_extension task type (Opus) | Spec'd in DESIGN.md, not implemented. They also skipped theirs, and admit "for any downstream use, the verification stage is not optional." |
| Append-only observation ledger | None: we overwrite brands.json (flock-guarded) | History is recoverable from per-call logs but isn't a queryable ledger. |

Where we're weaker than DRIL

  1. No verbatim quotes in citations. Our enrichment cites the Forbes article that says Olipop was founded in 2018, but doesn't store the sentence. An auditor has to re-fetch and re-read to verify. DRIL's whole hallucination defense rests on the verbatim quote — "an invented quote will not match the cited source upon verification."

  2. No source-role tagging on enrichment URLs. Our sources.json ranks list sources by tier, but enrichment URLs aren't ranked. DRIL would have these in a hierarchy: international organization > official government > professional firm > specialized press.

  3. No explicit data-gap records. When enrichment can't find a founding year, the field is null. DRIL requires a record with reason ∈ {not_found_after_search, not_applicable, unclear_definition, conflict_unresolved, out_of_scope} and the queries tried. We have per-call search logs but they aren't linked to specific empty fields.

  4. No verification stage executed. Same as them — but they admit this disqualifies their dataset from publication-grade use. Our verify_extension is spec'd in DESIGN.md, the auto-enqueue path is unimplemented, and enrich.sh uses Haiku without ever escalating to Opus. The edge cases in Criteria (Athletic Greens, Ghost, Liquid Death) are exactly the disagreements a verification pass should adjudicate.

  5. No frozen instrument. Their design stage produces a versioned artifact locked before execution. Our prompts in pipeline/prompts/ are edited in place; reconstructing the instrument that produced a given enrichment row requires reading git history. The paper calls this out directly: "protocol versioning and re-execution so that design-stage corrections can be applied and their effects tracked without discarding the original run."

  6. Schema drift between brands. Some sources[] entries are field-keyed ({field, url, publisher, accessed}); others are source-keyed ({url, title, fields[]}). Different enrichment runs produced different shapes. Worth normalizing.
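The normalization mentioned in item 6 could go either way. The sketch below is illustrative Python that picks the field-keyed shape as the target, which is an assumption rather than a decision recorded in the source, and the function name is hypothetical. Values that one shape lacks are left as None instead of being guessed.

```python
def normalize_sources(entries: list) -> list:
    """Convert mixed sources[] entries to one field-keyed row per (field, url)."""
    rows = []
    for entry in entries:
        if "fields" in entry:
            # Source-keyed shape: {url, title, fields[]}; fan out to one row per field.
            for field_name in entry["fields"]:
                rows.append({
                    "field": field_name,
                    "url": entry["url"],
                    "title": entry.get("title"),
                    "publisher": None,  # not present in this shape; left empty
                    "accessed": None,
                })
        else:
            # Field-keyed shape: {field, url, publisher, accessed}.
            rows.append({
                "field": entry["field"],
                "url": entry["url"],
                "title": None,
                "publisher": entry.get("publisher"),
                "accessed": entry.get("accessed"),
            })
    return rows


mixed = [
    {"field": "founded", "url": "https://example.com/a", "publisher": "Example", "accessed": "2026-01-01"},
    {"url": "https://example.com/b", "title": "Example page", "fields": ["founded", "ownership"]},
]
for row in normalize_sources(mixed):
    print(row["field"], row["url"])
```

Fanning out to field-keyed rows lines up with the per-field evidence the citation contract calls for, at the cost of repeating the URL once per field.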

Where we're stronger or different

What would change after reading them

Three concrete additions, in priority order:

  1. Verbatim quotes in the citation contract. Extend the enrichment schema so each non-trivial field carries {value, confidence, evidence: [{quote, url, accessed_at, locator?}]}.

  2. Implement verify_extension. Wire the auto-enqueue from is_extension_confidence: low into dispatch.sh, route to Opus, surface disagreements as manual_review. DRIL's own Argentina-VAT example is the kind of internally-consistent-but-judgment-call situation the verification pass exists for.

  3. Explicit gap records on enrichment. When a field is unfilled, attach {reason, queries_tried, sources_checked} instead of null. Converts "we don't know" from a silence into a queryable signal.
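Additions 1 and 3 are both schema changes, so they can be sketched together. This is illustrative Python, not a committed schema. The class names and the `quote_matches` helper are hypothetical, the reason codes are the ones quoted from DRIL above, and the quote check assumes that a whitespace- and case-insensitive substring match is a good enough test.

```python
from dataclasses import dataclass, field
from typing import List, Optional

GAP_REASONS = {"not_found_after_search", "not_applicable", "unclear_definition",
               "conflict_unresolved", "out_of_scope"}


@dataclass
class Evidence:
    quote: str
    url: str
    accessed_at: str
    locator: Optional[str] = None


@dataclass
class Gap:
    reason: str
    queries_tried: List[str]
    sources_checked: List[str]

    def __post_init__(self) -> None:
        if self.reason not in GAP_REASONS:
            raise ValueError(f"unknown gap reason: {self.reason}")


@dataclass
class FieldRecord:
    value: Optional[object]
    confidence: str
    evidence: List[Evidence] = field(default_factory=list)
    gap: Optional[Gap] = None

    def __post_init__(self) -> None:
        # An unfilled field must explain itself instead of being a bare null.
        if self.value is None and self.gap is None:
            raise ValueError("unfilled field requires a gap record")


def quote_matches(evidence: Evidence, page_text: str) -> bool:
    """True if the stored quote appears in the fetched page, ignoring case and spacing."""
    def squash(s: str) -> str:
        return " ".join(s.split()).lower()
    return squash(evidence.quote) in squash(page_text)
```

With the quote stored, an auditor can verify a field with one fetch and a string comparison, and an invented quote fails the check. A `FieldRecord` with `value=None` cannot be constructed without a `Gap`, which is what turns a missing value into something queryable.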

The paper validates the architecture. The implementation here is closer to theirs than expected; the gaps are the one they themselves admit (no verification), plus the others listed above.


What this guarantees, and what it doesn't

Guarantees:

Doesn't guarantee:


Source code and design doc

The full design document is in the project repository at DESIGN.md. Key files: