How the autonomous pipeline produced this

Brand selection on this site was performed by an autonomous pipeline running in a single Claude Code session, not by hand-curation. The pipeline ran for ~85 minutes, made 396 model calls, and cost $98.21. This page documents what it did and how, so the methodology is auditable.

The architecture turns out to closely match DRIL (Distributed Research with Intelligent LLMs), the protocol proposed by Afonso, Galiani, Gálvez & Sosa in NBER w35188 (May 2026) for using LLM agents to produce auditable research datasets. The pipeline here was built before reading their paper; the comparison section below maps our terms to theirs and is honest about the gaps.

The work model

Everything is a task. Tasks live in a JSON queue. A driver pops the next runnable task, dispatches it to a worker, writes results back, and repeats until the queue is empty or a budget cap is hit.

Task types:

Type	Worker	Outputs
`discover_webfetch`	`claude -p` + WebFetch	One source's brand list
`discover_browser`	`claude -p` + chrome-devtools MCP (headless, isolated)	Same, for JS/Cloudflare-gated sources
`merge_lists`	pure jq	Deduplicated `brands.json`
`enrich_brand`	`claude -p` + WebSearch/WebFetch	Founding year, ownership, extension verdict
`verify_extension`	`claude -p` (Opus, sharper prompt)	Second-opinion on low-confidence verdicts
`query`	`claude -p` (no tools)	Final markdown answer
`finalize`	jq + bash	Run summary

Each task carries state, attempts, max_attempts, cost_usd, and result_path fields. The state machine: pending → in_progress → succeeded, with failed tasks re-queued (incrementing attempts) up to max_attempts, then escalated or marked manual_review.
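The state machine above can be sketched in a few lines. This is an illustrative Python model, not the pipeline's own code (which is bash and jq). The field names come from the text; the `max_attempts` default of 3, the `can_escalate` switch, and the `run_queue` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Task:
    id: str
    type: str
    state: str = "pending"
    attempts: int = 0
    max_attempts: int = 3  # assumed default; the text does not state one
    cost_usd: float = 0.0
    result_path: Optional[str] = None


def record_outcome(task: Task, ok: bool, cost_usd: float,
                   result_path: Optional[str] = None,
                   can_escalate: bool = False) -> Task:
    """Apply one worker result to a task and move it to its next state."""
    task.attempts += 1
    task.cost_usd += cost_usd
    if ok:
        task.state = "succeeded"
        task.result_path = result_path
    elif task.attempts < task.max_attempts:
        task.state = "pending"  # re-queued for another attempt
    else:
        # Out of attempts: hand off to another task type or to a human.
        task.state = "escalated" if can_escalate else "manual_review"
    return task


# A worker returns (ok, cost_usd, result_path) for one task.
Worker = Callable[[Task], Tuple[bool, float, Optional[str]]]


def run_queue(queue: List[Task], worker: Worker) -> None:
    """Dispatch pending tasks one at a time until none remain."""
    while True:
        task = next((t for t in queue if t.state == "pending"), None)
        if task is None:
            break
        task.state = "in_progress"
        ok, cost, path = worker(task)
        record_outcome(task, ok, cost, path)
```

Because a failed task returns to `pending` with its `attempts` counter incremented, the retry limit is enforced by the task record itself and survives a restart of the driver.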

Discovery: WebFetch vs. headless browser

Sources have an extraction_method hint: webfetch, browser, or auto.

WebFetch is cheap and fast. It works for most sources where brand names appear in HTML text or accessible JSON.
Headless browser is required for Cloudflare-gated, JS-rendered, or chart-rendered sources (Bain's Insurgent Brands editions are the canonical case — the brand names live in SVG <text> nodes inside a chart that loads after page render, and Cloudflare blocks most non-browser user agents).

The browser worker invokes claude -p with chrome-devtools-mcp running in --isolated --headless mode (each subprocess gets a temp user-data-dir so multiple workers can run in parallel). It tries three extraction strategies in order:

DOM/SVG text: query all <text> nodes inside the chart container.
Network capture: inspect XHR/fetch responses during page load. Charts often pull JSON from a data endpoint.
Screenshot + vision: page.screenshot() of the chart, hand to a vision-capable claude -p call to read brand names.

If WebFetch returns completeness: partial or sample, the task auto-escalates to discover_browser for the same source. If the browser worker fails three times, the source is marked manual_review: true and surfaced on the Sources page.
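The ordered fallback and the escalation rule can be sketched as follows. This is a hypothetical Python model: the strategy callables stand in for the real DOM, network, and screenshot steps, and the function names and the "retry" and "done" return values are assumptions for illustration.

```python
from typing import Callable, List, Optional, Tuple

# A strategy takes a URL and returns a list of brand names (or nothing).
Strategy = Callable[[str], Optional[List[str]]]

# Completeness values that trigger escalation, as named in the text.
ESCALATE_ON = {"partial", "sample"}


def extract_with_fallback(url: str,
                          strategies: List[Tuple[str, Strategy]]) -> Optional[dict]:
    """Try each (name, strategy) in order; return the first non-empty result."""
    for name, strategy in strategies:
        try:
            brands = strategy(url)
        except Exception:
            brands = None  # a failing strategy just falls through to the next
        if brands:
            return {"method": name, "brands": brands}
    return None


def next_step(task_type: str, completeness: Optional[str] = None,
              failed: bool = False, attempts: int = 0,
              max_attempts: int = 3) -> str:
    """Decide what happens to a discovery task after one attempt."""
    if task_type == "discover_webfetch" and completeness in ESCALATE_ON:
        return "discover_browser"
    if task_type == "discover_browser" and failed:
        return "manual_review" if attempts >= max_attempts else "retry"
    return "done"
```

Keeping the strategies in an ordered list means the cheapest method that yields any brands wins, and the more expensive vision step only runs when the earlier two return nothing.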

Enrichment: per-brand research

For each unique brand, an enrich_brand task runs claude -p with WebSearch and WebFetch tools. The model is asked to fill in:

founded_year, founder, first_retail_year
us_market (boolean)
is_extension (boolean) with extension_reasoning (why)
ownership_status (independent / acquired / public), acquired_by, acquired_year
latest_revenue_usd, valuation_usd
sources[] — list of URLs cited per field

Each output field also carries a confidence rating (high / medium / low). By design, low confidence on is_extension auto-enqueues a verify_extension task using a different model (Opus 4.7 instead of Haiku) and a sharper prompt that lists known edge cases (Athletic Greens, Ghost, Liquid Death-style category creators). If the verifier disagrees with the enricher, the brand is marked manual_review: true; it stays in the database but is flagged. In this run the auto-enqueue path was not yet implemented, so enrichment ran on Haiku without escalation (see Comparison to DRIL).
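The escalation rule described above is small enough to sketch directly. This is a hedged Python illustration of the intended behavior, not code from the repository; the `is_extension_confidence` key follows the name used later on this page, while the `name` key and both function names are assumptions.

```python
from typing import List


def followup_tasks(brand: dict) -> List[dict]:
    """Return the verify_extension task implied by a low-confidence verdict."""
    if brand.get("is_extension_confidence") == "low":
        return [{
            "type": "verify_extension",
            "brand": brand["name"],
            "state": "pending",
            "attempts": 0,
        }]
    return []


def apply_verifier(brand: dict, verifier_is_extension: bool) -> dict:
    """Flag the brand for manual review when the two verdicts disagree."""
    updated = dict(brand)  # copy so the original record is not mutated
    disagree = verifier_is_extension != brand["is_extension"]
    updated["manual_review"] = bool(brand.get("manual_review", False) or disagree)
    return updated
```

Note that a disagreement never deletes or rewrites the enricher's verdict; it only sets the flag, which matches the "stays in the database but is flagged" behavior.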

Parallelism

Default MAX_WORKERS=5, env-overridable. Implementation: xargs -P $MAX_WORKERS reading task IDs from the queue. Each worker is independent; they touch different rows in brands.json. Writes go through a flock-guarded merge helper to avoid lost updates.
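A lock-guarded merge of this kind can be sketched in Python with `fcntl.flock`, the same advisory lock the `flock` command wraps. This is an illustration under stated assumptions, not the pipeline's helper: it is Unix-only, it models brands.json as an object keyed by brand, and the `.lock` and `.tmp` sidecar file names are invented for the example.

```python
import fcntl
import json
import os
from concurrent.futures import ThreadPoolExecutor

# Same default and override mechanism as described in the text.
MAX_WORKERS = int(os.environ.get("MAX_WORKERS", "5"))


def locked_merge(db_path: str, brand_key: str, fields: dict) -> None:
    """Read-modify-write one brand's fields under an exclusive file lock."""
    with open(db_path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until no other holder
        try:
            with open(db_path) as f:
                db = json.load(f)
        except FileNotFoundError:
            db = {}
        db.setdefault(brand_key, {}).update(fields)
        tmp_path = db_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(db, f)
        os.replace(tmp_path, db_path)  # atomic swap on the same filesystem
    # Closing the lock file releases the lock.


def merge_all(db_path: str, updates: dict) -> None:
    """Apply {brand_key: fields} updates from MAX_WORKERS parallel workers."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        list(pool.map(lambda kv: locked_merge(db_path, kv[0], kv[1]),
                      updates.items()))
```

Without the lock, two workers could both read the file, each add their own row, and the later write would silently drop the earlier one. Holding the lock across the whole read-modify-write is what prevents that lost update.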

Cost and budget

Every claude -p call returns its cost in the JSON wrapper. The dispatcher accumulates total cost in run-state.json. Before dispatching any task, it checks total_cost < BUDGET_USD. Hitting budget is a graceful halt — the final report is written with whatever's done, and the run can be resumed later.
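The budget gate amounts to one comparison before each dispatch. The sketch below is a Python illustration with hypothetical names (`run_with_budget`, `call_worker`, and the keys of the returned state); the only behavior taken from the text is "check total_cost < BUDGET_USD before dispatching, halt gracefully otherwise".

```python
from typing import Callable, List


def run_with_budget(tasks: List[str],
                    call_worker: Callable[[str], float],
                    budget_usd: float) -> dict:
    """Dispatch tasks in order until the queue is empty or the budget is hit.

    call_worker runs one task and returns the cost in USD reported for it.
    """
    state = {"total_cost": 0.0, "calls": 0, "halted": False}
    pending = list(tasks)
    while pending:
        if state["total_cost"] >= budget_usd:
            state["halted"] = True  # graceful halt; remaining work is kept
            break
        task = pending.pop(0)
        state["total_cost"] += call_worker(task)
        state["calls"] += 1
    state["remaining"] = pending  # what a resumed run would pick up
    return state
```

Because the check happens before dispatch and the cost is only known afterwards, a gate like this can overshoot the budget by at most the cost of one call.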

This run:

Total cost: $98.21
Model calls: 396
Budget: $1,000
Runtime: ~85 minutes (started 20:22 UTC, finalized 21:47 UTC)

Average enrichment cost per brand: ~$0.23. Average discovery cost per source: ~$0.65 (WebFetch) / ~$1.50 (browser).

Termination and audit

The loop exits when (pending_tasks == 0 AND escalated_tasks == 0) OR budget_exhausted. Then finalize.sh runs, which:

Ensures the query task for the headline question has run.
Counts: sources covered, sources skipped, brands found, brands enriched, brands flagged manual_review.
Writes a complete-run report with numerical summary, the answer to the question, the manual-review section, the skipped-sources section, total cost, runtime, and call count.
Posts a one-line summary via push notification.
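The exit condition and the counting step can be expressed compactly. This Python sketch is illustrative only: finalize.sh is described as jq and bash, and the record keys used here (`status`, `enriched`, `manual_review`) are assumed for the example.

```python
from collections import Counter
from typing import List


def should_exit(tasks: List[dict], budget_exhausted: bool) -> bool:
    """(pending == 0 AND escalated == 0) OR budget_exhausted."""
    states = Counter(t["state"] for t in tasks)
    return (states["pending"] == 0 and states["escalated"] == 0) or budget_exhausted


def summarize(sources: List[dict], brands: List[dict]) -> dict:
    """Produce the numerical summary that heads the run report."""
    return {
        "sources_covered": sum(1 for s in sources if s.get("status") == "covered"),
        "sources_skipped": sum(1 for s in sources if s.get("status") == "skipped"),
        "brands_found": len(brands),
        "brands_enriched": sum(1 for b in brands if b.get("enriched")),
        "brands_manual_review": sum(1 for b in brands if b.get("manual_review")),
    }
```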

Every brand in the database has a sources[] array citing the URLs the model used to fill each field. Every model call has a JSONL log entry in logs/. The complete run-state.json, queue.json, and brands.json are preserved.

Comparison to DRIL

DRIL is a protocol for LLM-mediated research data collection: a design stage produces a frozen instrument (codebook, mapped unit space, evidence policy, citation contract, data-quality mechanisms), an implementation stage executes that instrument with logged agent calls, and a verification stage audits the output. The paper applies it to harmonized cross-country fiscal data (260 model calls, 8 countries, $21). Our pipeline applies the same architecture to "what was the last giant CPG hit," and we landed on most of the same building blocks independently.

The mapping (their term → our term)

DRIL building block	Our equivalent	Notes
Research instrument (codebook)	`criteria.md` + per-worker prompts in `pipeline/prompts/`	Ours is split across files; theirs is a single frozen artifact.
Mapped unit space	`data/sources.json` (lists) → `data/brands.json` (brands) — two-level	They use country × year; we use list × brand. Theirs is pre-enumerated (ISO codes); ours emerges from the discovery phase.
Evidence policy / source hierarchy	`tier` field on sources (tier-1 Bain/Numerator/Circana/NielsenIQ; tier-2 Pear/Food Institute/Food Dive/Fast Company)	We rank list-source authority but don't tag enrichment URLs (Forbes, Bloomberg, Wikipedia) with a role in a hierarchy.
Data-quality mechanisms	`confidence ∈ {high, medium, low}` per field; `manual_review: true` flag	We have the uncertainty taxonomy. We do not have explicit data-gap records with negative-search documentation.
Citation contract	`sources[]` per brand, binding URL + access date + publisher to specific fields	We have URL + access date. We're missing verbatim quotes and precise locators (the parts that defend against hallucination without re-fetching).
Frozen protocol	`data/run-state.json` + per-call logs in `logs/`	We log every `claude -p` call as JSON. We don't snapshot the instrument version alongside outputs.
Two-stage architecture (design / implementation)	Conflated — we hand-wrote the design	They use a design-agent to generate the instrument from a natural-language objective. We are the design agent.
Verification stage	`verify_extension` task type (Opus)	Spec'd in `DESIGN.md`, not implemented. They also skipped theirs — and admit "for any downstream use, the verification stage is not optional."
Append-only observation ledger	None — we overwrite `brands.json` (`flock`-guarded)	History is recoverable from per-call logs but isn't a queryable ledger.

Where we're weaker than DRIL

No verbatim quotes in citations. Our enrichment cites the Forbes article that says Olipop was founded in 2018, but doesn't store the sentence. An auditor has to re-fetch and re-read to verify. DRIL's whole hallucination defense rests on the verbatim quote — "an invented quote will not match the cited source upon verification."
No source-role tagging on enrichment URLs. Our sources.json ranks list sources by tier, but enrichment URLs aren't ranked. DRIL would have these in a hierarchy: international organization > official government > professional firm > specialized press.
No explicit data-gap records. When enrichment can't find a founding year, the field is null. DRIL requires a record with reason ∈ {not_found_after_search, not_applicable, unclear_definition, conflict_unresolved, out_of_scope} and the queries tried. We have per-call search logs but they aren't linked to specific empty fields.
No verification stage executed. Same as them — but they admit this disqualifies their dataset from publication-grade use. Our verify_extension is spec'd in DESIGN.md, the auto-enqueue path is unimplemented, and enrich.sh uses Haiku without ever escalating to Opus. The edge cases in Criteria (Athletic Greens, Ghost, Liquid Death) are exactly the disagreements a verification pass should adjudicate.
No frozen instrument. Their design stage produces a versioned artifact locked before execution. Our prompts in pipeline/prompts/ are edited in place; reconstructing the instrument that produced a given enrichment row requires reading git history. The paper calls this out directly: "protocol versioning and re-execution so that design-stage corrections can be applied and their effects tracked without discarding the original run."
Schema drift between brands. Some sources[] entries are field-keyed ({field, url, publisher, accessed}); others are source-keyed ({url, title, fields[]}). Different enrichment runs produced different shapes. Worth normalizing.
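One way to normalize the two sources[] shapes is to expand every source-keyed entry into one field-keyed entry per field. The Python sketch below assumes the two shapes are exactly the ones named above and that field-keyed is the target; the function name and the choice to carry `title` along as an optional key are assumptions.

```python
from typing import List

TARGET_KEYS = ("field", "url", "publisher", "accessed", "title")


def normalize_sources(sources: List[dict]) -> List[dict]:
    """Rewrite mixed sources[] entries into one field-keyed shape."""
    out = []
    for entry in sources:
        # Source-keyed entries list several fields; field-keyed name one.
        fields = entry["fields"] if "fields" in entry else [entry["field"]]
        for field in fields:
            row = {key: entry.get(key) for key in TARGET_KEYS}
            row["field"] = field
            out.append(row)
    return out
```

Keys a given shape never carried (for example `publisher` on a source-keyed entry) come out as `None` rather than being guessed, so the gap stays visible after normalization.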

Where we're stronger or different

Two-level unit space with emergent units. They run one agent per country (pre-enumerated). We run discovery (per source → list of brands) → merge → enrichment (per brand → fields). The merge step is a pure-jq normalizer (brand | ascii_downcase | gsub("[^a-z0-9]"; "")); they don't need this because their units are ISO codes. For us, the unit space emerges from discovery. The paper doesn't discuss this mode.
Heterogeneous extraction within one task type. discover_browser tries DOM/SVG text → network capture → screenshot + vision in sequence (the Bain Insurgent Brands chart needs at least two of these). DRIL implicitly assumes the agent's host provides one search/read capability.
Cost gate, not just cost telemetry. We read total_cost_usd from --output-format json and hard-halt at BUDGET_USD. Their cost figure ($21) is post-hoc reporting, not a budget gate.
Parallelism via drainers + directory-lock queue. pipeline/drainer.sh w1 w2 … against a mkdir-locked queue is more operational than anything in the paper. They ran sequentially.
Agentic autonomy at Tier III. Workers are handed a target and a prompt and decide what to search, which sources to trust, and what to extract. The discover_webfetch → discover_browser escalation is what the paper calls "the agent exercises judgment about where to look, what to trust, and how to handle the unexpected" — but the paper runs on Codex CLI under a flat ChatGPT subscription; our chrome-devtools-MCP-in-subprocess pattern (--isolated --headless, --strict-mcp-config) is a more aggressive tooling story.
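The merge normalizer mentioned in the first point above is a one-line jq filter; a Python equivalent makes its effect easy to see. This is a sketch, not the pipeline's merge step: `brand_key` mirrors the jq expression (ASCII-only lowercasing, then stripping everything outside a-z and 0-9), while `merge_lists` and its input and output shapes are assumptions.

```python
import re
import string
from typing import Dict, List

# jq's ascii_downcase only touches A-Z, so do the same here.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def brand_key(name: str) -> str:
    """Equivalent of: ascii_downcase | gsub("[^a-z0-9]"; "")"""
    return re.sub(r"[^a-z0-9]", "", name.translate(_ASCII_LOWER))


def merge_lists(lists: Dict[str, List[str]]) -> Dict[str, dict]:
    """Deduplicate brand names across source lists, keyed by brand_key."""
    merged: Dict[str, dict] = {}
    for source_id, names in lists.items():
        for name in names:
            entry = merged.setdefault(brand_key(name),
                                      {"name": name, "sources": []})
            if source_id not in entry["sources"]:
                entry["sources"].append(source_id)
    return merged
```

The key collapses case, spacing, and punctuation, so "Liquid Death" and "liquid-death" merge. The trade-off is that two genuinely different brands whose names differ only in punctuation would also merge.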

What would change after reading them

Three concrete additions, in priority order:

Verbatim quotes in the citation contract. Extend the enrichment schema so each non-trivial field carries {value, confidence, evidence: [{quote, url, accessed_at, locator?}]}.
Implement verify_extension. Wire the auto-enqueue from is_extension_confidence: low into dispatch.sh, route to Opus, surface disagreements as manual_review. DRIL's own Argentina-VAT example is the kind of internally-consistent-but-judgment-call situation the verification pass exists for.
Explicit gap records on enrichment. When a field is unfilled, attach {reason, queries_tried, sources_checked} instead of null. Converts "we don't know" from a silence into a queryable signal.
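The first and third additions can be sketched as record builders plus a quote check. This Python example is a hypothetical shape for the proposed schema, not an implemented one: the record keys and the gap reasons come from the text above, while the function names and the whitespace-and-case-insensitive matching rule are assumptions.

```python
from typing import List, Optional

# Reason codes listed in the DRIL comparison above.
GAP_REASONS = {
    "not_found_after_search", "not_applicable", "unclear_definition",
    "conflict_unresolved", "out_of_scope",
}


def evidence_field(value, confidence: str, quote: str, url: str,
                   accessed_at: str, locator: Optional[str] = None) -> dict:
    """Build {value, confidence, evidence: [{quote, url, accessed_at, locator}]}."""
    return {
        "value": value,
        "confidence": confidence,
        "evidence": [{"quote": quote, "url": url,
                      "accessed_at": accessed_at, "locator": locator}],
    }


def gap_record(reason: str, queries_tried: List[str],
               sources_checked: List[str]) -> dict:
    """Build the record stored in place of null for an unfilled field."""
    if reason not in GAP_REASONS:
        raise ValueError("unknown gap reason: " + reason)
    return {"reason": reason, "queries_tried": list(queries_tried),
            "sources_checked": list(sources_checked)}


def quote_matches(quote: str, page_text: str) -> bool:
    """True if the quote appears in the page, ignoring case and whitespace runs."""
    def norm(s: str) -> str:
        return " ".join(s.split()).casefold()
    return bool(quote.strip()) and norm(quote) in norm(page_text)
```

With quotes stored, an audit pass only needs to re-fetch each URL and call something like `quote_matches`. An invented quote fails the check without anyone having to re-read the article.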

The paper validates the architecture. The implementation here is closer to theirs than expected; the main gaps are the one they themselves admit (no executed verification stage) and the two other additions listed above (verbatim quotes and explicit gap records).

What this guarantees, and what it doesn't

Guarantees:

No brand was added or excluded by hand-picking. The criteria are fixed and applied uniformly.
Every classification is traceable to source URLs.
The run is repeatable: re-running the loop with the same sources and the same prompts should produce a comparable dataset, model nondeterminism aside.

Doesn't guarantee:

That the source lists themselves are exhaustive. If a brand isn't on Bain, Numerator, Circana, Inc. 5000 F&B, Pear Commerce, Food Institute, or Food Dive, it isn't here. See Sources for the full list.
That every model classification is correct. The verify_extension second opinion is meant to catch obvious disagreements and flag subtler edge cases as manual_review, but it was not executed in this run (see Comparison to DRIL). As a result, 0 brands are currently flagged for verifier disagreement; 5 brands are flagged for incomplete data (see the Brands page).
That the data won't drift. Acquisitions, valuations, and revenue change. The dataset reflects state as of the run timestamp.

Source code and design doc

The full design document is in the project repository at DESIGN.md. Key files:

pipeline/loop.sh — top-level driver
pipeline/dispatch.sh — one task → one worker
pipeline/workers/*.sh — per-task-type implementations
pipeline/prompts/*.md — prompt templates passed to claude -p
pipeline/schemas/*.json — JSON schemas for validation
pipeline/lib/chrome-devtools-isolated.json — MCP config for browser workers