Brand selection on this site was performed by an autonomous pipeline running in a single Claude Code session, not by hand-curation. The pipeline ran for ~85 minutes, made 396 model calls, and cost $98.21. This page documents what it did and how, so the methodology is auditable.
The architecture turns out to closely match DRIL (Distributed Research with Intelligent LLMs), the protocol proposed by Afonso, Galiani, Gálvez & Sosa in NBER w35188 (May 2026) for using LLM agents to produce auditable research datasets. The pipeline here was built before reading their paper; the comparison section below maps our terms to theirs and is honest about the gaps.
The work model
Everything is a task. Tasks live in a JSON queue. A driver pops the next runnable task, dispatches it to a worker, writes results back, and repeats until the queue is empty or a budget cap is hit.
Task types:
| Type | Worker | Outputs |
|---|---|---|
| discover_webfetch | claude -p + WebFetch | One source's brand list |
| discover_browser | claude -p + chrome-devtools MCP (headless, isolated) | Same, for JS/Cloudflare-gated sources |
| merge_lists | pure jq | Deduplicated brands.json |
| enrich_brand | claude -p + WebSearch/WebFetch | Founding year, ownership, extension verdict |
| verify_extension | claude -p (Opus, sharper prompt) | Second opinion on low-confidence verdicts |
| query | claude -p (no tools) | Final markdown answer |
| finalize | jq + bash | Run summary |
Each task carries state, attempts, max_attempts, cost_usd, and result_path fields. The state machine: pending → in_progress → succeeded, with failed tasks re-queued (incrementing attempts) up to max_attempts, then escalated or marked manual_review.
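The state machine above can be sketched in a few lines. This is a minimal Python illustration, not the pipeline's actual code (the pipeline is shell and jq): the field names come from the description above, while the default max_attempts of 3 and the helper names start and finish are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    id: str
    type: str
    state: str = "pending"        # pending | in_progress | succeeded | manual_review
    attempts: int = 0
    max_attempts: int = 3         # assumed default; the real value is not stated
    cost_usd: float = 0.0
    result_path: Optional[str] = None


def start(task: Task) -> None:
    """pending -> in_progress."""
    if task.state != "pending":
        raise ValueError(f"cannot start a task in state {task.state!r}")
    task.state = "in_progress"


def finish(task: Task, ok: bool, cost_usd: float,
           result_path: Optional[str] = None) -> None:
    """Record the outcome; failures are re-queued until attempts run out."""
    if task.state != "in_progress":
        raise ValueError(f"cannot finish a task in state {task.state!r}")
    task.cost_usd += cost_usd
    if ok:
        task.state = "succeeded"
        task.result_path = result_path
        return
    task.attempts += 1
    # Re-queue while attempts remain; otherwise hand off for manual review
    # (the real pipeline may escalate to another task type instead).
    task.state = "pending" if task.attempts < task.max_attempts else "manual_review"
```

Counting attempts on failure rather than on dispatch follows the wording above ("failed tasks re-queued, incrementing attempts"); either convention works as long as the driver and the cap agree.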
Discovery: WebFetch vs. headless browser
Sources have an extraction_method hint: webfetch, browser, or auto.
- WebFetch is cheap and fast. It works for most sources where brand names appear in HTML text or accessible JSON.
- Headless browser is required for Cloudflare-gated, JS-rendered, or chart-rendered sources. Bain's Insurgent Brands editions are the canonical case: the brand names live in SVG <text> nodes inside a chart that loads after page render, and Cloudflare blocks most non-browser user agents.
The browser worker invokes claude -p with chrome-devtools-mcp running in --isolated --headless mode (each subprocess gets a temp user-data-dir so multiple workers can run in parallel). Three extraction strategies in order:
- DOM/SVG text: query all <text> nodes inside the chart container.
- Network capture: inspect XHR/fetch responses during page load. Charts often pull JSON from a data endpoint.
- Screenshot + vision: take a page.screenshot() of the chart and hand it to a vision-capable claude -p call to read the brand names.
If WebFetch returns completeness: partial or sample, the task auto-escalates to discover_browser for the same source. If the browser worker fails three times, the source is marked manual_review: true and surfaced on the Sources page.
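The ordered fallback and the escalation rules above can be sketched as follows. This is a hedged Python illustration only: the function names are hypothetical, the strategies are passed in as plain callables, and any completeness value other than partial or sample is assumed to mean the source is done.

```python
MAX_BROWSER_FAILURES = 3  # from the text: three browser failures -> manual review


def extract_brands(strategies):
    """Try each (name, fn) pair in order; return the first non-empty result."""
    for name, fn in strategies:
        brands = fn()
        if brands:
            return name, brands
    return None, []


def after_webfetch(completeness: str) -> str:
    """Partial or sample WebFetch output escalates to the browser worker."""
    return "discover_browser" if completeness in {"partial", "sample"} else "done"


def after_browser_failure(failures_so_far: int) -> str:
    """Decide what to do when a browser attempt has just failed."""
    failures = failures_so_far + 1
    return "manual_review" if failures >= MAX_BROWSER_FAILURES else "retry"
```

In the real workers each strategy would be a claude -p invocation; here a strategy is anything that returns a list of brand names, which keeps the ordering logic separate from the extraction itself.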
Enrichment: per-brand research
For each unique brand, an enrich_brand task runs claude -p with WebSearch and WebFetch tools. The model is asked to fill in:
- founded_year, founder, first_retail_year
- us_market (boolean)
- is_extension (boolean) with extension_reasoning (why)
- ownership_status (independent / acquired / public), acquired_by, acquired_year
- latest_revenue_usd, valuation_usd
- sources[]: list of URLs cited per field
Each output field also carries a confidence rating (high / medium / low). By design, low confidence on is_extension auto-enqueues a verify_extension task using a different model (Opus 4.7 instead of Haiku) and a sharper prompt that lists known edge cases (Athletic Greens, Ghost, Liquid Death-style category creators). If the verifier disagrees with the enricher, the brand is marked manual_review: true; it stays in the database but is flagged. As the DRIL comparison below notes, this path is spec'd in DESIGN.md but the auto-enqueue is not yet implemented, so it was not exercised in this run.
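The escalation rule just described, as a sketch of the intended behavior. The record shape ({value, confidence} under is_extension) and both function names are assumptions made for the example; the source states only that each field carries a confidence rating.

```python
def follow_up_tasks(brand: dict) -> list:
    """Return the task types an enrichment record should enqueue next."""
    ext = brand.get("is_extension", {})
    if ext.get("confidence") == "low":
        return ["verify_extension"]
    return []


def reconcile(enricher_verdict: bool, verifier_verdict: bool) -> dict:
    """A disagreement keeps the brand in the database but flags it."""
    return {
        "is_extension": enricher_verdict,
        "manual_review": enricher_verdict != verifier_verdict,
    }
```

Keeping the enricher's verdict and only raising the flag mirrors the rule that a disputed brand stays in the database rather than being dropped or silently overwritten.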
Parallelism
Default MAX_WORKERS=5, env-overridable. Implementation: xargs -P $MAX_WORKERS reading task IDs from the queue. Each worker is independent; they touch different rows in brands.json. Writes go through a flock-guarded merge helper to avoid lost updates.
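The lost-update problem that the flock-guarded helper avoids can be illustrated in Python. This is a sketch under assumptions, not the pipeline's helper: it uses fcntl.flock on a sidecar lock file (POSIX only), threads in place of xargs subprocesses, a temporary brands.json, and a hypothetical merge_brand function.

```python
import fcntl
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def merge_brand(db_path: str, brand_id: str, fields: dict) -> None:
    """Read-modify-write db_path while holding an exclusive flock."""
    with open(db_path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)        # blocks until no one else holds it
        try:
            with open(db_path) as f:
                db = json.load(f)
            db.setdefault(brand_id, {}).update(fields)
            tmp_path = db_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(db, f)
            os.replace(tmp_path, db_path)       # atomic rename on POSIX
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)


# Usage: several workers each update a different row of the same file.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "brands.json")
with open(db_path, "w") as f:
    json.dump({}, f)

max_workers = int(os.environ.get("MAX_WORKERS", "5"))
with ThreadPoolExecutor(max_workers=max_workers) as pool:
    list(pool.map(lambda i: merge_brand(db_path, f"brand{i}", {"n": i}), range(20)))
```

Even though workers touch different rows, they all rewrite the same file, so without the lock two concurrent read-modify-write cycles could each drop the other's row.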
Cost and budget
Every claude -p call returns its cost in the JSON wrapper. The dispatcher accumulates total cost in run-state.json. Before dispatching any task, it checks total_cost < BUDGET_USD. Hitting budget is a graceful halt — the final report is written with whatever's done, and the run can be resumed later.
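A budget gate of this kind is small enough to sketch in full. The Python below is illustrative: it assumes the cost is read from a total_cost_usd key in the JSON wrapper (the field this page names in the DRIL comparison) and that run-state is a dict with a total_cost entry; the function names are hypothetical.

```python
import json


def record_cost(run_state: dict, wrapper_json: str) -> None:
    """Add one call's reported cost to the running total."""
    wrapper = json.loads(wrapper_json)
    run_state["total_cost"] += float(wrapper.get("total_cost_usd", 0.0))


def may_dispatch(run_state: dict, budget_usd: float) -> bool:
    """Gate checked before every dispatch: total_cost < BUDGET_USD."""
    return run_state["total_cost"] < budget_usd
```

Because the check runs before dispatch rather than after, the total can overshoot the budget by at most the cost of the calls already in flight.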
This run:
- Total cost: $98.21
- Model calls: 396
- Budget: $1,000
- Runtime: ~85 minutes (started 20:22 UTC, finalized 21:47 UTC)
Average enrichment cost per brand: ~$0.23. Average discovery cost per source: ~$0.65 (WebFetch) / ~$1.50 (browser).
Termination and audit
The loop exits when (pending_tasks == 0 AND escalated_tasks == 0) OR budget_exhausted. Then finalize.sh runs, which:
- Ensures the query task for the headline question has run.
- Counts: sources covered, sources skipped, brands found, brands enriched, brands flagged manual_review.
- Writes a complete-run report with numerical summary, the answer to the question, the manual-review section, the skipped-sources section, total cost, runtime, and call count.
- Posts a one-line summary via push notification.
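The exit condition stated at the top of this section translates directly into code. A minimal Python sketch, assuming each task is a dict with a state key and that escalated tasks carry the state "escalated" (the exact state name is an assumption):

```python
from collections import Counter


def should_exit(tasks: list, budget_exhausted: bool) -> bool:
    """(pending_tasks == 0 AND escalated_tasks == 0) OR budget_exhausted."""
    states = Counter(t["state"] for t in tasks)
    queue_drained = states["pending"] == 0 and states["escalated"] == 0
    return queue_drained or budget_exhausted
```

Tasks in manual_review do not block termination under this condition, which is consistent with the report listing them in their own section.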
Every brand in the database has a sources[] array citing the URLs the model used to fill each field. Every model call has a JSONL log entry in logs/. The complete run-state.json, queue.json, and brands.json are preserved.
Comparison to DRIL
DRIL is a protocol for LLM-mediated research data collection: a design stage produces a frozen instrument (codebook, mapped unit space, evidence policy, citation contract, data-quality mechanisms), an implementation stage executes that instrument with logged agent calls, and a verification stage audits the output. The paper applies it to harmonized cross-country fiscal data (260 model calls, 8 countries, $21). Our pipeline applies the same architecture to "what was the last giant CPG hit," and we landed on most of the same building blocks independently.
The mapping (their term → our term)
| DRIL building block | Our equivalent | Notes |
|---|---|---|
| Research instrument (codebook) | criteria.md + per-worker prompts in pipeline/prompts/ | Ours is split across files; theirs is a single frozen artifact. |
| Mapped unit space | data/sources.json (lists) → data/brands.json (brands), two-level | They use country × year; we use list × brand. Theirs is pre-enumerated (ISO codes); ours emerges from the discovery phase. |
| Evidence policy / source hierarchy | tier field on sources (tier-1 Bain/Numerator/Circana/NielsenIQ; tier-2 Pear/Food Institute/Food Dive/Fast Company) | We rank list-source authority but don't tag enrichment URLs (Forbes, Bloomberg, Wikipedia) with a role in a hierarchy. |
| Data-quality mechanisms | confidence ∈ {high, medium, low} per field; manual_review: true flag | We have the uncertainty taxonomy. We do not have explicit data-gap records with negative-search documentation. |
| Citation contract | sources[] per brand, binding URL + access date + publisher to specific fields | We have URL + access date. We're missing verbatim quotes and precise locators (the parts that defend against hallucination without re-fetching). |
| Frozen protocol | data/run-state.json + per-call logs in logs/ | We log every claude -p call as JSON. We don't snapshot the instrument version alongside outputs. |
| Two-stage architecture (design / implementation) | Conflated: we hand-wrote the design | They use a design-agent to generate the instrument from a natural-language objective. We are the design agent. |
| Verification stage | verify_extension task type (Opus) | Spec'd in DESIGN.md, not implemented. They also skipped theirs, and admit "for any downstream use, the verification stage is not optional." |
| Append-only observation ledger | None; we overwrite brands.json (flock-guarded) | History is recoverable from per-call logs but isn't a queryable ledger. |
Where we're weaker than DRIL
No verbatim quotes in citations. Our enrichment cites the Forbes article that says Olipop was founded in 2018, but doesn't store the sentence. An auditor has to re-fetch and re-read to verify. DRIL's whole hallucination defense rests on the verbatim quote: "an invented quote will not match the cited source upon verification."
No source-role tagging on enrichment URLs. Our sources.json ranks list sources by tier, but enrichment URLs aren't ranked. DRIL would have these in a hierarchy: international organization > official government > professional firm > specialized press.
No explicit data-gap records. When enrichment can't find a founding year, the field is null. DRIL requires a record with reason ∈ {not_found_after_search, not_applicable, unclear_definition, conflict_unresolved, out_of_scope} and the queries tried. We have per-call search logs but they aren't linked to specific empty fields.
No verification stage executed. Same as them, but they admit this disqualifies their dataset from publication-grade use. Our verify_extension is spec'd in DESIGN.md, the auto-enqueue path is unimplemented, and enrich.sh uses Haiku without ever escalating to Opus. The edge cases in Criteria (Athletic Greens, Ghost, Liquid Death) are exactly the disagreements a verification pass should adjudicate.
No frozen instrument. Their design stage produces a versioned artifact locked before execution. Our prompts in pipeline/prompts/ are edited in place; reconstructing the instrument that produced a given enrichment row requires reading git history. The paper calls this out directly: "protocol versioning and re-execution so that design-stage corrections can be applied and their effects tracked without discarding the original run."
Schema drift between brands. Some sources[] entries are field-keyed ({field, url, publisher, accessed}); others are source-keyed ({url, title, fields[]}). Different enrichment runs produced different shapes. Worth normalizing.
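One way the schema drift could be normalized is to expand every source-keyed entry into field-keyed entries. The Python sketch below assumes the two shapes are exactly the ones named above and picks the field-keyed shape as the target; the function name and the choice to carry title along as an optional key are assumptions.

```python
def normalize_sources(sources: list) -> list:
    """Convert a mixed sources[] list to field-keyed entries only."""
    normalized = []
    for entry in sources:
        # Source-keyed entries list several fields; field-keyed entries name one.
        fields = entry["fields"] if "fields" in entry else [entry["field"]]
        for field_name in fields:
            normalized.append({
                "field": field_name,
                "url": entry["url"],
                "publisher": entry.get("publisher"),   # absent in source-keyed rows
                "accessed": entry.get("accessed"),     # absent in source-keyed rows
                "title": entry.get("title"),           # absent in field-keyed rows
            })
    return normalized
```

Missing keys become None rather than being guessed, so the normalized rows show exactly which metadata each enrichment run failed to record.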
Where we're stronger or different
Two-level unit space with emergent units. They run one agent per country (pre-enumerated). We run discovery (per source → list of brands) → merge → enrichment (per brand → fields). The merge step is a pure-jq normalizer (brand | ascii_downcase | gsub("[^a-z0-9]"; "")); they don't need this because their units are ISO codes. For us, the unit space emerges from discovery. The paper doesn't discuss this mode.
Heterogeneous extraction within one task type. discover_browser tries DOM/SVG text → network capture → screenshot + vision in sequence (the Bain Insurgent Brands chart needs at least two of these). DRIL implicitly assumes the agent's host provides one search/read capability.
Cost gate, not just cost telemetry. We read total_cost_usd from --output-format json and hard-halt at BUDGET_USD. Their cost figure ($21) is post-hoc reporting, not a budget gate.
Parallelism via drainers + directory-lock queue. pipeline/drainer.sh w1 w2 … against a mkdir-locked queue is more operational than anything in the paper. They ran sequentially.
Agentic autonomy at Tier III. Workers are handed a target and a prompt and decide what to search, which sources to trust, and what to extract. The discover_webfetch → discover_browser escalation is what the paper calls "the agent exercises judgment about where to look, what to trust, and how to handle the unexpected." The paper, however, runs on Codex CLI under a flat ChatGPT subscription; our chrome-devtools-MCP-in-subprocess pattern (--isolated --headless, --strict-mcp-config) is a more aggressive tooling story.
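The merge-step normalizer quoted above has a close Python equivalent, shown here to make the deduplication key concrete. The pipeline itself uses jq; the function names and the keep-first-spelling policy in dedupe are assumptions for the example.

```python
import re


def brand_key(name: str) -> str:
    """Mirror of: brand | ascii_downcase | gsub("[^a-z0-9]"; "")."""
    # Lowercase ASCII letters only, as jq's ascii_downcase does.
    lowered = "".join(c.lower() if "A" <= c <= "Z" else c for c in name)
    return re.sub(r"[^a-z0-9]", "", lowered)


def dedupe(names: list) -> list:
    """Keep the first spelling seen for each normalized key."""
    first_seen = {}
    for name in names:
        first_seen.setdefault(brand_key(name), name)
    return list(first_seen.values())
```

For instance, dedupe(["Olipop", "OLIPOP", "Liquid Death", "liquid-death"]) collapses to two entries, because spacing, case, and punctuation all vanish from the key. The same aggressiveness means two genuinely different brands whose names differ only in punctuation would also merge.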
What would change after reading them
Three concrete additions, in priority order:
- Verbatim quotes in the citation contract. Extend the enrichment schema so each non-trivial field carries {value, confidence, evidence: [{quote, url, accessed_at, locator?}]}.
- Implement verify_extension. Wire the auto-enqueue from is_extension_confidence: low into dispatch.sh, route to Opus, surface disagreements as manual_review. DRIL's own Argentina-VAT example is the kind of internally-consistent-but-judgment-call situation the verification pass exists for.
- Explicit gap records on enrichment. When a field is unfilled, attach {reason, queries_tried, sources_checked} instead of null. Converts "we don't know" from a silence into a queryable signal.
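The proposed field shape and gap record can be sketched as Python dataclasses. The key names follow the additions above and the reason values are the five DRIL categories listed earlier on this page; the class names, the quote_matches audit helper, and its whitespace-insensitive matching are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

GAP_REASONS = {"not_found_after_search", "not_applicable", "unclear_definition",
               "conflict_unresolved", "out_of_scope"}


@dataclass
class Evidence:
    quote: str
    url: str
    accessed_at: str
    locator: Optional[str] = None


@dataclass
class Gap:
    reason: str
    queries_tried: List[str] = field(default_factory=list)
    sources_checked: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.reason not in GAP_REASONS:
            raise ValueError(f"unknown gap reason: {self.reason!r}")


@dataclass
class FieldValue:
    value: Any
    confidence: str
    evidence: List[Evidence] = field(default_factory=list)
    gap: Optional[Gap] = None     # set when value is None, instead of a bare null


def quote_matches(ev: Evidence, page_text: str) -> bool:
    """Audit check: the stored quote must appear verbatim in the fetched page."""
    squash = lambda s: " ".join(s.split())   # ignore whitespace differences only
    return squash(ev.quote) in squash(page_text)
```

With the quote stored, an auditor's check becomes a substring test against a re-fetched page rather than a re-reading exercise, which is the property the citation contract is meant to provide.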
The paper validates the architecture. The implementation here is closer to theirs than expected; the gaps are the ones they themselves admit (no verification), plus the three above.
What this guarantees, and what it doesn't
Guarantees:
- No brand was added or excluded by hand-picking. The criteria are fixed and applied uniformly.
- Every classification is traceable to source URLs.
- The dataset is reproducible — re-running the loop with the same sources will produce a comparable dataset (model nondeterminism aside).
Doesn't guarantee:
- That the source lists themselves are exhaustive. If a brand isn't on Bain, Numerator, Circana, Inc. 5000 F&B, Pear Commerce, Food Institute, or Food Dive, it isn't here. See Sources for the full list.
- That every model classification is correct. The verify_extension second opinion is designed to catch obvious disagreements, but as the DRIL comparison above notes it was not executed in this run; subtler edge cases are flagged as manual_review (currently 0 in this run, with 5 brands flagged for incomplete data; see the Brands page).
- That the data won't drift. Acquisitions, valuations, and revenue change. The dataset reflects state as of the run timestamp.
Source code and design doc
The full design document is in the project repository at DESIGN.md. Key files:
- pipeline/loop.sh: top-level driver
- pipeline/dispatch.sh: one task → one worker
- pipeline/workers/*.sh: per-task-type implementations
- pipeline/prompts/*.md: prompt templates passed to claude -p
- pipeline/schemas/*.json: JSON schemas for validation
- pipeline/lib/chrome-devtools-isolated.json: MCP config for browser workers