Benchmark philosophy
Primitive402 starts quality measurement with focused fixture evals that target real failure modes in six areas: prompt injection detection, source-grounded claim checks, return policy extraction, subscription terms extraction, safe fetch extraction, and page proof metadata.
The fixtures assert key fields instead of full response snapshots so tests catch meaningful regressions while avoiding brittle changes to hashes, snippets, ordering, or wording.
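As a minimal sketch of what key-field assertions can look like, the helper below compares only the fields a fixture names and ignores everything else. The `ReturnPolicyResult` shape, its field names, and `checkKeyFields` are hypothetical and chosen for illustration; they are not the actual Primitive402 response schema or runner API.

```typescript
// Hypothetical response shape; field names are assumptions for illustration.
type ReturnPolicyResult = {
  returnWindowDays: number | null;
  finalSale: boolean;
  snippet: string;     // volatile: wording may change between runs
  contentHash: string; // volatile: changes with any byte of the input
};

// Compare only the fields listed in `expected`; all other fields are ignored.
function checkKeyFields<T extends object>(actual: T, expected: Partial<T>): string[] {
  const failures: string[] = [];
  for (const key of Object.keys(expected) as (keyof T)[]) {
    if (actual[key] !== expected[key]) {
      failures.push(
        `${String(key)}: expected ${JSON.stringify(expected[key])}, got ${JSON.stringify(actual[key])}`
      );
    }
  }
  return failures;
}

// Example: the snippet and hash can change freely without failing the check.
const actual: ReturnPolicyResult = {
  returnWindowDays: 90,
  finalSale: false,
  snippet: "Most items can be returned within 90 days.",
  contentHash: "abc123",
};

const failures = checkKeyFields(actual, { returnWindowDays: 90, finalSale: false });
console.log(failures.length === 0 ? "pass" : failures.join("\n"));
```

The strict `!==` comparison suits primitive key fields only; a fixture that asserts on arrays or nested objects would need a deep comparison instead.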
Run locally
pnpm eval:tools
The runner uses deterministic mocked content and does not call live external websites, paid x402 routes, or a live LLM provider.
Repository guide: docs/quality/benchmarks.md.
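One way to keep a runner deterministic and offline is to serve page text from an in-memory fixture table and fail loudly on any URL that is not in it. The sketch below assumes that approach; `FIXTURE_PAGES`, `mockFetchText`, and the example URLs are hypothetical names and not taken from the repository.

```typescript
// Hypothetical fixture table mapping URLs to static page text.
// The .invalid TLD is reserved, so these URLs can never resolve to a live site.
const FIXTURE_PAGES = new Map<string, string>([
  ["https://fixtures.invalid/returns/90-day", "Most items can be returned within 90 days."],
  ["https://fixtures.invalid/blocked", "Access denied."],
]);

// Return fixture text for a known URL; throw rather than touch the network.
function mockFetchText(url: string): string {
  const body = FIXTURE_PAGES.get(url);
  if (body === undefined) {
    throw new Error(`No fixture for ${url}; evals must not reach the network`);
  }
  return body;
}

console.log(mockFetchText("https://fixtures.invalid/returns/90-day"));
```

Throwing on an unknown URL makes a missing fixture show up as a test failure instead of a silent live request.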
Current coverage
- Prompt injection: instruction override, secret exfiltration, benign text, and hidden/obfuscated instruction markers.
- Claim verification: supported, contradicted, not addressed, and noisy irrelevant source text.
- Return policy extraction: Target-like 90-day text, final sale, exchange-only, no-policy, and blocked-page text.
- Subscription terms extraction: trial, monthly billing, annual auto-renewal, cancel-anytime, no-refund, no-subscription, and blocked-page text.
- Safe fetch and page proof: mocked/static content fixtures for extraction, risk scanning, proof metadata, and artifact toggles.
Known limitations
These evals do not prove accuracy on every website, merchant policy, claim, or adversarial input. Live websites can change, block requests, render content with JavaScript, personalize content, or expose policy text differently from deterministic fixtures.
Blocked, empty, or JavaScript-shell pages should produce low-confidence or no-term outputs instead of hallucinated policy fields. Confidence signals the strength of the extracted evidence; it is not a guarantee of correctness.
Primitive402 should not invent facts that are not present in source text. When content does not address a claim or policy, outputs should remain conservative.
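The conservative-output expectation for blocked or empty pages can be expressed as a small predicate. This is a sketch under stated assumptions: the `ExtractionResult` shape, the 0 to 1 confidence scale, and the `LOW_CONFIDENCE_MAX` threshold are all hypothetical and not taken from the Primitive402 codebase.

```typescript
// Hypothetical output shape: extracted policy fields plus a confidence score.
type ExtractionResult = {
  terms: Record<string, string>; // empty when no policy terms were found
  confidence: number;            // assumed to range from 0 to 1
};

// Assumed cutoff for "low confidence"; the real threshold may differ.
const LOW_CONFIDENCE_MAX = 0.3;

// A result is conservative if it reports no terms or reports them with low confidence.
function isConservative(result: ExtractionResult): boolean {
  const hasNoTerms = Object.keys(result.terms).length === 0;
  return hasNoTerms || result.confidence <= LOW_CONFIDENCE_MAX;
}

// A blocked-page fixture should yield something like this.
const blockedPageResult: ExtractionResult = { terms: {}, confidence: 0.1 };

// A hallucinated result: confident policy fields with no supporting source text.
const hallucinatedResult: ExtractionResult = {
  terms: { returnWindow: "30 days" },
  confidence: 0.9,
};

console.log(isConservative(blockedPageResult), isConservative(hallucinatedResult));
```

A predicate like this checks only the shape of the output. It cannot tell whether low-confidence terms are themselves correct, which matches the point above that confidence is not a guarantee of correctness.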