Quality

Quality benchmarks

Small deterministic fixtures for measuring current tool behavior and catching regressions without claiming perfect accuracy.

Current production mode: Base mainnet USDC billing beta on primitive402.dev.

status.json is the live source Billing policy Mainnet readiness Mainnet monitoring

Benchmark philosophy

Primitive402 starts quality measurement with focused fixture evals that reflect real failure modes: prompt injection, source-grounded claim checks, return policies, subscription terms, safe fetch extraction, and page proof metadata.

The fixtures assert key fields instead of full response snapshots so tests catch meaningful regressions while avoiding brittle changes to hashes, snippets, ordering, or wording.

Run locally

pnpm eval:tools

The runner uses deterministic mocked content and does not call live external websites, paid x402 routes, or a live LLM provider.

Repository guide: docs/quality/benchmarks.md.

Current coverage

Prompt injection: instruction override, secret exfiltration, benign text, and hidden/obfuscated instruction markers.
Claim verification: supported, contradicted, not addressed, and noisy irrelevant source text.
Return policy extraction: Target-like 90-day text, final sale, exchange-only, no-policy, and blocked-page text.
Subscription terms extraction: trial, monthly billing, annual auto-renewal, cancel-anytime, no-refund, no-subscription, and blocked-page text.
Safe fetch and page proof: mocked/static content fixtures for extraction, risk scanning, proof metadata, and artifact toggles.

Product-fit structured data roadmap

Product-fit extraction parses schema.org Product/Offer data, Open Graph product metadata, and bounded embedded product JSON where safe, then uses those signals for conservative price and spec extraction.

The roadmap still requires no hallucinated prices, visible text corroboration for product title, price, availability, specs, and contradictions, and conservative handling when metadata conflicts with page text.

Repository plan: docs/quality/product-fit-structured-data-plan.md.

Known limitations

These evals do not prove accuracy on every website, merchant policy, claim, or adversarial input. Live websites can change, block requests, render content with JavaScript, personalize content, or expose policy text differently from deterministic fixtures.

Blocked, empty, or JavaScript shell pages should produce low-confidence or no-term outputs instead of hallucinated policy fields. Confidence is a signal about extracted evidence strength, not a guarantee of correctness.

Primitive402 should not invent facts that are not present in source text. When content does not address a claim or policy, outputs should remain conservative.