x402Bench benchmarks AI actions on testnet, ranking model alignment with Hedera/Chainlink/Ledger practice.
x402Bench LLM Readiness is an open, reproducible benchmark that tests whether AI models can make safe, correct, and executable payment/workflow decisions across Hedera, Chainlink, and Ledger tracks at ETHGlobal Cannes 2026. Instead of scoring only text quality, x402Bench evaluates full decision quality (allow/block, approval requirement, priority, risk, required controls), then checks execution eligibility, and for real cases attempts live workflow actions on testnet-integrated paths.

It runs each model in two comparable modes, with and without official protocol documentation context, so teams can measure true documentation impact versus pretrained knowledge. Outputs include leaderboard scores, per-sponsor breakdowns, pass/fail counts, latency, and case-level traces showing exactly what was attempted, skipped, or failed and why. The goal is to set a public quality bar for AI-driven blockchain actions, not just AI-generated explanations.
FastAPI backend, Next.js frontend, real testnet usage. We used OpenRouter, OpenAI, and Ollama for AI inference.
We built x402Bench as a full benchmark pipeline with a Node.js runner, a FastAPI control API, and a Next.js dashboard.
The core engine executes a versioned readiness suite (readiness_bench/suite.json) where each case defines expected policy outputs, execution mode (real vs decision_only), sponsor challenge mapping, and required documentation sources.
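A case entry in readiness_bench/suite.json might look like the sketch below. The exact field names (`execution_mode`, `docs_sources`, the `expected` keys) are illustrative assumptions based on the description above, not the project's actual schema:

```json
{
  "id": "hedera-pay-007",
  "sponsor": "hedera",
  "execution_mode": "real",
  "docs_sources": ["hedera_official_docs"],
  "expected": {
    "allow": false,
    "approval_required": true,
    "priority": "high",
    "risk": "high",
    "controls": ["spend_cap", "manual_review"]
  }
}
```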
For each model, x402Bench runs two comparable modes: with official protocol docs context and without docs context. The evaluator parses structured JSON outputs and scores decision correctness (allow/block, approval requirement, priority, risk, control set quality), strict full-match rate, execution-gate eligibility, workflow pass rate on real cases, and latency.
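A minimal sketch of what per-case scoring like this could look like; the field names (`allow`, `approval_required`, `priority`, `risk`, `controls`) mirror the decision schema described above but are assumptions about the actual evaluator:

```python
# Hypothetical per-case scorer: compares a model's structured JSON decision
# against the expected policy outputs defined in the suite.
EXPECTED_FIELDS = ("allow", "approval_required", "priority", "risk")

def score_case(expected: dict, actual: dict) -> dict:
    """Score one case: per-field correctness plus control-set overlap (Jaccard)."""
    field_hits = {f: expected.get(f) == actual.get(f) for f in EXPECTED_FIELDS}
    exp_controls = set(expected.get("controls", []))
    act_controls = set(actual.get("controls", []))
    union = exp_controls | act_controls
    control_score = len(exp_controls & act_controls) / len(union) if union else 1.0
    # Strict full match requires every field and the exact control set.
    strict_match = all(field_hits.values()) and exp_controls == act_controls
    decision_score = (sum(field_hits.values()) + control_score) / (len(EXPECTED_FIELDS) + 1)
    return {
        "fields": field_hits,
        "control_score": control_score,
        "strict_match": strict_match,
        "decision_score": decision_score,
    }
```

Averaging `decision_score` over all cases gives a graded leaderboard number, while `strict_match` feeds the stricter full-match rate.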
On partner tech integration:
The backend exposes run endpoints, idempotency handling, run-locking to prevent collisions, and report serving. Reports are written as both JSON and Markdown so results are machine-consumable and judge-readable.
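The run-locking and idempotency behavior could be sketched roughly as follows; `RunRegistry`, `start_run`, and the status strings are illustrative names, not the backend's actual API surface:

```python
# Hedged sketch of run-locking + idempotency: at most one benchmark run at a
# time, and a retried request with the same idempotency key replays the
# original run instead of starting a second one.
import threading
import uuid

class RunRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._active_run = None   # run id currently holding the lock, if any
        self._seen_keys = {}      # idempotency key -> run id

    def start_run(self, idempotency_key: str) -> dict:
        with self._lock:
            # Replay: the same key returns the original run id.
            if idempotency_key in self._seen_keys:
                return {"run_id": self._seen_keys[idempotency_key], "status": "duplicate"}
            # Run-lock: refuse to start while another run is in flight.
            if self._active_run is not None:
                return {"run_id": None, "status": "locked"}
            run_id = str(uuid.uuid4())
            self._active_run = run_id
            self._seen_keys[idempotency_key] = run_id
            return {"run_id": run_id, "status": "started"}

    def finish_run(self, run_id: str) -> None:
        with self._lock:
            if self._active_run == run_id:
                self._active_run = None
```

In a FastAPI handler, the endpoint would call `start_run` with a client-supplied key and map `"locked"` to an HTTP 409-style response.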
A notable hackathon optimization was adding automatic local fallback wiring for integration endpoints in API-triggered runs, so the benchmark stays runnable end-to-end even when some explicit endpoints are not yet configured. We also pinned a live fallback dashboard snapshot to ensure the UI always opens with a valid latest leaderboard and case-level traces.

