x402Bench benchmarks AI actions on testnet, ranking model alignment with Hedera/Chainlink/Ledger practice.
x402Bench LLM Readiness is an open, reproducible benchmark that tests whether AI models can make safe, correct, and executable payment/workflow decisions across Hedera, Chainlink, and Ledger tracks at ETHGlobal Cannes 2026. Instead of scoring only text quality, x402Bench evaluates full decision quality (allow/block, approval requirement, priority, risk, required controls), then checks execution eligibility, and for real cases attempts live workflow actions on testnet-integrated paths.

It runs each model in two comparable modes, with and without official protocol documentation context, so teams can measure true documentation impact versus pretrained knowledge. Outputs include leaderboard scores, per-sponsor breakdowns, pass/fail counts, latency, and case-level traces showing exactly what was attempted, skipped, or failed and why. The goal is to set a public quality bar for AI-driven blockchain actions, not just AI-generated explanations.
FastAPI backend, Next.js frontend, real testnet usage. We used OpenRouter, OpenAI, and Ollama for AI inference.
We built x402Bench as a full benchmark pipeline with a Node.js runner, a FastAPI control API, and a Next.js dashboard.
The core engine executes a versioned readiness suite (readiness_bench/suite.json) where each case defines expected policy outputs, execution mode (real vs decision_only), sponsor challenge mapping, and required documentation sources.
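A case entry in readiness_bench/suite.json might look like the sketch below. The exact field names (`execution_mode`, `docs_sources`, the `expected` keys) are illustrative assumptions based on the description above, not the project's actual schema:

```json
{
  "id": "hedera-pay-007",
  "sponsor": "hedera",
  "execution_mode": "real",
  "docs_sources": ["hedera_official_docs"],
  "expected": {
    "allow": false,
    "approval_required": true,
    "priority": "high",
    "risk": "high",
    "controls": ["spend_cap", "manual_review"]
  }
}
```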
For each model, x402Bench runs two comparable modes: with official protocol docs context and without docs context. The evaluator parses structured JSON outputs and scores decision correctness (allow/block, approval requirement, priority, risk, control set quality), strict full-match rate, execution-gate eligibility, workflow pass rate on real cases, and latency.
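A minimal sketch of what per-case scoring like this could look like; the field names (`allow`, `approval_required`, `priority`, `risk`, `controls`) mirror the decision schema described above but are assumptions about the actual evaluator:

```python
# Hypothetical per-case scorer: compares a model's structured JSON decision
# against the expected policy outputs defined in the suite.
EXPECTED_FIELDS = ("allow", "approval_required", "priority", "risk")

def score_case(expected: dict, actual: dict) -> dict:
    """Score one case: per-field correctness plus control-set overlap (Jaccard)."""
    field_hits = {f: expected.get(f) == actual.get(f) for f in EXPECTED_FIELDS}
    exp_controls = set(expected.get("controls", []))
    act_controls = set(actual.get("controls", []))
    union = exp_controls | act_controls
    control_score = len(exp_controls & act_controls) / len(union) if union else 1.0
    # Strict full match requires every field and the exact control set.
    strict_match = all(field_hits.values()) and exp_controls == act_controls
    decision_score = (sum(field_hits.values()) + control_score) / (len(EXPECTED_FIELDS) + 1)
    return {
        "fields": field_hits,
        "control_score": control_score,
        "strict_match": strict_match,
        "decision_score": decision_score,
    }
```

Averaging `decision_score` over all cases gives a graded leaderboard number, while `strict_match` feeds the stricter full-match rate.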
On partner tech integration:
The backend exposes run endpoints, idempotency handling, run-locking to prevent collisions, and report serving. Reports are written as both JSON and Markdown so results are machine-consumable and judge-readable.
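The run-locking and idempotency behavior could be sketched roughly as follows; `RunRegistry`, `start_run`, and the status strings are illustrative names, not the backend's actual API surface:

```python
# Hedged sketch of run-locking + idempotency: at most one benchmark run at a
# time, and a retried request with the same idempotency key replays the
# original run instead of starting a second one.
import threading
import uuid

class RunRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._active_run = None   # run id currently holding the lock, if any
        self._seen_keys = {}      # idempotency key -> run id

    def start_run(self, idempotency_key: str) -> dict:
        with self._lock:
            # Replay: the same key returns the original run id.
            if idempotency_key in self._seen_keys:
                return {"run_id": self._seen_keys[idempotency_key], "status": "duplicate"}
            # Run-lock: refuse to start while another run is in flight.
            if self._active_run is not None:
                return {"run_id": None, "status": "locked"}
            run_id = str(uuid.uuid4())
            self._active_run = run_id
            self._seen_keys[idempotency_key] = run_id
            return {"run_id": run_id, "status": "started"}

    def finish_run(self, run_id: str) -> None:
        with self._lock:
            if self._active_run == run_id:
                self._active_run = None
```

In a FastAPI handler, the endpoint would call `start_run` with a client-supplied key and map `"locked"` to an HTTP 409-style response.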
A notable hackathon optimization was adding automatic local fallback wiring for integration endpoints in API-triggered runs, so the benchmark stays runnable end-to-end even when some explicit endpoints are not yet configured. We also pinned a live fallback dashboard snapshot to ensure the UI always opens with a valid latest leaderboard and case-level traces.

