Pass^1 = 0.800

A working customer-service agent for Sierra's own τ-bench airline domain. Built in two days. Stack: Claude Sonnet 4.5, Vercel AI SDK v6, Upstash, Next.js 16.

τ-bench is Sierra's open agent-reliability benchmark — seeded users, flights, reservations, and policies in a fixed domain. “Pass^1” is the probability the agent passes a single attempt; “pass^k” is the probability it passes every attempt across k i.i.d. trials. This page exposes 5 representative tasks against the airline domain. Try the chips below, or type your own request.

Methodology

The badge above is the aggregate pass^1 across the 5 canonical tasks listed below, n = 3 attempts per task, model claude-sonnet-4-5, measured May 28, 2026. Each trial spins up a fresh τ-bench corpus, replays via the same agent loop the chat panel uses, and scores on final database state — not on the prose of the assistant reply.
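Scoring on final database state can be pictured as a snapshot comparison. The sketch below is illustrative only: the `Json` type, `canonical`, and `scoreTrial` are hypothetical names, and the real check in scripts/sierra-eval.ts may differ (for example by hashing the state). It assumes the database can be snapshotted as plain JSON.

```typescript
// Hypothetical JSON snapshot type for the in-memory corpus (an assumption).
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted, so key order never affects the comparison.
// Array order is preserved and therefore still significant.
function canonical(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map(canonical).join(",") + "]";
  }
  const entries = Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return "{" + entries.map(([k, v]) => JSON.stringify(k) + ":" + canonical(v)).join(",") + "}";
}

// A trial scores 1 only if the final state equals the expected state exactly.
// The assistant's reply text is never inspected.
function scoreTrial(finalDb: Json, expectedDb: Json): 0 | 1 {
  return canonical(finalDb) === canonical(expectedDb) ? 1 : 0;
}
```

Under this scheme a polite but wrong reply scores 0, and a terse reply that leaves the database in the expected state scores 1.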

Task                            pass^1 (passed/trials)
Change my flight to LAX         0.000 (0/3)
Cancel my reservation           1.000 (3/3)
Bump me to first class, free    1.000 (3/3)
What's my flight status?        1.000 (3/3)
Refund me but keep the seat     1.000 (3/3)
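The badge value can be reproduced from the per-task results above. The sketch below uses the combinatorial pass^k estimator as we understand it from the τ-bench paper: C(c, k) / C(n, k) per task, averaged over tasks, where c is the number of passing trials out of n. The names `TaskResult`, `choose`, `passHatK`, and `aggregatePassHatK` are hypothetical and not taken from scripts/sierra-eval.ts.

```typescript
interface TaskResult {
  task: string;
  passes: number; // trials that passed (c)
  trials: number; // trials attempted (n)
}

// Per-task counts copied from the results listed above (n = 3 each).
const results: TaskResult[] = [
  { task: "Change my flight to LAX", passes: 0, trials: 3 },
  { task: "Cancel my reservation", passes: 3, trials: 3 },
  { task: "Bump me to first class, free", passes: 3, trials: 3 },
  { task: "What's my flight status?", passes: 3, trials: 3 },
  { task: "Refund me but keep the seat", passes: 3, trials: 3 },
];

// Binomial coefficient C(n, k); returns 0 when k is out of range.
function choose(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Estimated probability that k trials of one task all pass.
// Requires 1 <= k <= r.trials, otherwise the denominator is 0.
function passHatK(r: TaskResult, k: number): number {
  return choose(r.passes, k) / choose(r.trials, k);
}

// Mean of the per-task estimates.
function aggregatePassHatK(rs: TaskResult[], k: number): number {
  const sum = rs.reduce((acc, r) => acc + passHatK(r, k), 0);
  return sum / rs.length;
}

console.log(aggregatePassHatK(results, 1).toFixed(3)); // prints "0.800"
```

For k = 1 the estimator reduces to the plain pass rate, so with equal trial counts the aggregate is 12 passing trials out of 15, which is the 0.800 on the badge.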

Sierra's published τ-bench leaderboard does not include a Sonnet 4.5 number; the widely cited figure (pass^1 ≈ 0.700) comes from third-party aggregators. The number above is our own measurement on these 5 tasks, not a re-run of Sierra's full 50-task suite. Eval script: scripts/sierra-eval.ts.

What the score actually means here. The τ-airline policy mandates explicit user confirmation before any mutating tool call. Our eval uses a multi-turn loop (up to 4 turns per trial) with a regex-based auto-confirmer that recognises common confirmation prompts and answers them. The cancel chip passes 3/3 this way. The change-flight chip stays at 0/3 because the prompt from that task's seeded user, Mia Li, is ambiguous about which of her three reservations to modify, and a heuristic auto-confirmer can't pick a candidate the way a real user would. The three non-mutating chips (refusal, lookup, contradiction handling) score 1.000 each: the agent stays on-policy in every trial. A full τ³-bench-style simulated-user LLM that holds task intent across turns is the right next step (v1.1). We expect it to lift the change-flight chip without any change to the agent.
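A minimal sketch of such a loop follows. Everything in it is an assumption made for illustration: the regex patterns, the canned reply, and the `AgentTurn` callback type are hypothetical, and the real eval's patterns are not shown on this page.

```typescript
// Hypothetical confirmation patterns (no `g` flag, so `.test` is stateless).
const CONFIRMATION_PROMPTS: RegExp[] = [
  /\b(shall|should) i (proceed|go ahead)\b/i,
  /\bdo you (want|wish) (me )?to (proceed|continue)\b/i,
  /\b(please )?confirm\b/i,
  /\bis (that|this) (correct|ok|okay)\b/i,
];

// Returns a canned "yes" when the reply looks like a confirmation prompt,
// or null when the heuristic has nothing sensible to say.
function autoConfirm(assistantReply: string): string | null {
  const matched = CONFIRMATION_PROMPTS.some((p) => p.test(assistantReply));
  return matched ? "Yes, please proceed." : null;
}

// Hypothetical agent interface: one user message in, one assistant reply out.
type AgentTurn = (userMessage: string) => Promise<string>;

// Runs at most `maxTurns` agent turns, answering confirmation prompts as it goes.
async function runTrial(
  agent: AgentTurn,
  firstMessage: string,
  maxTurns = 4,
): Promise<string[]> {
  const transcript: string[] = [];
  let userMessage = firstMessage;
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await agent(userMessage);
    transcript.push(userMessage, reply);
    const next = autoConfirm(reply);
    if (next === null) break; // e.g. a clarifying question the regexes can't answer
    userMessage = next;
  }
  return transcript;
}
```

A reply such as "Do you want me to proceed?" gets an automatic yes, while a clarifying question about which reservation to change matches no pattern and ends the trial. That is the failure mode behind the 0/3 on the change-flight chip.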

What's inside

Eight Zod-typed tools, in-memory τ-bench corpus per session, no real bookings, honest pass^1 scoring. Open-source companion repo at github.com/MaxHarar/sierra-tau-demo.

Day job: shipped a similar agent for medical demand packets at CURE — same kind of reliability problem, different domain.