Pass^1 = 0.800

A working customer-service agent for Sierra's own τ-bench airline domain. Built in two days. Stack: Claude Sonnet 4.5, Vercel AI SDK v6, Upstash, Next.js 16.

τ-bench is Sierra's open agent-reliability benchmark: seeded users, flights, reservations, and policies in a fixed domain. “pass^1” is the probability that the agent passes a single attempt; “pass^k” is the probability that it passes every one of k i.i.d. attempts at the same task. This page exposes 5 representative tasks from the airline domain. Try the chips below, or type your own request.
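As a rough illustration of that definition, the sketch below estimates pass^k from per-task success counts using the combinatorial estimator commonly used for this metric: C(c, k) / C(n, k), averaged over tasks, where c is the number of passing trials out of n. The function names are hypothetical and this is not the code in scripts/sierra-eval.ts; the counts in the usage lines are the per-chip results reported under Methodology.

```typescript
// Binomial coefficient C(n, k); intended for the small n used in evals.
function choose(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    // After step i, result equals C(n - k + i, i), which is always an integer.
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// successes[i] = number of passing trials for task i, out of n trials each.
// Returns the mean over tasks of C(c, k) / C(n, k).
function passHatK(successes: number[], n: number, k: number): number {
  if (k < 1 || k > n) throw new RangeError("k must be between 1 and n");
  if (successes.length === 0) throw new RangeError("need at least one task");
  const perTask = successes.map((c) => choose(c, k) / choose(n, k));
  return perTask.reduce((sum, p) => sum + p, 0) / perTask.length;
}

// Per-chip passes out of 3: cancel, change-flight, refusal, lookup, contradiction.
const chipSuccesses = [3, 0, 3, 3, 3];
console.log(passHatK(chipSuccesses, 3, 1)); // 0.8
console.log(passHatK(chipSuccesses, 3, 3)); // 0.8
```

Because every chip here is either 3/3 or 0/3, pass^1 and pass^3 coincide; with mixed results such as 2/3 on a task, pass^k would fall as k grows.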


Methodology

The badge above is the aggregate pass^1 across the 5 canonical tasks listed below: n = 3 attempts per task (12 of 15 trials passed), model claude-sonnet-4-5, measured May 28, 2026. Each trial spins up a fresh τ-bench corpus, runs the task through the same agent loop the chat panel uses, and is scored on the final database state, not on the prose of the assistant reply.
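To make "scores on final database state" concrete, here is a minimal sketch of a state-based check. It assumes the corpus can be represented as a plain JSON-like object; the `Reservation` shape, the `RES1` and `F100` identifiers, and the function names are all hypothetical, and the real eval script may compare state differently.

```typescript
// Hypothetical record shape; the real τ-bench schema has more fields.
type Reservation = { status: "active" | "cancelled"; flights: string[] };
type Db = Record<string, Reservation>;

// Serialize with sorted object keys so property order does not affect equality.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonical).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonical(obj[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// A trial passes only if the final database matches the expected one exactly.
// The assistant's reply text is never inspected.
function scoreTrial(finalDb: Db, expectedDb: Db): boolean {
  return canonical(finalDb) === canonical(expectedDb);
}

const expectedDb: Db = { RES1: { status: "cancelled", flights: ["F100"] } };
const sameState: Db = { RES1: { flights: ["F100"], status: "cancelled" } };
const untouched: Db = { RES1: { status: "active", flights: ["F100"] } };

console.log(scoreTrial(sameState, expectedDb)); // true
console.log(scoreTrial(untouched, expectedDb)); // false
```

Scoring on state means a polite reply that claims a cancellation happened still fails if no mutating tool call actually changed the record.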

Sierra's published τ-bench leaderboard does not include a Sonnet 4.5 number; the widely cited figure (pass^1 ≈ 0.700) comes from third-party aggregators. The number above is our own measurement on these 5 tasks, not a re-run of Sierra's full 50-task suite. Eval script: scripts/sierra-eval.ts.

What the score actually means here. The τ-airline policy mandates explicit user confirmation before any mutating tool call. Our eval uses a multi-turn loop (up to 4 turns per trial) with a regex-based auto-confirmer that recognises common confirmation prompts and replies to them. The cancel chip passes 3/3 this way. The change-flight chip stays at 0/3 because Mia Li's prompt is genuinely ambiguous about which of her three reservations to modify, and a heuristic auto-confirmer cannot pick a candidate the way a real user would. The three non-mutating chips (refusal, lookup, contradiction handling) score 1.000 each: the agent stays on-policy in every trial. The right next step (v1.1) is a full τ³-bench-style simulated-user LLM that holds task intent across turns; it would lift the change-flight chip without changing the agent.
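The loop described above might look roughly like the following sketch. The regex patterns, the canned reply, and the `AgentTurn` signature are assumptions made for illustration; the patterns actually used in the eval are not shown on this page.

```typescript
// Hypothetical patterns for common confirmation prompts.
const CONFIRMATION_PATTERNS: RegExp[] = [
  /\bdo you (want|wish) (me )?to proceed\b/i,
  /\bconfirm\b/i,
  /\bshall i (go ahead|proceed)\b/i,
  /\b(reply|respond|say) "?yes\b/i,
];

// Returns a canned confirmation if the assistant asked for one, otherwise null.
function autoConfirm(assistantMessage: string): string | null {
  const asked = CONFIRMATION_PATTERNS.some((p) => p.test(assistantMessage));
  return asked ? "Yes, please proceed." : null;
}

// Hypothetical agent interface: one user message in, one assistant reply out.
type AgentTurn = (userMessage: string) => Promise<string>;

// Runs up to maxTurns turns, stopping early when there is nothing to confirm.
async function runTrial(
  agent: AgentTurn,
  firstMessage: string,
  maxTurns = 4,
): Promise<string[]> {
  const replies: string[] = [];
  let userMessage: string | null = firstMessage;
  for (let turn = 0; turn < maxTurns && userMessage !== null; turn++) {
    const reply = await agent(userMessage);
    replies.push(reply);
    userMessage = autoConfirm(reply);
  }
  return replies;
}

console.log(autoConfirm("I can cancel this reservation. Please confirm."));
console.log(autoConfirm("Which of your three reservations would you like to change?"));
```

The second call returns null: a clarifying question is not a confirmation prompt, so the loop ends without any mutating tool call, which is the failure mode behind the change-flight chip's 0/3.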

What's inside

Eight Zod-typed tools, in-memory τ-bench corpus per session, no real bookings, honest pass^1 scoring. Open-source companion repo at github.com/MaxHarar/sierra-tau-demo.

Day job: shipped a similar agent for medical demand packets at CURE — same kind of reliability problem, different domain.