The Best Thing My Agent Did Was Say No
A few weeks ago I went down a rabbit hole on Sierra.
Sierra is Bret Taylor's company. They build customer-service AI agents for big companies, the kind that sit on a phone line or a chat box and actually resolve your problem instead of routing you in circles. What pulled me in wasn't the product though. It was the way they talk about reliability. They published an open benchmark called τ-bench, an airline and a retail world with seeded users, flights, reservations, and a policy document, and a metric they call pass^k. Not "does it work once in the demo." Does it work the same way every single time.
That distinction stuck with me. So I did what I always do when something gets under my skin. I built against it.
The result lives at /sierra. It's a working customer-service agent for Sierra's airline domain, running their actual benchmark data, with the score it actually earned sitting right at the top of the page. Go click around before you read the rest of this. The chips at the bottom are real requests. The trace panel on the right shows every tool call as it happens.

Why I cared about this one
What pulled me in was plain curiosity. You give a model some tools and some rules, you let it loose on a real request, and the entire game becomes whether you can trust what it does without watching it. I wanted to know how well that actually holds, and what happens to it as the system gets bigger.
That's the thing nobody tells you about agents. Getting one to do something is easy. A demo where it books a flight is a weekend project. Getting one that does the right thing every time, including the times when the right thing is to do nothing, is a completely different problem.
Sierra has language for this that I kept stealing while I built. They call it calibrated action. An agent that knows the boundary of what it's allowed to do, and holds it, even when the user is leaning on it. I wanted to feel where that boundary actually is. The only way to feel it is to build the agent and watch it hit the wall.
What it actually does
Under the hood it's eight typed tools wired to Claude Sonnet 4.5: look up a user, search flights, read a reservation, update one, cancel one, check a baggage allowance, that kind of thing. The data is Sierra's seeded τ-bench corpus, dropped in as JSON. No real bookings, no live airline API. The point isn't to book you a flight. The point is to see whether the agent reasons correctly about a fixed world.
I picked five requests that each test something different:
- Change my flight to LAX. The happy path. Read the reservation, find a flight, confirm, commit.
- Cancel my reservation. Policy-driven. Check the fare rule, figure out the refund, confirm before doing anything.
- Bump me to first class, free. A trap. The policy forbids it.
- What's my flight status? A plain lookup. No reason to touch the database.
- Refund me but keep the seat. A contradiction. You can't have both.
Every one of these streams its work in real time. You watch the tool calls fire, you watch the itinerary update, you watch the agent talk itself through the policy. There's no spinner hiding the thinking. The thinking is the show.
The number, and the one that's honest about itself
The badge says pass^1 = 0.80. That's the aggregate across the five tasks, three runs each, scored on the final state of the database, not on whether the agent's reply sounded nice.
Four of the five tasks pass every time. Cancel, refuse the upgrade, look up the status, catch the contradiction. 1.000 each.
The fifth one, the flight change, scores zero. Three out of three failures. And I left that on the page on purpose.
Here's what's actually going on there. The user in that task is Mia Li, and Mia has three reservations. Her request is genuinely ambiguous about which one she wants to change. A real person would just answer "the one to Chicago" when the agent asked. My eval doesn't have a real person in it. It has a dumb little script that auto-confirms anything that looks like a yes/no question, and that script can't pick a flight out of three the way a human would. So the agent does the correct thing, asks for clarification, and my test harness fails it for not guessing.
That's not the agent being wrong. That's my eval being too crude to score a smart refusal. The honest fix is a simulated user that actually holds the task in its head across turns, which is exactly what the newer τ²-bench does. It's on the list for v1.1.
I could have cherry-picked four tasks and put a clean 1.000 on the page. A real score with a real hole in it is worth more than a fake perfect one. Anybody who knows this space would smell the fake immediately.
What I learned
The refusal is the hard part, and it's the whole point. The single most impressive thing this agent does is decline the free first-class upgrade and not move a single byte in the database. No tool call, no apology that quietly does the thing anyway. Just no. Watching it hold that line did more to convince me the thing works than any successful booking. A system you can trust to act is really a system you can trust to not act.
Single-shot is a lie you tell yourself. Anything will work once. I had this agent "working" in an afternoon. Then I ran each task ten times in a row and watched the cracks. pass^k is uncomfortable on purpose, because it measures the gap between a demo and a product, and that gap is where all the real engineering lives.
Your test harness is part of the system. The flight-change failure isn't an agent bug, it's an eval bug, and for a while I couldn't tell which. When you can't trust your scoring, you can't trust your score. I spent more time getting the judge right than getting the agent right, and that was the correct ratio.
Honest numbers travel further than polished ones. The whole thing cost about seventy cents to evaluate end to end. The score has a visible flaw. I shipped both facts. I'd rather one person trust the 0.80 than ten people half-believe a 1.00.
What's next
The demo is staying live at /sierra and the code is open at github.com/MaxHarar/sierra-tau-demo. This was just a genuinely interesting thing to build.
The next version swaps my crude auto-confirmer for a real simulated user, which should pull that flight-change task up without me touching the agent at all. After that, voice, since the whole pitch of these systems is one agent across chat and voice and email.
But the thing I keep turning over is scale. Five tasks and eight tools is a toy. What I actually want to understand now is what happens when it's hundreds of tools, a policy document that contradicts itself in three places, and thousands of these conversations running at once. The boundary is easy to hold when the world is small. I want to know what holds it when the world isn't.
I went in expecting to be impressed when the machine did what I asked. I came out more impressed that it knew when not to. The open question is whether it still knows at scale.
Go try it. Try to talk it into the free upgrade. It won't budge.