Sierra's new benchmark reveals how well AI agents perform at real work
What is TAU-bench designed to evaluate?

TAU-bench is designed to evaluate the performance and reliability of conversational AI agents in real-world settings. It tests agents on completing complex tasks while interacting with simulated users and tools to gather required information, focusing on their ability to follow rules, reason, retain information, and communicate effectively in realistic conversations [4].
Who created the AI startup Sierra?

Sierra, an AI startup focused on building conversational AI chatbots for businesses, was co-founded by Bret Taylor and Clay Bavor [4]. Bret Taylor is known for his work at Facebook, Salesforce, and OpenAI, while Clay Bavor is a Google veteran who led Google Labs and started Google's AR/VR effort, Project Starline, and Google Lens [5].
What are the three requirements identified for TAU-bench?

The three requirements identified for TAU-bench are: 1) agents must interact seamlessly with both humans and programmatic APIs over long periods to gather information and solve complex problems, 2) agents must accurately follow complex policies or rules specific to the task, and 3) agents must be consistent and reliable at scale [4].
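
The third requirement, consistency at scale, can be made concrete with a small calculation. The sketch below is an illustration only, not Sierra's code: it assumes each task is run `n_trials` times and estimates how likely it is that `k` independent attempts at the same task all succeed, averaged over tasks. The function name `pass_hat_k` and the sample numbers are hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate the probability that k attempts at a task ALL succeed.

    successes_per_task: number of successful trials for each task
                        (hypothetical data, each value between 0 and n_trials).
    n_trials:           how many times every task was attempted.
    k:                  how many attempts must all succeed.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")

    total = 0.0
    for c in successes_per_task:
        # Fraction of the possible k-sized subsets of trials in which
        # every trial succeeded; comb(c, k) is 0 when c < k.
        total += comb(c, k) / comb(n_trials, k)
    return total / len(successes_per_task)


if __name__ == "__main__":
    # Made-up results: three tasks, four trials each.
    successes = [4, 2, 0]
    for k in (1, 2, 4):
        print(f"k={k}: {pass_hat_k(successes, 4, k):.3f}")
```

With these made-up numbers the score falls from 0.500 at k=1 to 0.333 at k=4, because the task that succeeds only half the time stops counting once every attempt has to pass. An agent that looks acceptable on a single run can therefore still fall short on reliability.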