τ-bench: A New Benchmark to Evaluate AI Agents’ Performance and Reliability in Real-World Settings with Dynamic User and Tool Interaction
How does τ-bench simulate real-world interactions?

τ-bench simulates real-world interactions by emulating dynamic conversations between a language agent and a simulated human user, incorporating domain-specific APIs and policy guidelines. It evaluates whether an agent can interact consistently and reliably by comparing the database state at the end of a conversation against the expected goal state. The framework includes diverse databases, APIs, and user simulations for the retail and airline domains, emphasizing complex, open-ended tasks and consistent rule-following.
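To make the evaluation concrete, here is a minimal Python sketch of this outcome-based check, assuming the database snapshot is a plain dictionary; `evaluate_task` and the reservation data are illustrative stand-ins, not τ-bench's actual API:

```python
from typing import Any

def evaluate_task(final_db: dict[str, Any], goal_db: dict[str, Any]) -> float:
    """Outcome-based reward: 1.0 iff the conversation left the database
    in exactly the expected goal state, 0.0 otherwise."""
    return 1.0 if final_db == goal_db else 0.0

# Hypothetical airline-style task: the agent was asked to move the
# passenger to seat 12A, but booked 14C instead.
goal_db = {"reservation_42": {"flight": "SFO->JFK", "seat": "12A"}}
final_db = {"reservation_42": {"flight": "SFO->JFK", "seat": "14C"}}

print(evaluate_task(final_db, goal_db))  # 0.0 -- any deviation fails the task
```

Grading the end state of the database, rather than the dialogue transcript, keeps the check objective even though the conversation itself is open-ended: a single wrong write fails the task.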
What domains did τ-bench specifically test language agents in?

τ-bench tested language agents in two domains: retail and airlines. These domains were chosen because they balance ease of data synthesis and policy specification with the potential for diverse, realistic applications.
What were the main findings from τ-bench's evaluation of GPT-4?

τ-bench evaluated GPT-4's performance in dynamic conversations with a simulated human user, incorporating domain-specific APIs and policy guidelines. Even as the best-performing model, GPT-4 succeeded on less than 50% of tasks and behaved inconsistently across repeated trials of the same task. It struggled in particular with database reasoning, following domain-specific rules, and handling compound requests.
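The inconsistency across trials is what τ-bench's reliability measurement targets. As a rough illustration, in the spirit of the paper's pass^k metric, the sketch below estimates the probability that all k independent trials of the same task succeed; `pass_hat_k` and the sample results are hypothetical:

```python
from math import comb

def pass_hat_k(trial_results: list[bool], k: int) -> float:
    """Unbiased estimate of the probability that all k i.i.d. trials of a
    task succeed: C(c, k) / C(n, k), given c successes over n trials."""
    n, c = len(trial_results), sum(trial_results)
    return comb(c, k) / comb(n, k)

# A task that succeeds on 4 of 8 trials looks mediocre on average
# (pass^1 = 0.5) and far worse when it must succeed every time:
results = [True, False, True, True, False, False, True, False]
for k in (1, 2, 4):
    print(f"pass^{k} = {pass_hat_k(results, k):.3f}")  # 0.500, 0.214, 0.014
```

The steep drop as k grows shows why a model that clears half its tasks on average can still be far too unreliable for real-world deployment.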