CAT-BENCH: Evaluating Language Models’ Understanding of Temporal Dependencies in Procedural Texts

Ashley Smith
Jun 30

How do models perform on CAT-BENCH?

Socrates

Models struggle to predict step dependencies on CAT-BENCH. In the zero-shot, answer-only setting, GPT-3.5-turbo records the highest overall F1 score at 0.59, with GPT-4-turbo close behind at 0.58; surprisingly, GPT-4o fares notably worse at 0.49 F1. Asking models to explain their answers improves performance, with the best result being 0.73 F1 from Gemini-1.5-pro. However, models are biased toward predicting that steps are dependent, which skews the balance between precision and recall.
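To make the metrics concrete, here is a minimal sketch of the binary precision/recall/F1 computation (the labels are toy data, not benchmark results), illustrating how a bias toward answering "dependent" trades precision for recall:

```python
def precision_recall_f1(gold, pred, positive="dependent"):
    """Precision, recall, and F1 for the positive ("dependent") class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: a model that always answers "dependent" gets perfect
# recall but poor precision, which caps its F1.
gold = ["dependent", "not_dependent", "not_dependent", "dependent"]
pred = ["dependent"] * 4
print(precision_recall_f1(gold, pred))  # (0.5, 1.0, 0.666...)
```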

Donna Anderson
Jun 30

What common errors do LLMs make in CAT-BENCH?

Socrates

Common errors LLMs make on CAT-BENCH include misunderstanding multi-hop dependencies and failing to identify causal relationships between steps. Models also struggle to predict the correct order of steps and are often inconsistent, giving contradictory answers to related questions about the same step pair. Human evaluation of model-generated explanations shows substantial room for improvement in the models' understanding of step dependencies.
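One way to surface that inconsistency is to ask about the same step pair in both directions and flag contradictory answers. This is a hypothetical harness, not the benchmark's actual code; `ask_model` is a stand-in for whatever model API is under evaluation:

```python
def ask_model(recipe: str, question: str) -> str:
    """Placeholder for a call to the model under test.

    Assumed to return "yes" or "no"; not part of CAT-BENCH itself.
    """
    raise NotImplementedError

def is_consistent(recipe: str, step_i: int, step_j: int) -> bool:
    """Probe one step pair with both phrasings of the same dependency.

    "Must step i happen before step j?" and "Must step j happen after
    step i?" describe the same constraint, so a consistent model should
    answer both the same way.
    """
    before = ask_model(recipe, f"Must step {step_i} happen before step {step_j}?")
    after = ask_model(recipe, f"Must step {step_j} happen after step {step_i}?")
    return before == after
```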

Donna42
Jun 30

How does CAT-BENCH test temporal relations?

Socrates

CAT-BENCH tests temporal relations by evaluating models' ability to recognize the order of steps in cooking recipes, based on the causal relationships encoded in each recipe's directed acyclic graph (DAG). It asks whether one step must occur before or after another, covering both "before" and "after" phrasings of each dependency. Models are assessed on precision, recall, and F1 for predicting these dependencies, and on their ability to provide valid explanations for their judgments.
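As a rough illustration of how such questions can be derived from a dependency DAG (the graph below and its edge convention are assumptions for the sketch, not the benchmark's actual schema): a step must happen before another exactly when there is a directed path between them, i.e. when the pair appears in the graph's transitive closure.

```python
import itertools
import networkx as nx

# Hypothetical recipe DAG (assumed format): an edge (i, j) means step j
# causally depends on step i having been done first.
dag = nx.DiGraph([(1, 2), (2, 4), (3, 4)])

# Step i "must happen before" step j iff there is a directed path from
# i to j, i.e. iff (i, j) is an edge in the transitive closure.
closure = nx.transitive_closure_dag(dag)

for i, j in itertools.permutations(sorted(dag.nodes), 2):
    gold = "yes" if closure.has_edge(i, j) else "no"
    print(f"Must step {i} happen before step {j}? gold={gold}")
# Steps 1 and 3 have no path between them, so neither order is forced.
```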
