
LLMs struggle with unfamiliar tasks because they rely heavily on pattern recognition and memorization of their training data. Faced with a scenario that departs from those memorized patterns, they fail to generalize or reason robustly, and their performance degrades.

The MIT CSAIL study examines the reasoning capabilities of large language models (LLMs) across a variety of tasks and counterfactual scenarios. The research reveals that these models often fail to generalize to unfamiliar situations: their high performance is mostly limited to common task variants, and may be attributable to overfitting or memorization of training data rather than genuine reasoning.

To test this, the researchers compared the models' performance on "default tasks", the standard conditions on which LLMs are typically trained and evaluated, against "counterfactual scenarios", perturbed versions of the same tasks. The benchmarks targeted different aspects of the models' capabilities, including arithmetic, chess, code evaluation, and logical question answering. Performance dropped consistently in the counterfactual settings, suggesting that LLMs' reasoning abilities are often overestimated: the models lean on memorized patterns rather than procedures that transfer to unfamiliar variants.
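As a concrete illustration of the default-versus-counterfactual setup, the sketch below contrasts base-10 addition (a heavily rehearsed default) with base-9 addition (a counterfactual variant of the same skill). The prompt wording and the evaluation helper here are illustrative assumptions, not the study's actual harness.

```python
# A minimal sketch of a default-vs-counterfactual arithmetic probe.
# The prompt template and checker are illustrative assumptions,
# not the MIT CSAIL study's actual evaluation code.

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base."""
    digits = []
    while True:
        n, r = divmod(n, base)
        digits.append(str(r))
        if n == 0:
            break
    return "".join(reversed(digits))

def make_prompt(a: int, b: int, base: int) -> str:
    """Build an addition question posed entirely in the given base."""
    return (
        f"You are working in base-{base}. "
        f"What is {to_base(a, base)} + {to_base(b, base)}? "
        f"Answer with the base-{base} result only."
    )

def is_correct(model_answer: str, a: int, b: int, base: int) -> bool:
    """Score a model's reply against the true sum, rendered in that base."""
    return model_answer.strip() == to_base(a + b, base)

# Default task: base 10, abundantly represented in training data.
print(make_prompt(27, 65, base=10))  # expects the answer "92"

# Counterfactual task: the same operation in unfamiliar base 9.
print(make_prompt(27, 65, base=9))   # "30 + 72" here; expects "112"
assert is_correct("112", 27, 65, base=9)
```

A model that genuinely executes the addition procedure should handle both prompts, while a model that has mostly memorized base-10 arithmetic will score well on the first and poorly on the second; that gap between paired conditions is what a counterfactual evaluation of this kind measures.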