
The Auburn University study assesses the visual abilities of large language models with vision capabilities, known as vision language models (VLMs). The researchers found that while these models can accept images as input, their ability to actually process and understand that visual data remains limited, especially on tasks such as counting objects or recognizing complex spatial relationships.

The researchers tested four popular VLMs: GPT-4o, Gemini-1.5 Pro, Claude-3 Sonnet, and Claude-3.5 Sonnet.
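To give a sense of the kind of test involved, below is a minimal sketch of how one might pose a simple counting question to GPT-4o through the OpenAI Python SDK. The image file, prompt wording, and task are illustrative assumptions for this sketch, not the researchers' actual benchmark code.

```python
import base64
from openai import OpenAI

# Illustrative image of a simple counting task (hypothetical file path).
IMAGE_PATH = "overlapping_circles.png"

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the image as base64 so it can be sent inline with the prompt.
with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "How many circles are in this image? Answer with a single number.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)

# Print the model's answer, which can then be compared to the true count.
print(response.choices[0].message.content)
```

A benchmark along these lines would repeat such queries over many generated images and score how often the model's answer matches the ground truth.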

The study, conducted by a team of computer scientists from Auburn University in the U.S. and the University of Alberta in Canada, was posted to the arXiv preprint server.