
The primary method by which most Large Multimodal Models (LMMs) integrate vision and language is to convert images into visual tokens, which are then fed as sequences into Large Language Models (LLMs). This approach lets the models process and understand multimodal inputs by combining textual and visual information. However, it can significantly increase memory and computation demands, especially when dealing with high-resolution images or videos.
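To make the conventional recipe concrete, here is a minimal sketch in PyTorch of how visual tokens are typically prepended to the text sequence at the first LLM layer. The class name, dimensions, and token count are illustrative assumptions, not taken from any particular model.

import torch
import torch.nn as nn

# Hypothetical sizes for illustration only; real models vary.
VISION_DIM, LLM_DIM, NUM_VISUAL_TOKENS = 1024, 4096, 576

class VisualPrefix(nn.Module):
    """Schematic of the conventional recipe: project vision-encoder features
    into the LLM embedding space and prepend them to the text tokens."""
    def __init__(self):
        super().__init__()
        self.projector = nn.Linear(VISION_DIM, LLM_DIM)

    def forward(self, visual_features, text_embeddings):
        # visual_features: (batch, NUM_VISUAL_TOKENS, VISION_DIM) from a vision encoder
        # text_embeddings: (batch, text_len, LLM_DIM) from the LLM embedding table
        visual_tokens = self.projector(visual_features)
        # Every visual token enters the first LLM layer as part of the sequence,
        # so context length and attention cost grow with image resolution.
        return torch.cat([visual_tokens, text_embeddings], dim=1)

model = VisualPrefix()
vis = torch.randn(1, NUM_VISUAL_TOKENS, VISION_DIM)
txt = torch.randn(1, 32, LLM_DIM)
print(model(vis, txt).shape)  # torch.Size([1, 608, 4096]) — the sequence has grown by 576 tokens

Because the visual tokens all land in the input sequence, a higher-resolution image (or a video with many frames) directly inflates the context length that every transformer layer must attend over.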

The "DeepStack" architecture developed by Fudan University and Microsoft differs from traditional LMM architectures in the way it integrates visual tokens into the language model. Traditional LMMs convert images into visual tokens and feed them as a long sequence into the first layer of the language model, which significantly increases memory and computation demands1. In contrast, DeepStack distributes these visual tokens across multiple layers of the model, aligning each group with a corresponding layer. This bottom-to-top approach enhances the model's ability to process complex visual inputs without increasing computational costs. DeepStack also uses a dual-stream approach to incorporate fine-grained visual details without increasing context length, dividing image processing into a global view stream and a high-resolution stream that adds detailed image features across LLM layers.

The key challenges associated with the traditional method of integrating visual information into Large Multimodal Models (LMMs) are as follows:
Increased Memory and Computation Demands: Converting images into visual tokens and feeding them as sequences into LMMs significantly increases memory and computation demands, especially with high-resolution images or videos.
Compromise on Detailed Visual Information: Techniques like spatial grouping and token compression, which aim to reduce the number of visual tokens, often compromise on detailed visual information. This can impact the model's ability to understand and process complex visual inputs.
Processing Overhead: The fundamental approach of transforming visual tokens into a 1D sequence and inputting them into LMMs inherently increases processing overhead. This is particularly challenging when handling high-resolution images and complex visual inputs.
These challenges highlight the need for more efficient and effective methods to integrate visual information into LMMs, such as the DeepStack architecture.