Large multimodal models (LMMs) process and understand multiple data modalities, such as text, images, audio, video, and potentially others [3]. They can integrate and interpret information from these different sources, often simultaneously, and are best understood as extensions of large language models (LLMs) that work not only with text but with diverse data types [2][3]. LMMs are widely regarded as a next step toward artificial general intelligence and are applied to tasks that require understanding and integrating information across different types of data.
Current LMMs struggle with long video sequences because vision encoders produce an excessive number of visual tokens [4]. Models such as LLaVA-1.6 generate 576 to 2,880 visual tokens for a single image, and the count grows with every additional frame, creating a bottleneck in processing and understanding long videos.
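To make the bottleneck concrete, here is a minimal back-of-the-envelope sketch of the visual token budget, assuming the 576 to 2,880 tokens-per-image range quoted above and ignoring text tokens; the constants and function name are illustrative, not from any specific implementation.

```python
# Rough visual-token budget for a video clip, assuming a LLaVA-1.6-style
# encoder that emits 576-2,880 tokens per frame (illustrative assumption).
TOKENS_PER_FRAME_MIN = 576    # single low-resolution encoding
TOKENS_PER_FRAME_MAX = 2880   # high-resolution multi-tile encoding

def visual_token_budget(num_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens fed to the language model for one clip."""
    return num_frames * tokens_per_frame

for frames in (8, 32, 128, 1024):
    low = visual_token_budget(frames, TOKENS_PER_FRAME_MIN)
    high = visual_token_budget(frames, TOKENS_PER_FRAME_MAX)
    print(f"{frames:>5} frames -> {low:,} to {high:,} visual tokens")
```

Even at the low end, 128 frames already require over 73,000 visual tokens, which exceeds the context window of most standard language model backbones.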
The Long Context Transfer approach enhances an LMM's ability to understand long video sequences [1]. It extends the context length of the language model backbone so that the model can ingest a far larger number of visual tokens, and then aligns the context-extended language model with visual inputs [1]. Because the long-context capability transfers from the language modality to vision, the resulting model can process long videos effectively without any additional long-video training [1].
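The two-stage recipe can be sketched as follows. This is a minimal illustration, assuming context extension happens on text data before visual alignment; the class names, methods, and the 224,000-token context value are illustrative placeholders, not the method's actual API or configuration.

```python
# Hypothetical sketch of the long context transfer recipe:
# (1) extend the text-only backbone's context, (2) align it with visual inputs,
# then long videos fit in context with no long-video training data.
from dataclasses import dataclass

@dataclass
class LanguageBackbone:
    context_length: int

    def extend_context(self, new_length: int) -> "LanguageBackbone":
        # Stage 1: continue training the text-only backbone on long documents
        # so it accepts far more tokens (extension details are model-specific).
        return LanguageBackbone(context_length=new_length)

@dataclass
class MultimodalModel:
    backbone: LanguageBackbone
    tokens_per_frame: int = 576  # low-end LLaVA-1.6-style estimate

    def max_frames(self) -> int:
        # Stage 2: after aligning the extended backbone with a vision encoder,
        # the frame budget is bounded only by the context window.
        return self.backbone.context_length // self.tokens_per_frame

llm = LanguageBackbone(context_length=4_096).extend_context(224_000)
lmm = MultimodalModel(backbone=llm)
print(f"Frames that fit in context: {lmm.max_frames()}")  # ~388 at 576 tokens/frame
```

The design choice is that all long-context learning happens in the language domain, where long training data is plentiful, and the visual side simply inherits the enlarged window through alignment.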