
Existing open-source LMMs face several significant limitations. They often lack native integration and require adapters to align visual representations with pre-trained large language models (LLMs). Many LMMs are restricted to single-modality generation or rely on separate diffusion models for visual modeling and generation [4]. These limitations introduce complexity and inefficiency in both training and inference.

LMMs typically handle visual representation integration by connecting a visual encoder to a large language model [2]. They take in a large, fixed number of visual tokens, such as the penultimate-layer features of the CLIP visual encoder, as the prefix content [2]. Recent LMMs incorporate more complex structures to improve this integration.
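
For concreteness, the following is a minimal sketch of this common prefix-token integration, assuming a CLIP vision tower loaded through Hugging Face transformers and a simple linear adapter; the class name, model identifier, and hidden sizes are illustrative assumptions, not taken from any specific LMM.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class VisualPrefixAdapter(nn.Module):
    """Projects CLIP penultimate-layer patch features into the LLM embedding space."""

    def __init__(self, llm_hidden_size: int = 4096,
                 clip_name: str = "openai/clip-vit-large-patch14"):
        super().__init__()
        self.vision_encoder = CLIPVisionModel.from_pretrained(clip_name)
        # A single linear layer as the adapter; real systems may use an MLP
        # or cross-attention modules instead.
        self.proj = nn.Linear(self.vision_encoder.config.hidden_size,
                              llm_hidden_size)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        out = self.vision_encoder(pixel_values, output_hidden_states=True)
        # Penultimate-layer features, dropping the [CLS] token:
        # shape (batch, num_patches, clip_hidden_size).
        patch_feats = out.hidden_states[-2][:, 1:, :]
        # One projected token per image patch, so the visual prefix has a
        # fixed, fairly large length regardless of image content.
        return self.proj(patch_feats)  # (batch, num_patches, llm_hidden_size)

# Usage: the projected visual tokens are concatenated in front of the text
# token embeddings and the combined sequence is fed to the LLM.
# visual_prefix = VisualPrefixAdapter()(pixel_values)
# llm_inputs = torch.cat([visual_prefix, text_embeddings], dim=1)
```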

The main focus of open-source LMMs such as LLaVA and CogVLM is multimodal understanding: processing and interpreting information from multiple modalities, such as vision and language. These models typically rely on pre-trained LLMs as their backbone and may attach separate diffusion models for vision generation. However, they lack native integration and typically do not support multimodal generation on their own.
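
In this pattern, visual generation is delegated to a separately trained diffusion model rather than handled natively by the LMM. The sketch below illustrates the idea using the Hugging Face diffusers library; the pipeline class and model name are illustrative assumptions, not the setup of any particular system.

```python
from diffusers import StableDiffusionPipeline

def generate_image_from_lmm_output(prompt_text: str):
    # The LMM's role ends at producing prompt_text (a caption or an image
    # prompt); the actual pixels come from a separate diffusion pipeline
    # with its own weights, training recipe, and inference cost.
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    return pipe(prompt_text).images[0]

# image = generate_image_from_lmm_output("a photo of a red bicycle by a canal")
```

Because the two models are trained and run separately, this design adds the overhead and coordination cost that natively integrated multimodal generation aims to avoid.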