MG-LLaVA: An Advanced Multi-Modal Model Adept at Processing Visual Inputs of Multiple Granularities, Including Object-Level Features, Original-Resolution Images, and High-Resolution Data
What are Multi-modal Large Language Models used for?

Multi-modal Large Language Models (MLLMs) are used for a wide range of visual and other multi-modal tasks, such as understanding and generating content across formats including text, images, audio, and video. They process and interpret information from different data sources, often simultaneously, extending large language models (LLMs) beyond text to these diverse data types.
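To make that flow concrete, the sketch below follows the common LLaVA-style recipe: a vision encoder produces patch features, a small projector maps them into the language model's embedding space, and the resulting visual tokens are concatenated with the text tokens before the LLM runs. All module names and dimensions are illustrative placeholders, not taken from MG-LLaVA or any specific model.

```python
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    """Minimal LLaVA-style pipeline: encode image patches, project them into
    the LLM embedding space, and concatenate with the text tokens."""

    def __init__(self, vision_dim=256, llm_dim=512, vocab_size=32000):
        super().__init__()
        # Stand-in for a ViT: 14x14 patch embedding over a 336x336 image.
        self.vision_encoder = nn.Conv2d(3, vision_dim, kernel_size=14, stride=14)
        # Small MLP projector from vision features to the LLM's token space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        # Stand-in for the language model itself.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, image, text_ids):
        patches = self.vision_encoder(image)              # (B, C, 24, 24)
        vis_tokens = self.projector(patches.flatten(2).transpose(1, 2))  # (B, 576, llm_dim)
        txt_tokens = self.text_embed(text_ids)            # (B, T, llm_dim)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)  # visual tokens first
        return self.llm(seq)

model = ToyMLLM()
out = model(torch.randn(2, 3, 336, 336), torch.randint(0, 32000, (2, 16)))
print(out.shape)  # torch.Size([2, 592, 512]): 576 visual + 16 text tokens
```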
How do MLLMs process low-resolution images?

Most MLLMs resize the input image to a fixed, relatively low resolution and extract visual features from the limited pixel information that remains. This often leads to less accurate identification of objects, scenes, or actions in the image because much of the detail has been discarded. To address this, researchers have proposed enhancements such as training on more diverse datasets and feeding in higher-resolution images, but challenges remain in capturing fine-grained details and recognizing small objects in complex scenes.
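A quick back-of-the-envelope calculation shows why the resize hurts. It assumes a simple longest-side resize to a 336-pixel input and the 14-pixel patches used by common CLIP-style encoders; the photo and object sizes are made-up numbers for illustration only.

```python
# How much detail does a fixed low-resolution input destroy?
photo_w, photo_h = 4000, 3000        # a typical camera image
target = 336                         # resolution the vision encoder expects
patch = 14                           # ViT patch size

scale = target / max(photo_w, photo_h)
obj_w, obj_h = 120, 80               # a small object, e.g. a street sign
new_w, new_h = obj_w * scale, obj_h * scale

print(f"scale factor: {scale:.3f}")
print(f"object after resize: {new_w:.1f} x {new_h:.1f} px")
print(f"patches covered: ~{(new_w / patch) * (new_h / patch):.2f}")
# The 120x80 px sign shrinks to roughly 10x7 px, i.e. well under a single
# 14x14 patch, so the encoder sees almost none of its detail.
```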
What limitations do current MLLMs face with low-resolution inputs?

Current MLLMs are limited on low-resolution inputs because far less visual information reaches the model. This leads to inaccuracies in identifying objects, scenes, or actions, and makes small objects and fine-grained details especially hard to recognize, reducing the overall effectiveness of MLLMs on visual tasks. MG-LLaVA targets these limitations by processing visual inputs of multiple granularities, adding object-level and high-resolution features alongside the standard low-resolution pathway.
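The sketch below illustrates that multi-granularity idea: gate high-resolution detail into the standard low-resolution feature map and add object-level tokens cut out with RoI-Align. It is a conceptual illustration under assumed shapes and a hypothetical gating module, not MG-LLaVA's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class MultiGranularityFusion(nn.Module):
    """Toy fusion of low-res, high-res, and object-level visual features."""

    def __init__(self, dim=256):
        super().__init__()
        # Hypothetical gate deciding how much high-res detail to inject.
        self.gate = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.Sigmoid())

    def forward(self, low_feat, high_feat, boxes):
        # low_feat:  (B, C, 24, 24) from the standard low-res pathway
        # high_feat: (B, C, 48, 48) from a high-res encoder
        # boxes: list of (K_i, 4) tensors in high-res feature coordinates
        high_down = F.adaptive_avg_pool2d(high_feat, low_feat.shape[-2:])
        gate = self.gate(torch.cat([low_feat, high_down], dim=1))
        fused = low_feat + gate * high_down                  # gated high-res detail
        obj_feat = roi_align(high_feat, boxes, output_size=(7, 7))  # (sum K_i, C, 7, 7)
        # Flatten everything into token sequences for the LLM projector.
        fused_tokens = fused.flatten(2).transpose(1, 2)      # (B, 576, C)
        obj_tokens = obj_feat.mean(dim=(2, 3))               # one token per object box
        return fused_tokens, obj_tokens

fusion = MultiGranularityFusion()
low = torch.randn(1, 256, 24, 24)
high = torch.randn(1, 256, 48, 48)
boxes = [torch.tensor([[4.0, 4.0, 20.0, 20.0]])]             # one box for the single image
tokens, obj = fusion(low, high, boxes)
print(tokens.shape, obj.shape)  # torch.Size([1, 576, 256]) torch.Size([1, 256])
```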