Mini-Gemini: A Streamlined AI Framework Amplifying Vision Language Models (VLMs) Across Multiple Modalities

Vision Language Models (VLMs) sit at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), aiming for human-like comprehension by interpreting images and words together. Aligning the two modalities well remains a difficult open problem that occupies researchers worldwide.


Recent models such as LLaVA and BLIP-2 leverage vast collections of image-text pairs to achieve precise cross-modal alignment. Follow-up efforts like LLaVA-NeXT and Otter-HD focus on raising image resolution and visual-token quality within VLMs, though processing high-resolution images remains computationally demanding. Other work, including InternLM-XComposer and the token-prediction approaches used in EMU and SEED, extends VLMs toward direct image decoding, albeit with latency and resource challenges.


A fresh approach comes from the Chinese University of Hong Kong and SmartMore: Mini-Gemini. The framework strengthens multi-modal input processing in VLMs through a dual-encoder setup, a novel patch info mining technique, and a carefully curated training dataset. Mini-Gemini stands out by handling high-resolution images effectively and by producing rich visual and textual content.
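To make the dual-encoder idea concrete, here is a minimal PyTorch sketch of a two-branch vision front end: a ViT-style encoder turns a low-resolution view into coarse visual tokens, while a small convolutional stack keeps a dense feature map of a high-resolution view. Module names, resolutions, and dimensions are illustrative assumptions, not Mini-Gemini's actual implementation.

```python
# Hypothetical dual-encoder sketch: low-res patch tokens + high-res conv features.
import torch
import torch.nn as nn


class DualVisionEncoder(nn.Module):
    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        # Low-resolution branch: patchify a 336px view into coarse visual tokens.
        self.lr_patch_embed = nn.Conv2d(3, embed_dim, kernel_size=14, stride=14)
        self.lr_blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # High-resolution branch: a small conv stack standing in for a ConvNet
        # backbone that preserves a dense feature map of the 672px view.
        self.hr_backbone = nn.Sequential(
            nn.Conv2d(3, embed_dim // 4, kernel_size=4, stride=4),
            nn.GELU(),
            nn.Conv2d(embed_dim // 4, embed_dim, kernel_size=4, stride=4),
        )

    def forward(self, low_res: torch.Tensor, high_res: torch.Tensor):
        # low_res:  (B, 3, 336, 336) -> (B, 576, C) coarse visual tokens
        # high_res: (B, 3, 672, 672) -> (B, 1764, C) dense region features
        lr = self.lr_patch_embed(low_res).flatten(2).transpose(1, 2)
        lr_tokens = self.lr_blocks(lr)
        hr = self.hr_backbone(high_res).flatten(2).transpose(1, 2)
        return lr_tokens, hr


encoder = DualVisionEncoder()
lr_tokens, hr_feats = encoder(torch.randn(1, 3, 336, 336), torch.randn(1, 3, 672, 672))
print(lr_tokens.shape, hr_feats.shape)  # (1, 576, 1024) and (1, 1764, 1024)
```

The point of the two branches is that the cheap low-resolution tokens set the token budget seen by the language model, while the high-resolution features are kept aside as a source of detail to be mined on demand.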


Mini-Gemini's methodology pairs this dual-encoder system, with a convolutional network handling refined high-resolution image processing, together with patch info mining that extracts detailed visual cues. Trained on a composite dataset, the framework improves performance and adapts across Large Language Models (LLMs) of various sizes, supports efficient any-to-any inference, and performs strongly on zero-shot benchmarks.
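A minimal sketch of patch info mining, under the assumption that it can be modeled as cross-attention: the coarse low-resolution tokens act as queries that pull fine-grained detail out of the high-resolution feature map before the fused tokens are passed to the LLM. Names, shapes, and the residual-plus-norm layout are assumptions for illustration, not the paper's exact code.

```python
# Hypothetical patch info mining via cross-attention over high-res features.
import torch
import torch.nn as nn


class PatchInfoMining(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lr_tokens: torch.Tensor, hr_feats: torch.Tensor) -> torch.Tensor:
        # Each coarse token queries the dense high-res feature map and is
        # refined by what it retrieves; a residual keeps its original content.
        mined, _ = self.cross_attn(query=lr_tokens, key=hr_feats, value=hr_feats)
        return self.norm(lr_tokens + mined)


miner = PatchInfoMining()
lr_tokens = torch.randn(1, 576, 1024)   # coarse tokens from the low-res encoder
hr_feats = torch.randn(1, 1764, 1024)   # dense features from the high-res encoder
visual_tokens = miner(lr_tokens, hr_feats)
print(visual_tokens.shape)  # (1, 576, 1024), fed to the LLM as visual tokens
```

Because the mined output keeps the same token count as the low-resolution branch, the language model sees high-resolution detail without the token-length blowup of encoding the full-resolution image directly.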


In evaluations, Mini-Gemini delivers superior performance, surpassing established models such as Gemini Pro and LLaVA-1.5 on zero-shot benchmarks including MM-Vet, MMBench, and TextVQA. These results underscore Mini-Gemini's ability to handle intricate multi-modal tasks with precision.


In conclusion, Mini-Gemini represents a significant step forward for VLMs, leveraging a dual-encoder system and patch info mining to outperform existing models. While the authors acknowledge room for improvement in visual comprehension and reasoning, the work marks a promising stride toward more capable multi-modal AI.