Is Scaling the Only Path to AI Supremacy? This AI Paper Unveils ‘Phantom of Latent for Large Language and Vision Models

Large language and vision models (LLVMs) face a critical challenge in balancing performance improvements with computational efficiency. As models grow in size, reaching up to 80B parameters, they deliver impressive results but require massive hardware resources for training and inference. This issue becomes even more pressing for real-time applications, such as augmented reality (AR), where deploying these large models on devices with limited resources, like mobile phones, is nearly impossible. Overcoming this challenge is essential for enabling LLVMs to function efficiently across various fields without the high computational costs traditionally associated with larger models.

Existing methods to improve the performance of LLVMs typically involve scaling up model size, curating larger datasets, and incorporating additional modules for enhanced vision-language understanding. While these approaches improve accuracy, they impose significant computational burdens, requiring high-end GPUs and substantial VRAM for training and inference. This makes them impractical for real-time applications and resource-limited environments. Additionally, integrating external vision modules adds complexity, further limiting their usability in on-device applications.

The researchers from KAIST propose the Phantom LLVM family, which includes models ranging from 0.5B to 7B parameters. Phantom enhances learning capabilities by temporarily increasing the latent hidden dimension during multi-head self-attention (MHSA), a feature termed “Phantom Dimension.” This innovation allows the model to embed significantly more vision-language knowledge without a permanent increase in model size. Phantom Optimization (PO) is also introduced, combining autoregressive supervised fine-tuning (SFT) with a direct preference optimization (DPO)-like approach to minimize errors and ambiguities in outputs. This approach significantly improves computational efficiency while maintaining high performance.

The Phantom models employ the InternViT-300M as a vision encoder, which aligns text-to-image representations through contrastive learning. The vision projector, constructed using two fully connected layers, adapts the hidden dimension to the corresponding multimodal LLM’s latent space. A core aspect of Phantom is the temporary enlargement of the latent hidden dimension during MHSA, which enhances the model’s ability to embed vision-language knowledge without increasing its physical size. The models are trained using a dataset of 2.8M visual instruction samples, curated into 2M Phantom triples (questions, correct answers, and incorrect or ambiguous answers). These triples play a crucial role in training through PO, improving response accuracy by eliminating confusion.

Phantom exhibits strong performance improvements across several benchmarks, outperforming many larger models in tasks involving image understanding, chart interpretation, and mathematical reasoning. For instance, in benchmarks like SQAI and ChartQA, Phantom’s accuracy exceeds that of larger models such as Cambrian-1-13B and SPHINX-MoE-7B×8. These results demonstrate Phantom’s capability to handle complex vision-language tasks efficiently, all while using a smaller model size. This efficiency is largely due to Phantom Dimension and Phantom Optimization, which allow the model to maximize learning without a proportional increase in computational requirements.

The Phantom LLVM family introduces a new approach to addressing the challenge of balancing performance and computational efficiency in large vision-language models. Through the innovative use of Phantom Dimension and Phantom Optimization, Phantom enables smaller models to perform at the level of much larger models, reducing the computational burden and making these models feasible for deployment in resource-constrained environments. This innovation has the potential to expand the application of AI models across a broader range of real-world scenarios.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter..

Don’t Forget to join our 50k+ ML SubReddit

Aswin AK is a consulting intern at MarkTechPost. He is pursuing his Dual Degree at the Indian Institute of Technology, Kharagpur. He is passionate about data science and machine learning, bringing a strong academic background and hands-on experience in solving real-life cross-domain challenges.

🧵🧵 [Download] Evaluation of Large Language Model Vulnerabilities Report (Promoted)