LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression

Johns Hopkins University

This is the meta framework of LLaVolta, consisting of multiple training stages: Stage I applies heavy visual compression; Stage II applies light visual compression at a deeper layer with a wider token window; Stage III follows standard MLLM training (with no compression, matching standard inference). This scheme accelerates training by 16%+ while maintaining performance.

Abstract

While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in large multi-modal models (LMMs) has remained a largely overlooked area. In this work, we present a study on the redundancy of visual tokens and efficient training within these models. Our initial experiments show that eliminating up to 70% of visual tokens at test time by simple average pooling leads to only a minimal 3% drop in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in the visual context. Addressing this, we introduce the Visual Context Compressor, which reduces the number of visual tokens during training to enhance training efficiency without sacrificing performance. To minimize the information loss caused by compressing visual tokens while maintaining training efficiency, we develop LLaVolta as a lite training scheme. LLaVolta incorporates stage-wise visual context compression to compress the visual tokens progressively from heavy to light, and finally applies no compression at the end of training, yielding no loss of information at test time. Extensive experiments demonstrate that our approach enhances the performance of MLLMs in both image-language and video-language understanding, while also significantly cutting training costs.



🔥 Highlights

1. We present two initial studies to verify the redundancy of visual tokens in MLLMs.


2. We propose the Visual Context Compressor, a simple yet effective compression technique that utilizes an average pooler, enhancing the efficiency of multi-modal models.


3. We propose LLaVolta, an efficient training scheme that leverages the Visual Context Compressor at multiple training stages with a progressively decreasing compression ratio. To the best of our knowledge, we are among the first to explore efficient training of MLLMs.


4. Extensive experiments show that our approach not only improves the performance of MLLMs in image-language and video-language understanding across 16 benchmarks but also showcases efficiency gains by reducing training costs by 16%+.



Visual tokens are redundant in MLLMs.

Left: The accuracy of the LLaVA-1.5-7B model on the GQA benchmark varies with different percentages of retained visual tokens. The x-axis shows the percentage of original visual tokens preserved after applying 1D average pooling with varying stride sizes S at the i-th Transformer layer. Right: Visual tokens receive less attention from the [ANS] token in deeper layers of the LLaVA-1.5-7B model. These findings collectively suggest significant redundancy within the visual tokens of MLLMs.


Visual Context Compressor

The Visual Context Compressor is instantiated as an average pooler applied to the visual tokens at the k-th Transformer layer of an LLM. Formally, given the hidden visual tokens at the k-th Transformer layer Hk ∈ ℝB×C×L, the compressor performs the projection f: ℝB×C×L ↦ ℝB×C×Lout, yielding compressed visual tokens H′k ∈ ℝB×C×Lout, where Lout = L / S and S is the compression (pooling) stride.
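The following PyTorch sketch shows one way such an average pooler could be wired into a decoder layer's hidden states. The tensor layout, the compress_visual_tokens name, and the assumption that the visual tokens form a contiguous span are ours for illustration, not taken from the released code.

import torch
import torch.nn.functional as F


def compress_visual_tokens(hidden: torch.Tensor,
                           vis_start: int,
                           vis_len: int,
                           stride: int) -> torch.Tensor:
    """Average-pool the visual tokens inside one Transformer layer's hidden states.

    hidden:    (B, T, C) hidden states at layer k, containing text and visual tokens.
    vis_start: index where the visual token span begins.
    vis_len:   number of visual tokens L before compression.
    stride:    pooling stride S, so the compressed length is Lout = ceil(L / S).
    """
    prefix = hidden[:, :vis_start]                      # tokens before the visual span
    visual = hidden[:, vis_start:vis_start + vis_len]   # (B, L, C) visual tokens
    suffix = hidden[:, vis_start + vis_len:]            # tokens after the visual span

    # avg_pool1d pools over the last dimension, so move channels to dim 1 first.
    pooled = F.avg_pool1d(visual.transpose(1, 2),       # (B, C, L)
                          kernel_size=stride,
                          stride=stride,
                          ceil_mode=True).transpose(1, 2)  # (B, Lout, C)

    return torch.cat([prefix, pooled, suffix], dim=1)

In a full model, the attention mask and position ids passed to the remaining layers would also need to be shortened to match the compressed sequence.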

Compression Ratio (CR): For an LLM with N Transformer decoder layers, the compression ratio for visual tokens can be calculated as:

CR = (N ⋅ L) / ((N − K) ⋅ Lout + K ⋅ L)

where K is the index of the Transformer layer at which Visual Context Compressor is inserted; L is the length of the visual tokens input into Visual Context Compressor; and Lout is the compressed length of the visual tokens output by Visual Context Compressor.
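As a concrete check of the formula, the snippet below computes CR for an assumed LLaVA-1.5-7B-like setting (N = 32 decoder layers, L = 576 visual tokens); the specific K and S values are illustrative rather than the released configuration.

import math


def compression_ratio(N: int, K: int, L: int, S: int) -> float:
    """CR = (N * L) / ((N - K) * Lout + K * L), with Lout = ceil(L / S)."""
    L_out = math.ceil(L / S)
    return (N * L) / ((N - K) * L_out + K * L)


# Inserting the compressor at layer K = 2 with stride S = 8 (an illustrative
# heavy-compression setting) yields roughly a 5.6x (> 500%) compression ratio.
print(f"{compression_ratio(N=32, K=2, L=576, S=8):.2f}")  # ~5.56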

LLaVolta: A New Training Scheme

As depicted in the first figure, we primarily explore a three-stage training pipeline that progressively reduces the compression ratio; a configuration sketch follows the stage descriptions below.

Stage I: Heavy Compression. MLLM training commences with the first one-third of the total training epochs under a heavy compression ratio (> 500%), where Visual Context Compressor is applied at an early layer of the LLM with a large pooling stride. This setup enables very fast training.

Stage II: Light Compression. The MLLM continues training for another one-third of the total training epochs. At this stage, Visual Context Compressor is applied only at deeper layers of the LLM with a smaller pooling stride than in Stage I.

Stage III: No Compression. The MLLM continues training with the final one-third of the total training epochs, following the standard MLLM training protocol without compression. Disabling compression in the final stage ensures that the number of tokens remains consistent with the original MLLM during inference, avoiding the loss of information caused by the reduction of visual tokens.
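A minimal sketch of how such a schedule could be expressed in code, with illustrative (K, S) values that are assumptions rather than the released configuration:

# Each stage covers one-third of training and lowers the compression ratio by
# raising the insertion layer K and/or shrinking the pooling stride S,
# ending with no compression at all.
STAGES = [
    # (fraction of training, compressor layer K, pooling stride S)
    (1 / 3, 2,    8),     # Stage I:   heavy compression, early layer, large stride
    (1 / 3, 16,   2),     # Stage II:  light compression, deeper layer, smaller stride
    (1 / 3, None, None),  # Stage III: no compression (standard MLLM training)
]


def compressor_config(progress: float):
    """Return the (K, S) setting for the current training progress in [0, 1)."""
    cum = 0.0
    for frac, layer_k, stride in STAGES:
        cum += frac
        if progress < cum:
            return layer_k, stride
    return None, None  # past the schedule: no compression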



Instantiation of LLaVolta

A family of training schemes is instantiated here. The single-stage (non-compression) scheme is equivalent to the MLLM baseline (LLaVA). For multi-stage training, the compression stage can either go deeper or wider: "deeper" means an increase in K (the Transformer layer at which compression is applied), while "wider" means a decrease in the stride of the pooler.
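For illustration only, the hypothetical (layer K, pooling stride S) settings below contrast the two axes; the concrete numbers are assumptions rather than the paper's instantiation table, and the final stage always disables compression.

SCHEMES = {
    "1-stage (baseline)": [(None, None)],                   # = LLaVA, no compression
    "2-stage":            [(2, 8), (None, None)],
    "3-stage deeper":     [(2, 8), (16, 8), (None, None)],  # K increases across stages
    "3-stage wider":      [(2, 8), (2, 2),  (None, None)],  # stride S decreases across stages
}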


Towards Efficient MLLM

We conduct a thorough evaluation of the multi-modal capability of LLaVolta based on LLaVA across 13 benchmarks. The table below shows that LLaVolta not only consistently lowers training costs by 16% (12.8 hours vs. 15.3 hours) but also surpasses the non-compression baseline. The four-stage training scheme achieves the best performance on nine of the thirteen benchmarks and obtains 61.9% average performance, improving over LLaVA-v1.5-7B with considerably fewer overall TFLOPs and less training time. This indicates the necessity of designing an optimally lite training scheme.


Towards Efficient Video MLLM

We extend our training scheme to Video-LLaVA and evaluate the results on three video-language understanding benchmarks. LLaVolta outperforms Video-LLaVA while reducing training time by 9%.


BibTeX

@article{chen2024llavolta,
      title={LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression},
      author={Chen, Jieneng and Ye, Luoxin and He, Ju and Wang, Zhao-Yang and Khashabi, Daniel and Yuille, Alan},
      journal={arXiv preprint arXiv:2406.20092},
      year={2024}
    }