
When Tokens Talk Too Much: A Guide to Compressing AI Inputs from Images, Videos, and Audio

Paper at a Glance
- Paper Title: When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
- Authors: Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, and Huan Wang.
- Affiliation: A collaboration including researchers from Westlake University, Zhejiang University, Xiamen University, University of Wisconsin-Madison, Salesforce AI Research, and other institutions.
- Published in: arXiv preprint, 2025
- Link to Paper: https://arxiv.org/abs/2507.20198
The Gist of It: TL;DR
In one sentence: This paper provides the first systematic survey of token compression techniques for Multimodal Large Language Models (MLLMs), organizing the rapidly growing field by modality (image, video, audio) and underlying mechanism to tackle the massive computational cost of processing long-context inputs.
Why It Matters: The Big Picture
Multimodal Large Language Models (MLLMs) are becoming incredibly powerful, capable of understanding not just text but also images, videos, and audio. This opens up a world of possibilities, from analyzing hours of security footage to interpreting complex medical scans. But there’s a huge problem hiding in plain sight: these models choke on large inputs.
The culprit is the self-attention mechanism at the heart of the Transformer architecture, which has a computational complexity that scales quadratically with the number of input tokens. While a long text document might have a few thousand tokens, a single high-resolution image can have thousands, and a long video can generate tens of millions. As shown in Figure 1 of the paper, a 90-minute movie can be tokenized into over 54 million tokens—a number that would bring even the most powerful supercomputers to their knees. This “token explosion” is the single biggest bottleneck preventing MLLMs from being practically deployed for real-world, data-rich tasks.
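To get a feel for the scale, a back-of-the-envelope calculation helps. The patch size, frame rate, and tokens-per-frame below are illustrative assumptions (a ViT-style 14-pixel patch grid and 576 tokens per frame), not the exact settings behind the paper's Figure 1, but they land in the same tens-of-millions regime:

```python
# Rough token counts for multimodal inputs. Patch size, frame rate, and
# tokens-per-frame are illustrative assumptions; real counts depend on the
# specific vision encoder and sampling strategy.

def image_tokens(height: int, width: int, patch: int = 14) -> int:
    """A ViT-style encoder emits one token per (patch x patch) region."""
    return (height // patch) * (width // patch)

def video_tokens(minutes: float, fps: float, tokens_per_frame: int) -> int:
    """Without compression, tokens grow linearly with the sampled frames."""
    return int(minutes * 60 * fps) * tokens_per_frame

# A single high-resolution image already costs thousands of tokens.
print(image_tokens(1024, 1024))                        # 5329
# A 90-minute video sampled at full frame rate explodes into tens of millions.
print(video_tokens(90, fps=24, tokens_per_frame=576))  # 74,649,600
# Self-attention cost then scales with the square of these counts.
```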
This is where token compression comes in. By intelligently reducing the number of tokens fed to the model without losing critical information, we can make these powerful MLLMs faster, cheaper, and more accessible. This survey provides the first comprehensive map of this crucial research area.
The Core Idea: How It Works
This paper is a survey, so its core contribution is not a new method but a clear and insightful taxonomy that organizes the chaotic landscape of existing techniques. The authors argue that effective compression must account for the unique characteristics of each data type. They structure their survey around two key axes: the data modality and the compression mechanism.
1. The Problem They’re Solving
The central challenge is managing the sheer volume of tokens from multimodal data. Redundancy is everywhere, but it looks different in each modality:
- Images: Spatial redundancy, where adjacent patches are often very similar (e.g., a large patch of blue sky).
- Videos: Spatio-temporal redundancy, where consecutive frames share massive amounts of information (e.g., a static background).
- Audio: Temporal and spectral redundancy, with periods of silence, background noise, or sustained notes.
A one-size-fits-all compression strategy won’t work. The goal is to prune this redundancy intelligently.
2. The Key Innovation: A Unified Taxonomy
The authors dissect the field into four fundamental compression mechanisms, providing a unified lens through which to view and compare different methods. This taxonomy, summarized in Table I of the paper, is the heart of their contribution.
3. The Methods, Categorized
The four primary mechanisms for token compression are:
- Transformation-based: These methods transform tokens into a more compact form, often by changing their scale or representation. Think of it like resizing an image to be smaller. Common techniques include pooling (averaging a group of tokens) or convolution (using a learned filter to summarize a local neighborhood). This is simple and effective but often offers a fixed compression rate.
- Similarity-based: These techniques identify and merge tokens that are semantically similar. The model finds redundant patches (say, ten patches of grass) and merges them into a single representative “grass” token. This is more flexible than transformation but risks over-generalizing and losing fine-grained details.
- Attention-based: These methods leverage the model’s own attention mechanism to decide what’s important. Tokens that receive low attention scores from other tokens are deemed less critical and are pruned. This is a dynamic process where the model itself guides the compression, making it highly context-aware. Compression can happen early (in the vision encoder) or later (inside the LLM decoder).
- Query-based: This is perhaps the most sophisticated approach. It uses the user’s text prompt (the “query”) to guide the compression. The model distills or selects only the multimodal tokens that are most relevant to the question being asked. For example, if you ask, “What color is the car?” the model can focus its resources on car-related tokens and aggressively compress the background. (A toy code sketch of all four mechanisms appears below.)
This categorization (visualized beautifully in Figure 3) creates a powerful mental model for understanding how to make MLLMs more efficient, whether you’re working with images, videos, or audio.
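To make the four mechanisms concrete, here is a minimal toy sketch in PyTorch. The tensor sizes, keep ratios, and scoring rules are illustrative assumptions rather than the recipe of any specific method in the survey; each function only captures the flavor of its category.

```python
# Toy versions of the four compression mechanisms on a bag of visual tokens.
# Shapes, ratios, and scoring rules are illustrative, not any published method.
import torch
import torch.nn.functional as F

tokens = torch.randn(256, 64)  # 256 visual tokens, 64 dims each (toy sizes)

# 1) Transformation-based: pool fixed groups of tokens (fixed 4x rate).
def pool_compress(x: torch.Tensor, group: int = 4) -> torch.Tensor:
    n, d = x.shape
    return x[: n - n % group].reshape(-1, group, d).mean(dim=1)

# 2) Similarity-based: fold each remaining token into its nearest kept token.
def merge_compress(x: torch.Tensor, keep: int = 64) -> torch.Tensor:
    kept, rest = x[:keep], x[keep:]
    sim = F.normalize(rest, dim=-1) @ F.normalize(kept, dim=-1).T
    assign = sim.argmax(dim=-1)                    # nearest kept token per leftover
    merged = kept.clone().index_add_(0, assign, rest)
    counts = torch.ones(keep).index_add_(0, assign, torch.ones(len(rest)))
    return merged / counts.unsqueeze(-1)           # average each merge group

# 3) Attention-based: keep the tokens a summary query attends to most.
def attention_prune(x: torch.Tensor, query: torch.Tensor, keep: int = 64) -> torch.Tensor:
    scores = (query @ x.T / x.shape[-1] ** 0.5).softmax(dim=-1)  # 1 x N row
    return x[scores.topk(keep, dim=-1).indices.squeeze(0)]

# 4) Query-based: score tokens against the text prompt's embedding.
def query_select(x: torch.Tensor, text_emb: torch.Tensor, keep: int = 64) -> torch.Tensor:
    rel = F.cosine_similarity(x, text_emb.unsqueeze(0), dim=-1)
    return x[rel.topk(keep).indices]

print(pool_compress(tokens).shape)                        # torch.Size([64, 64])
print(merge_compress(tokens).shape)                       # torch.Size([64, 64])
print(attention_prune(tokens, torch.randn(1, 64)).shape)  # torch.Size([64, 64])
print(query_select(tokens, torch.randn(64)).shape)        # torch.Size([64, 64])
```

In a real MLLM these operations would sit either in the vision encoder or between decoder layers, which is exactly the pruning-location trade-off discussed below.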
Key Insights from the Survey
Since this is a survey, the key takeaways are not experimental results but rather the trends and challenges the authors identified in the field.
- Modality Matters: The survey confirms that redundancy is modality-specific. The most effective video compression methods, for instance, must tackle temporal redundancy across frames, a problem that doesn’t exist for static images.
- The Pruning Location Trade-off: There’s a critical trade-off depending on where you compress tokens. Pruning early in the pipeline (e.g., in the visual encoder) saves the most computation but risks discarding information before the LLM can even see it. Pruning later (inside the LLM) is safer and more context-aware but less computationally efficient.
- The Deployment Wall: A major finding is the practical incompatibility between many attention-based pruning methods and highly optimized libraries like FlashAttention. These libraries speed up computation by never explicitly forming the full attention matrix, meaning the attention scores needed for pruning are inaccessible. This is a significant hurdle for real-world deployment (illustrated in the short snippet after this list).
- No Free Lunch with Combining Methods: The authors point out that simply stacking different compression techniques often doesn’t work. Combining methods that target different types of redundancy can lead to conflicting pruning decisions and may even degrade performance.
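As a minimal illustration of that friction, the snippet below uses PyTorch's torch.nn.functional.scaled_dot_product_attention (which can dispatch to a FlashAttention-style fused backend) as a stand-in for a fused kernel; the shapes and the keep-128 choice are arbitrary toy values.

```python
# Fused attention kernels return only the attention output; the N x N score
# matrix that attention-based pruning ranks tokens by is never materialized.
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 256, 64)  # (batch, heads, tokens, head_dim)

# Fused path: fast and memory-efficient, but no scores to rank tokens by.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)                          # torch.Size([1, 8, 256, 64])

# Naive path: exposes the scores pruning needs, but builds the full
# N x N matrix per head, which is exactly what fused kernels avoid.
scores = (q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5).softmax(dim=-1)
importance = scores.mean(dim=(0, 1, 2))   # attention each key token receives
keep = importance.topk(128).indices       # e.g., keep the top half of tokens
print(scores.shape, keep.shape)           # torch.Size([1, 8, 256, 256]) torch.Size([128])
```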
A Critical Look: Strengths & Limitations
Strengths / Contributions
- First-of-its-Kind and Comprehensive: This is the first systematic survey on a critical and rapidly growing topic. Its comprehensive scope across images, videos, and audio provides a much-needed consolidation of a fragmented research area.
- Excellent Taxonomy: The categorization by both modality and underlying mechanism is intuitive, logical, and highly useful. It serves as an excellent framework for researchers to position their own work and understand the landscape.
- Highlights Practical Challenges: The paper goes beyond a simple literature review to identify crucial, real-world challenges, such as the deployment friction with modern acceleration libraries and the lack of robust, standardized benchmarks.
Limitations / Open Questions
- A High-Level Map: As a survey, it provides a breadth-first overview. It can’t delve into the fine-grained details of why a specific method outperforms another. Researchers will still need to read the original papers for in-depth understanding.
- A Fast-Moving Target: The primary weakness of any survey in AI is that the field moves at lightning speed. New methods will emerge, and the landscape will continue to shift, potentially challenging the boundaries of the proposed taxonomy.
- Lack of Direct Comparative Analysis: The paper presents results from various studies, but these cannot be directly compared due to differing experimental setups. The field still lacks a unified benchmark for a true apples-to-apples comparison of compression techniques, a gap this survey highlights but does not fill.
Contribution Level: Foundational Contribution. While it doesn’t introduce a new algorithm, this survey provides an essential service to the research community. By organizing, structuring, and synthesizing existing knowledge, it creates a clear roadmap for future innovation in efficient MLLMs. It is an invaluable resource for anyone entering or working in this domain.
Conclusion: Potential Impact
“When Tokens Talk Too Much” is more than just an academic overview; it’s a call to action. It clearly articulates that token compression is not merely an optimization for efficiency but a fundamental enabler for the next generation of MLLMs. Without these techniques, processing the vast, messy, and long-form data of the real world will remain out of reach.
The insights from this survey will guide future research toward creating more unified compression frameworks that work across modalities and designing new model architectures that are inherently token-efficient from the ground up. As these methods mature, they will unlock applications in autonomous driving, real-time medical analysis, and intelligent agents that were previously computationally unimaginable.
- Title: When Tokens Talk Too Much: A Guide to Compressing AI Inputs from Images, Videos, and Audio
- Author: Jellyfish
- Created at: 2025-10-04 23:12:20
- Updated at: 2025-10-06 09:24:26
- Link: https://makepaperseasy.com/posts/20251004231220.html
- License: This work is licensed under CC BY-NC-SA 4.0.
