Multi-Granular Spatio-Temporal Token Merging for
Training-Free Acceleration of Video LLMs
Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim
🚀 Double the Speed, Zero Training: The Free Lunch for Video LLMs ⚡
Overall comparison of training-free token reduction methods using LLaVA-Video-7B under 50% and 30% pre-filling token budgets. Query-agnostic methods support KV-cache reuse, whereas others require re-computation for each new query. We evaluate on three types of video QA datasets: short videos (<3 min), long videos (<1 hour), and needle-in-a-haystack (NIAH). (a, b): Per-dataset accuracy. (c, d): Average results across all datasets.
Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data, which prior work has overlooked. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2× speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3× speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video.
Overview of STTM. (Left) Our spatio-temporal token merging method is a training-free, plug-and-play module that produces spatio-temporally multi-granular tokens. (Middle) In step 1, tokens are merged based on spatial locality: similar tokens within a 2D grid are combined into a single token. (Right) In step 2, the spatially multi-granular tokens are further merged along the temporal dimension: similar tokens across frames are consolidated into their earliest occurrence. Arrows indicate the direction of token merging; green lines indicate merging over one timestep, and magenta lines indicate merging over two timesteps. The scale and number of tokens are chosen for illustration.
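The temporal step (step 2) might be sketched roughly as follows. This is an illustrative simplification, not the paper's exact procedure: the similarity threshold `tau`, the per-position matching of tokens across frames, and the boolean keep-mask representation are all our assumptions for the sketch; in STTM the tokens being compared are spatially multi-granular rather than a fixed grid.

```python
import numpy as np

def temporal_merge(frames, tau=0.9):
    """Directed pairwise temporal merging (illustrative sketch).

    frames: (T, N, D) array of token features, N tokens per frame.
    A token at frame t that is similar (cosine >= tau, a hypothetical
    threshold) to the token at the same position in its earliest
    surviving occurrence is merged into that earlier token.
    Returns a boolean keep-mask of shape (T, N).
    """
    T, N, _ = frames.shape
    keep = np.ones((T, N), dtype=bool)
    ref = frames[0].copy()  # earliest surviving token per position
    for t in range(1, T):
        cur = frames[t]
        sims = np.einsum('nd,nd->n', cur, ref) / (
            np.linalg.norm(cur, axis=1) * np.linalg.norm(ref, axis=1) + 1e-8)
        merged = sims >= tau          # redundant w.r.t. earlier occurrence
        keep[t] = ~merged             # drop merged tokens at frame t
        ref[~merged] = cur[~merged]   # positions that changed become new refs
    return keep
```

On a static clip every token after the first frame merges into frame 0, while fully changing content keeps all tokens, which is the intended behavior of directed merging into the earliest occurrence.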
Spatial Merging. A coarse-to-fine spatial search is performed over a quadtree structure. If all four fine child nodes exhibit high similarity with the coarse parent node, the search terminates and the parent node represents the corresponding region. Otherwise, the search continues until the finest level is reached. The scale shown for each level is for illustration only.
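The coarse-to-fine search above can be sketched as a small recursion. This is a minimal sketch under our own assumptions: the threshold `tau` is hypothetical, and representing a parent node by the mean of its region's tokens is a simplification we introduce, not necessarily how the paper computes coarse nodes.

```python
import numpy as np

def quadtree_merge(tokens, tau=0.9):
    """Coarse-to-fine spatial merging over a quadtree (illustrative sketch).

    tokens: (H, W, D) grid of patch-token features for one frame, with
    H == W a power of two. Returns a list of (row, col, size, feature)
    multi-granular tokens.
    """
    H = tokens.shape[0]
    if H == 1:  # finest level reached: keep the patch token as-is
        return [(0, 0, 1, tokens[0, 0])]
    parent = tokens.mean(axis=(0, 1))  # assumed coarse node for this region
    h = H // 2
    offsets = [(0, 0), (0, h), (h, 0), (h, h)]
    children = [tokens[r:r + h, c:c + h] for r, c in offsets]
    # Cosine similarity between the parent node and each of its 4 children.
    sims = []
    for child in children:
        cm = child.mean(axis=(0, 1))
        sims.append(cm @ parent /
                    (np.linalg.norm(cm) * np.linalg.norm(parent) + 1e-8))
    if min(sims) >= tau:
        # All four children resemble the parent: terminate the search and
        # let one parent token represent the whole region.
        return [(0, 0, H, parent)]
    out = []  # otherwise descend into each quadrant
    for (r, c), child in zip(offsets, children):
        out += [(r + rr, c + cc, s, f)
                for rr, cc, s, f in quadtree_merge(child, tau)]
    return out
```

A uniform region collapses into a single coarse token, while heterogeneous regions split down to finer granularities, yielding the spatially multi-granular tokens that feed step 2.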
Main. Comparison of training-free token reduction methods using LLaVA-Video-7B under 50% and 30% pre-filling token budgets. Token-reduced results are reported relative to the 100%-budget result.
Other MLLMs (1). Comparison of training-free token reduction methods using LLaVA-OneVision-7B. Results are relative to the 100% result.
Other MLLMs (2). Comparison of training-free token reduction methods using Qwen2VL-7B. Results are relative to the 100% result.
Large MLLM. Comparison using LLaVA-Video-72B. Results are relative to the 100% result.
Trade-off between accuracy and visual token retention ratio.
Visualization of spatial token merging results. Each image patch within a green box represents a single token.
Visualization of spatio-temporal token merging results. (a) The first eight consecutive frames are sampled. (b) Intermediate frames are sampled for illustration purposes. Empty regions indicate areas that have been merged into earlier tokens.
Demo video will be released soon. Stay tuned!
@article{hyun2025multi,
title={Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs},
author={Hyun, Jeongseok and Hwang, Sukjun and Han, Su Ho and Kim, Taeoh and Lee, Inwoong and Wee, Dongyoon and Lee, Joon-Young and Kim, Seon Joo and Shim, Minho},
journal={arXiv preprint arXiv:2507.07990},
year={2025}
}