Attention visualization across different layers of LLaVA-Video-7B. The reference layer focuses on the query-related visual regions, and we use its attention scores to select query-related tokens. Video link: https://www.youtube.com/watch?v=fFjv93ACGo8. We present three queries:
We propose FlexSelect, a flexible and efficient token selection strategy for processing long videos. FlexSelect identifies and retains the most semantically relevant content by leveraging cross-modal attention patterns from a reference transformer layer. It comprises two key components: (1) a training-free token ranking pipeline that leverages faithful cross-modal attention weights to estimate each video token's importance, and (2) a rank-supervised lightweight selector that is trained to replicate these rankings and filter redundant tokens.
FlexSelect can be seamlessly integrated into various VideoLLM architectures, such as LLaVA-Video, InternVL, and Qwen-VL, serving as a plug-and-play module to extend their temporal context length. Empirically, FlexSelect delivers strong gains across multiple long-video benchmarks, including VideoMME, MLVU, LongVideoBench, and LVBench. Moreover, it achieves significant speed-ups (e.g., up to 9× on the LLaVA-Video-7B model), highlighting FlexSelect's promise for efficient long-form video understanding.
Figure 1: (left) Visualization of cross-modal attention maps of LLaVA-Video-7B across layers (user query: "what's the color of the cup?"). Attention scores progressively highlight the query-related region (the cup) with increasing layer depth, and this highlighting is most pronounced at a specific reference layer (layer 19 in this example). FlexSelect uses attention scores from this layer to select semantically related visual tokens. (right) VideoMME accuracy and response time (time to generate the first token) of LLaVA-Video-7B. The original model with 64 input frames achieves a limited accuracy of 64.4% due to inadequate coverage of long video content, while simply increasing the frame count overloads the model's context window, reducing accuracy to 58.5% and slowing the response time to 38.2s. FlexSelect improves this by filtering irrelevant tokens, achieving 68.9% accuracy at 512 frames with a 9× faster response (4.2s).
Figure 2: FlexSelect pipeline. Given a long video and a query, FlexSelect first partitions the video into frame sets and encodes each into visual tokens. For each set, a token selector identifies semantically relevant tokens by ranking cross-modal attention scores, taken either from a reference layer of the pre-trained VideoLLM or from a lightweight selector network trained to approximate them. In this process, the projector and text embeddings convert the visual tokens and the user query into tokens that match the dimension of the subsequent transformer layers. After scoring, the top-ranked tokens across all segments are aggregated and fed to the decoder for final reasoning. FlexSelect operates in a training-free or rank-supervised mode and serves as a plug-and-play module that enables efficient long-video understanding without modifying the base VideoLLM.
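The sketch below illustrates the training-free branch of this pipeline in PyTorch, assuming a HuggingFace-style VideoLLM that exposes per-layer attention maps. The function names, the `REF_LAYER` index, and the `TOP_K` budget are illustrative assumptions, not the authors' actual API or configuration.

```python
# Sketch of training-free token ranking via reference-layer cross-modal attention.
import torch

REF_LAYER = 19   # reference layer (layer 19 in the Figure 1 example; indexing is an assumption)
TOP_K = 1024     # number of visual tokens kept per frame set (assumed budget)

@torch.no_grad()
def rank_visual_tokens(model, visual_tokens, text_tokens):
    """Score visual tokens by cross-modal attention at the reference layer."""
    # Concatenate projected visual tokens and query text embeddings, then run a
    # forward pass with attentions returned (in practice only the layers up to
    # the reference layer need to be executed).
    inputs = torch.cat([visual_tokens, text_tokens], dim=1)
    outputs = model(inputs_embeds=inputs, output_attentions=True, use_cache=False)
    attn = outputs.attentions[REF_LAYER]          # (batch, heads, seq, seq)
    n_vis = visual_tokens.shape[1]
    # Attention from text-query positions (rows) to each visual token (columns),
    # averaged over heads and query positions -> one relevance score per token.
    scores = attn[:, :, n_vis:, :n_vis].mean(dim=(1, 2))   # (batch, n_vis)
    return scores

def select_tokens(visual_tokens, scores, k=TOP_K):
    """Keep only the top-k highest-scoring visual tokens, preserving temporal order."""
    idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values
    return torch.gather(visual_tokens, 1,
                        idx.unsqueeze(-1).expand(-1, -1, visual_tokens.shape[-1]))
```

Keeping the selected indices in their original order (the `sort` before `gather`) preserves the temporal structure of the video tokens handed to the decoder.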
While the training-free approach described above effectively reduces computational overhead, it still relies on partial forward passes through the large VideoLLM to score visual tokens. To further enhance inference efficiency, we introduce a lightweight token selector trained via rank supervision to predict semantic relevance scores independently. The selector model is explicitly designed to replicate the token-ranking behavior observed at the reference transformer layer Lref.
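For intuition, a minimal sketch of such a lightweight selector is shown below: a small transformer that scores each visual token conditioned on the query embedding. The depth, width, and interface here are assumptions for illustration, not the configuration used in the paper.

```python
# Sketch of a lightweight token selector that mimics the reference layer's ranking.
import torch
import torch.nn as nn

class LightweightSelector(nn.Module):
    def __init__(self, dim=1024, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.score_head = nn.Linear(dim, 1)   # one relevance score per token

    def forward(self, visual_tokens, text_tokens):
        # Jointly encode visual and query tokens so that visual tokens can
        # attend to the query, then score only the visual positions.
        n_vis = visual_tokens.shape[1]
        x = torch.cat([visual_tokens, text_tokens], dim=1)
        x = self.encoder(x)
        return self.score_head(x[:, :n_vis]).squeeze(-1)    # (batch, n_vis)
```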
Figure 3: Illustration of rank-supervised training. We align the lightweight model's predicted scores r̂ with the reference layer's semantic relevance scores r_ref by optimizing the Spearman rank correlation coefficient between them. Once trained, the rankings derived from the two scores follow a similar order, enabling the lightweight model to rank visual tokens as the reference layer does and to select the relevant tokens quickly.
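Since exact Spearman correlation is not differentiable, the sketch below uses a common surrogate: a soft ranking of the predicted scores (pairwise sigmoid comparisons) followed by a Pearson correlation against the reference ranks. This soft-rank formulation is an assumed stand-in, not necessarily the authors' implementation.

```python
# Sketch of a rank-supervised objective approximating the Spearman correlation.
import torch

def soft_rank(scores, temperature=1.0):
    """Differentiable approximation of ranks via pairwise sigmoid comparisons."""
    # rank_i ≈ 1 + sum_j sigmoid((s_i - s_j) / t); the -0.5 removes the j == i term.
    diff = scores.unsqueeze(-1) - scores.unsqueeze(-2)        # (batch, n, n)
    return 1.0 + torch.sigmoid(diff / temperature).sum(dim=-1) - 0.5

def pearson(x, y, eps=1e-8):
    x = x - x.mean(dim=-1, keepdim=True)
    y = y - y.mean(dim=-1, keepdim=True)
    return (x * y).sum(-1) / (x.norm(dim=-1) * y.norm(dim=-1) + eps)

def rank_supervision_loss(pred_scores, ref_scores):
    """Maximize the (approximate) Spearman correlation between the selector's
    predicted scores and the reference layer's relevance scores."""
    pred_rank = soft_rank(pred_scores)                        # differentiable ranks
    ref_rank = ref_scores.argsort(-1).argsort(-1).float()     # hard ranks as targets
    return 1.0 - pearson(pred_rank, ref_rank).mean()
```

Because only the ordering matters, the selector is free to produce scores on any scale as long as their ranking matches the reference layer's.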
Our Results
Table 1: Main results. We denote FlexSelect integrated with the lightweight token selector as FlexSelect-Lite. We conduct a comprehensive evaluation on different long-video benchmarks across diverse VideoLLMs. FlexSelect achieves SoTA results on VideoMME (74.4), MLVU (76.6), LongVideoBench (66.9), and LVBench (56.6) while reducing tokens by over 90% (up to 9× speedup).
@misc{zhang2025flexselectflexibletokenselection,
title={FlexSelect: Flexible Token Selection for Efficient Long Video Understanding},
author={Yunzhu Zhang and Yu Lu and Tianyi Wang and Fengyun Rao and Yi Yang and Linchao Zhu},
year={2025},
eprint={2506.00993},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.00993},
}