VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

🎯 The Core Thesis

The authors address the memory bottleneck in autoregressive video generation: the Key-Value (KV) cache grows linearly with the number of cached tokens, so it becomes prohibitively large when synthesizing long, minute-scale videos. The core thesis is that the high-dimensional KV cache in video diffusion models contains significant redundancy and can be compressed into a low-rank latent representation without sacrificing temporal consistency or visual quality.

💡 The Innovation

The paper introduces VideoMLA, a video architecture built on Multi-Head Latent Attention (MLA) that implements a low-rank latent KV cache. Unlike standard Multi-Head Attention (MHA) or Grouped-Query Attention (GQA), which cache full keys and values for each head or head group, VideoMLA caches one compact latent vector per token. The compression is performed by a learnable projection matrix that captures the most salient spatio-temporal features, and keys and values are recovered from the latent when attention is computed. By operating on these latent projections, the model can maintain a vast “memory” of previous frames (enabling minute-scale generation) while keeping the per-token cache cost, and therefore the overall memory footprint, small and manageable.
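The caching idea described above can be sketched in a few lines. This is a minimal single-head illustration under stated assumptions, not the paper's implementation: the class name LatentKVCache, the weight names w_down, w_up_k and w_up_v, the dimensions, and the random (untrained) weights are all hypothetical, and only the Python standard library is used so the sketch runs anywhere.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned projection (assumption: untrained)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LatentKVCache:
    """Stores one d_latent vector per token instead of a full key and value."""

    def __init__(self, d_model, d_latent, seed=0):
        rng = random.Random(seed)
        self.w_down = random_matrix(d_latent, d_model, rng)  # hidden -> latent
        self.w_up_k = random_matrix(d_model, d_latent, rng)  # latent -> key
        self.w_up_v = random_matrix(d_model, d_latent, rng)  # latent -> value
        self.latents = []  # the only per-token state that is kept

    def append(self, hidden):
        """Compress a token's hidden state and cache only the latent."""
        self.latents.append(matvec(self.w_down, hidden))

    def attend(self, query):
        """Rebuild keys and values from latents, then apply softmax attention."""
        keys = [matvec(self.w_up_k, c) for c in self.latents]
        values = [matvec(self.w_up_v, c) for c in self.latents]
        scale = 1.0 / math.sqrt(len(query))
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        return [
            sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(len(query))
        ]

    def floats_stored(self):
        return sum(len(c) for c in self.latents)


d_model, d_latent, num_tokens = 64, 8, 10
cache = LatentKVCache(d_model, d_latent)
rng = random.Random(1)
for _ in range(num_tokens):
    cache.append([rng.gauss(0.0, 1.0) for _ in range(d_model)])

out = cache.attend([rng.gauss(0.0, 1.0) for _ in range(d_model)])
full_cache_floats = 2 * num_tokens * d_model  # separate key and value per token
print(cache.floats_stored(), full_cache_floats)  # prints: 80 1280
```

In this toy setup the latent cache holds 80 floats where a full key-plus-value cache would hold 1280. Note that the cache still grows with every appended token; the saving comes from the smaller per-token cost.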

📈 Key Results

VideoMLA reports gains on three fronts in long-form video synthesis:

Memory Reduction: The KV cache size was reduced by up to 90% compared to standard autoregressive diffusion baselines, enabling the generation of videos exceeding 60 seconds on consumer-grade GPUs.
Temporal Stability: Unlike previous pruning or window-based attention methods, VideoMLA maintained global coherence, avoiding the “drift” or quality degradation typical in long-sequence generation.
Inference Speed: The low-rank operations reduced the latency of computing attention, leading to a measurable increase in throughput (tokens per second) during the video diffusion process.
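To make the memory figure concrete, the back-of-the-envelope calculation below compares a full KV cache with a latent cache. Every number in it (layer count, head count, head dimension, latent width, token count, 2-byte floats) is a hypothetical configuration chosen for illustration, not a value from the paper, and the function name kv_cache_bytes is likewise an assumption.

```python
def kv_cache_bytes(num_tokens, num_layers, floats_per_token, bytes_per_float=2):
    """Cache size in bytes; grows linearly with the number of cached tokens."""
    return num_tokens * num_layers * floats_per_token * bytes_per_float


# Hypothetical model configuration (not taken from the paper).
num_layers = 30
num_heads = 16
head_dim = 128
d_latent = 512
num_tokens = 100_000

# Standard attention caches a key and a value for every head.
full_floats = 2 * num_heads * head_dim
# A latent cache stores a single shared vector per token and layer.
latent_floats = d_latent

full_bytes = kv_cache_bytes(num_tokens, num_layers, full_floats)
latent_bytes = kv_cache_bytes(num_tokens, num_layers, latent_floats)
reduction = 1 - latent_bytes / full_bytes

print(f"full cache:   {full_bytes / 1e9:.3f} GB")
print(f"latent cache: {latent_bytes / 1e9:.3f} GB")
print(f"reduction:    {reduction:.1%}")  # prints: reduction:    87.5%
```

Under these assumed sizes the reduction depends only on the ratio of the latent width to the full key-plus-value width, so a figure near 90% corresponds to a latent roughly one tenth the size of the tensors it replaces.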

🌍 Implications

This innovation paves the way for AI-generated short films and long-form visual storytelling. By shrinking the KV cache, the research moves autoregressive video models closer to being commercially viable for high-resolution, long-duration content. It also suggests that “compressed memory” is a viable path for other settings where memory overhead is the primary limiting factor, such as long-context LLMs or high-resolution 3D scene synthesis.

⚖️ Verdict

A technically sophisticated and highly practical advancement. VideoMLA solves a tangible engineering problem (memory limits) with an elegant mathematical approach (low-rank latent projection). The result is a scalable framework that brings minute-scale video generation within reach, which makes this a pivotal paper for the future of generative video.