Efficient Sparse Attention via Dynamic Token Pruning in Multi-Modal Models

• Original White Paper (PDF)

Core Thesis
The paper proposes “Aperture-to-Aperture” (A2A), a self-contrastive test-time training framework for one-shot ultrasound image denoising. The central premise is that in Synthetic Aperture Ultrasound (SAU), multiple sub-aperture images share the same underlying anatomical structure but differ in their noise patterns. By disentangling these two components in a high-dimensional latent space, the system can reconstruct a clean image from a single noisy sample without prior training on labeled data.

Innovation
The key innovation is the “Pyramid Self-contrastive Learning” framework. A2A uses three integrated modules (an anatomy encoder, a noise encoder, and a decoder) to separate anatomical similarity and noise randomness into two distinct pyramid latent spaces. It turns denoising into a self-supervised proxy task: swapping shuffled noisy samples from multiple sub-aperture transmissions. This allows the model to learn the “anatomy space” (low-rank components) and the “noise space” (high-rank components) purely at test time.

Key Results
- Domain Shift Elimination: Because training occurs at test time on the specific sample being denoised, the framework eliminates the domain shift and pretraining costs that affect traditional supervised regression models.
- Quantitative Gains: Simulation experiments showed improvements of 69.3% in Signal-to-Noise Ratio (SNR) and 34.4% in Contrast-to-Noise Ratio (CNR).
- In Vivo Validation: Testing on real-world data (heart, liver, kidney) showed gains of 84.8% in SNR and 25.7% in CNR using only two aperture data points.
- Architecture Agnostic: The framework’s efficacy was validated across diverse imaging targets, noise levels, and aperture configurations.
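
The percentage gains listed under Key Results are relative improvements in SNR and CNR. The summary does not give the paper's exact metric definitions, so the sketch below assumes two definitions commonly used for ultrasound images: SNR as the mean over the standard deviation of a region, and CNR as the absolute mean difference between a target and a background region over their pooled standard deviation. The function names and the synthetic data are illustrative only and are not taken from the paper.

```python
import numpy as np


def snr(region):
    """SNR of a region as mean / standard deviation (assumed definition)."""
    return float(np.mean(region) / np.std(region))


def cnr(target, background):
    """CNR as |mean difference| / pooled standard deviation (assumed definition)."""
    pooled_std = np.sqrt(np.var(target) + np.var(background))
    return float(abs(np.mean(target) - np.mean(background)) / pooled_std)


def percent_gain(before, after):
    """Relative improvement of a metric, in percent."""
    return 100.0 * (after - before) / before


# Synthetic stand-ins for a noisy region and its denoised counterpart.
rng = np.random.default_rng(0)
noisy = 1.0 + 0.5 * rng.normal(size=1000)
denoised = 1.0 + 0.25 * rng.normal(size=1000)

gain = percent_gain(snr(noisy), snr(denoised))
print(f"SNR gain on synthetic data: {gain:.1f}%")
```

Under these assumptions, a reported gain of 69.3% would mean the denoised SNR is 1.693 times the SNR of the noisy input.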
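
The swap proxy task described under Innovation can be illustrated with a toy data-flow sketch. Everything here is an assumption made for illustration: the single linear maps stand in for the anatomy encoder, noise encoder, and decoder, the loss is a plain mean squared error, the multi-scale pyramid is omitted, and no training step is shown. It is not the paper's architecture or objective. It only shows the idea that, if two sub-aperture images share anatomy, exchanging their anatomy codes should still let the decoder reproduce each noisy image.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_linear(in_dim, out_dim):
    """Hypothetical stand-in for a learned network: one random linear map."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))


def swap_loss(x1, x2, w_anat, w_noise, w_dec):
    """Sum of mean squared errors of the two swapped reconstructions (assumed loss)."""
    # Encode each sub-aperture image into an anatomy code and a noise code.
    a1, a2 = x1 @ w_anat, x2 @ w_anat
    n1, n2 = x1 @ w_noise, x2 @ w_noise
    # Swap the anatomy codes. If anatomy is shared, the noise code alone
    # should determine which noisy image the decoder reproduces.
    rec1 = np.concatenate([a2, n1]) @ w_dec  # target: x1
    rec2 = np.concatenate([a1, n2]) @ w_dec  # target: x2
    return float(np.mean((rec1 - x1) ** 2) + np.mean((rec2 - x2) ** 2))


# Two synthetic "sub-aperture" signals: same anatomy, independent noise.
pixels, code_dim = 64, 8
anatomy = np.sin(np.linspace(0.0, 3.0 * np.pi, pixels))
x1 = anatomy + 0.3 * rng.normal(size=pixels)
x2 = anatomy + 0.3 * rng.normal(size=pixels)

w_anat = make_linear(pixels, code_dim)
w_noise = make_linear(pixels, code_dim)
w_dec = make_linear(2 * code_dim, pixels)

loss = swap_loss(x1, x2, w_anat, w_noise, w_dec)
print(f"swap loss with untrained weights: {loss:.4f}")
```

In a test-time training setting, a loss of this kind would be minimized on the sample being denoised, using an automatic differentiation framework rather than the fixed random weights used here.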

Implications
A2A represents a significant leap for medical imaging, particularly in modalities where “ground truth” clean images are impossible to acquire in vivo (such as a beating heart). By moving the learning process to the test phase and leveraging the inherent redundancy of multi-aperture acquisition, it provides a path toward high-fidelity anatomical visualization and functional assessment without the risk of introducing “fake textures” or artifacts common in pre-trained deep learning denoisers.

Verdict
High Impact. The transition to pure test-time training for medical denoising solves the intractable “labeled data” problem and the “domain shift” problem simultaneously.