Current solutions to the quadratic complexity of attention in LLMs often compromise model accuracy or require additional pretraining. Approaches such as sparse attention, low-rank factorization, and external memory aim to approximate dense attention or manage memory more efficiently, but they often incur accuracy losses or are impractical to retrofit onto already-pretrained models.
SampleAttention is an adaptive structured sparse attention mechanism proposed by a team of researchers in China. It aims to reduce computational overhead in large language models (LLMs) with long context windows while maintaining accuracy. The method captures essential information by attending to a fixed percentage of adjacent tokens to cover local window patterns, and by employing a two-stage query-guided key-value filtering approach to cover column stripe patterns. This near-lossless sparse attention can be integrated into off-the-shelf LLMs without compromising accuracy.
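The two sparsity patterns above can be sketched as a mask-construction routine: a causal local window covering a fixed fraction of adjacent tokens, plus key columns selected by accumulating the attention mass of a sampled subset of queries. This is an illustrative sketch only, not the paper's implementation; the function names, the sampling stride, and the cumulative-mass threshold are assumptions.

```python
import numpy as np

def _causal_softmax(score_rows, query_positions):
    # Softmax over each sampled query's row, restricted to causal (past) keys.
    out = np.zeros_like(score_rows)
    for r, q in enumerate(query_positions):
        row = score_rows[r, : q + 1]
        e = np.exp(row - row.max())
        out[r, : q + 1] = e / e.sum()
    return out

def sample_attention_mask(scores, window_frac=0.1, stripe_threshold=0.95):
    """Build a sparse attention mask combining a local window with
    query-guided column stripes (illustrative sketch of the idea)."""
    n = scores.shape[0]
    mask = np.zeros((n, n), dtype=bool)

    # Local window pattern: each query attends to a fixed fraction
    # of its adjacent (preceding) tokens.
    w = max(1, int(window_frac * n))
    for i in range(n):
        mask[i, max(0, i - w) : i + 1] = True

    # Column stripes, stage 1: sample a subset of queries (hypothetical
    # stride) and accumulate their attention probability per key column.
    sampled = np.arange(0, n, max(1, n // 8))
    probs = _causal_softmax(scores[sampled], sampled)
    col_scores = probs.sum(axis=0)

    # Stage 2: keep the smallest set of key columns whose cumulative
    # sampled attention mass reaches the threshold.
    order = np.argsort(col_scores)[::-1]
    cum = np.cumsum(col_scores[order]) / col_scores.sum()
    keep = order[: np.searchsorted(cum, stripe_threshold) + 1]
    mask[:, keep] = True

    return np.tril(mask)  # enforce causality
```

Because the mask depends on the sampled queries' actual scores, the stripe columns adapt per head and per input, which is what lets a structured-sparse scheme stay near-lossless without retraining.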
The main issue with standard attention in LLMs is its quadratic complexity in sequence length, which significantly increases computational cost and drives up Time-to-First-Token (TTFT) latency, making real-time interaction over long contexts challenging. Existing attempts to mitigate this complexity often require extra pretraining or fine-tuning and can compromise model accuracy.
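The quadratic scaling is easy to see in a rough FLOP count for the prefill phase: both the QKᵀ score matrix and the score-times-V product touch n² entries. The sketch below is a back-of-the-envelope estimate that ignores softmax, projections, and multi-head bookkeeping; the head dimension of 128 is an assumption for illustration.

```python
def dense_attention_flops(seq_len: int, head_dim: int) -> int:
    # QK^T is (n x d)(d x n) -> ~2*n^2*d multiply-adds,
    # and softmax(QK^T) @ V adds another ~2*n^2*d.
    return 4 * seq_len**2 * head_dim

# Doubling the context length quadruples the attention cost,
# which is what inflates Time-to-First-Token at long contexts.
base = dense_attention_flops(8_192, 128)
doubled = dense_attention_flops(16_384, 128)
ratio = doubled / base  # -> 4.0
```

This is why a sparse scheme that attends to only a fraction of key positions per query can cut prefill latency roughly in proportion to the achieved sparsity.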