Large language models (LLMs) have reached the performance level of GPT-4, demonstrating significant capabilities in tasks such as generating human-quality text, translating languages, composing creative content, and providing meaningful responses to questions [4]. However, deploying these models for applications requiring extensive context, such as repository-level coding and hour-long video understanding, poses challenges because the memory footprint of the KV cache grows rapidly with context length and quickly becomes prohibitive.
Repository-level coding tasks demand input contexts ranging from 100K to 10M tokens, a significant leap from the 4K-token limit typical of standard LLMs. This larger context window is necessary to capture the complexity and interdependence of code across an entire repository.
The primary challenge in deploying long-context transformers is the size of the KV (key-value) cache. For a 30+ billion parameter model, a 100K-token context requires 22.8GB of KV cache, compared with just 0.91GB for a 4K-token context. Although the cache grows linearly with context length, its sheer size at long contexts poses a significant obstacle to serving long-context transformers efficiently.
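These figures follow from the standard per-token KV cache formula: 2 (keys and values) × number of layers × number of KV heads × head dimension × bytes per element. The sketch below reproduces numbers close to those above under an assumed configuration resembling a 34B-class GQA model (60 layers, 8 KV heads, head dimension 128, fp16); the exact settings used in the text may differ slightly.

```python
def kv_cache_bytes(seq_len: int,
                   num_layers: int = 60,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    Defaults assume a 34B-class GQA model in fp16; this configuration is
    an assumption for illustration, not one stated in the text.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

GiB = 1024 ** 3
print(f"100K-token context: {kv_cache_bytes(100_000) / GiB:.1f} GiB")  # ~22.9 GiB
print(f"4K-token context:   {kv_cache_bytes(4_000) / GiB:.2f} GiB")    # ~0.92 GiB
```

The ratio between the two contexts (100K / 4K = 25×) matches the ratio of the cache sizes, illustrating that the growth is linear in context length even though the absolute memory cost at 100K tokens already exceeds the capacity of many single accelerators.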