Large language models (LLMs) have reached the performance level of GPT-4, demonstrating significant capabilities in tasks such as generating human-quality text, translating languages, composing creative content, and providing meaningful responses to questions [4]. However, deploying these models for applications requiring extensive context, such as repository-level coding and hour-long video understanding, poses challenges because the memory footprint of the KV cache grows rapidly with context length and quickly becomes prohibitive.
Repository-level coding tasks demand input contexts ranging from 100K to 10M tokens, a significant leap from the 4K-token limit typical of standard LLMs. This larger context window is necessary to capture the complexity and interdependence of code across an entire repository.
The primary challenge in deploying long-context transformers is the size of the KV (key-value) cache. For a 30+ billion parameter model, a 100K-token context requires 22.8GB of KV cache, compared with just 0.91GB for a 4K-token context. Although the cache grows linearly with context length, its sheer size at long contexts poses a significant obstacle to serving long-context transformers efficiently.
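These figures follow from the standard per-token KV cache formula: 2 (keys and values) × number of layers × number of KV heads × head dimension × bytes per element. The sketch below reproduces numbers close to those above under an assumed configuration resembling a 34B-class GQA model (60 layers, 8 KV heads, head dimension 128, fp16); the exact settings used in the text may differ slightly.

```python
def kv_cache_bytes(seq_len: int,
                   num_layers: int = 60,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    Defaults assume a 34B-class GQA model in fp16; this configuration is
    an assumption for illustration, not one stated in the text.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

GiB = 1024 ** 3
print(f"100K-token context: {kv_cache_bytes(100_000) / GiB:.1f} GiB")  # ~22.9 GiB
print(f"4K-token context:   {kv_cache_bytes(4_000) / GiB:.2f} GiB")    # ~0.92 GiB
```

The ratio between the two contexts (100K / 4K = 25×) matches the ratio of the cache sizes, illustrating that the growth is linear in context length even though the absolute memory cost at 100K tokens already exceeds the capacity of many single accelerators.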