
The conventional metrics for evaluating LLM inference include Time To First Token (TTFT), Time Between Tokens (TBT), normalized latency, and Time Per Output Token (TPOT). Each of these captures a particular slice of latency or throughput, but none of them alone gives a complete picture of the experience a user actually perceives.

TTFT and TBT are insufficient for real-time applications because they measure individual token latencies in isolation: they say nothing about end-to-end throughput, and they hide inter-token jitter and scheduling delays. As a result, a system can look healthy on these metrics while still failing to deliver the smooth, consistent token stream that users of real-time applications such as chat and translation expect.
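To make the gap concrete, here is a minimal sketch (not part of the Metron framework; the function and variable names are illustrative) that computes the conventional per-request metrics from token emission timestamps. The two hypothetical streams end up with nearly identical TTFT and mean TBT/TPOT, yet one of them contains a long mid-stream stall that the averages never expose:

```python
from statistics import mean

def summarize_stream(request_arrival: float, token_times: list[float]) -> dict:
    """Conventional per-request metrics from token emission timestamps (seconds)."""
    ttft = token_times[0] - request_arrival                       # Time To First Token
    tbt = [b - a for a, b in zip(token_times, token_times[1:])]   # Time Between Tokens
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)  # Time Per Output Token
    return {"ttft": ttft, "mean_tbt": mean(tbt), "max_tbt": max(tbt), "tpot": tpot}

# Two hypothetical streams: same first-token latency and similar average rate,
# but very different smoothness as experienced by the user.
steady = [0.2 + 0.05 * i for i in range(20)]                  # 50 ms between every token
bursty = [0.2 + 0.01 * i for i in range(10)] + \
         [1.0 + 0.01 * i for i in range(10)]                  # ~700 ms stall mid-stream

for name, times in [("steady", steady), ("bursty", bursty)]:
    print(name, summarize_stream(request_arrival=0.0, token_times=times))
# Both streams report a similar mean TBT and TPOT, but "bursty" hides a stall
# that a user would perceive as a freeze; averaged metrics do not surface it.
```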

The Metron framework introduces two new metrics for evaluating LLM inference systems: fluidity-index and fluid token generation rate [5]. Fluidity-index measures the fraction of tokens within a request that meet their target latency deadlines, taking prompt length and observed system performance into account. Fluid token generation rate is the maximum tokens-per-second rate an inference system can sustain while keeping the fluidity-index above a specified threshold for a given percentage of requests [5].
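The sketch below illustrates one plausible reading of these definitions; the exact deadline schedule and acceptance criteria come from the Metron paper, and the names used here (fluidity_index, fluid_rate, ttft_target, tbt_target) are placeholders rather than the framework's API. It assumes the first token is due ttft_target after the request arrives, each later token is due tbt_target after the previous deadline, and the schedule is rebased after a miss so a single stall is not counted against every following token:

```python
from statistics import mean

def fluidity_index(request_arrival: float, token_times: list[float],
                   ttft_target: float, tbt_target: float) -> float:
    """Fraction of tokens that meet their latency deadlines (simplified sketch)."""
    deadline = request_arrival + ttft_target
    met = 0
    for t in token_times:
        if t <= deadline:
            met += 1
            deadline += tbt_target      # on time: keep the target cadence
        else:
            deadline = t + tbt_target   # missed: rebase the schedule at the late token
    return met / len(token_times)

def fluid_rate(requests: list[tuple[float, list[float]]], ttft_target: float,
               target_index: float = 0.9, satisfied_fraction: float = 0.99) -> float:
    """Largest token generation rate (1 / tbt_target) at which at least
    `satisfied_fraction` of recorded requests achieve fluidity-index >= target_index.
    Binary search over the per-token deadline, assuming the criterion is monotone
    in tbt_target and that the loosest deadline (1 s per token) is achievable."""
    lo, hi = 1e-4, 1.0                  # search tbt_target between 0.1 ms and 1 s
    for _ in range(40):
        mid = (lo + hi) / 2
        ok = mean(
            fluidity_index(arrival, times, ttft_target, mid) >= target_index
            for arrival, times in requests
        ) >= satisfied_fraction
        hi, lo = (mid, lo) if ok else (hi, mid)
    return 1.0 / hi

# Example: traces recorded as (request_arrival, token_emission_times) pairs.
traces = [(0.0, [0.2 + 0.05 * i for i in range(20)]),
          (0.0, [0.3 + 0.06 * i for i in range(20)])]
print(fluidity_index(0.0, traces[0][1], ttft_target=0.25, tbt_target=0.06))
print(fluid_rate(traces, ttft_target=0.35, satisfied_fraction=1.0))
```

Unlike the averaged metrics above, a deadline-based score penalizes every late token individually, so a stream with one long stall and a stream that is uniformly slightly slow are no longer indistinguishable.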