
The BentoML engineering team's benchmark study evaluates inference backends on two primary metrics: Time to First Token (TTFT) and Token Generation Rate [1]. TTFT measures the latency from when a request is sent to when the first token is generated, which is crucial for applications requiring immediate feedback. Token Generation Rate measures how many tokens the model generates per second during decoding, which indicates how efficiently the backend can sustain high loads.
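To make the two metrics concrete, here is a minimal Python sketch of how they could be computed from any stream of tokens. It is not the study's measurement harness: the function name `measure_stream`, the injectable `clock` parameter, and the choice to exclude the first token from the decoding rate are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_second, token_count) for one stream.

    Hypothetical helper: `tokens` is any iterable that yields tokens as
    they arrive, for example a streaming response from an inference server.
    """
    start = clock()          # moment the request is considered "sent"
    first = None             # arrival time of the first token
    last = None              # arrival time of the most recent token
    count = 0

    for _ in tokens:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1

    if first is None:
        raise ValueError("stream produced no tokens")

    ttft = first - start

    # Assumption: the decoding rate counts only tokens after the first one,
    # divided by the time between the first and last token arrivals.
    decode_time = last - first
    rate = (count - 1) / decode_time if decode_time > 0 else 0.0
    return ttft, rate, count


if __name__ == "__main__":
    def simulated_stream():
        # Stand-in for a real backend: a slow first token, then steady decoding.
        time.sleep(0.05)
        yield "Hello"
        for word in ["from", "a", "simulated", "backend"]:
            time.sleep(0.01)
            yield word

    ttft, rate, count = measure_stream(simulated_stream())
    print(f"TTFT: {ttft:.3f} s, rate: {rate:.1f} tokens/s over {count} tokens")
```

Passing the clock in as a parameter keeps the timing logic testable with fixed timestamps. In a real benchmark the iterable would wrap a streaming client, and the measurement would be repeated across many concurrent requests rather than taken from a single stream.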

For the vLLM backend, the study found that TTFT stayed consistently low across the different user levels tested. This makes vLLM a strong choice for applications that require quick response times.