Spatial-temporal multiplexing in MuxServe is a technique for efficiently serving multiple large language models (LLMs) on shared GPUs. It colocates LLMs according to their popularity so that they multiplex memory resources, and it separates the prefill and decoding phases so that they can be flexibly colocated to multiplex computation resources, exploiting the distinct compute requirements of each phase. Together, these two forms of multiplexing raise GPU utilization and improve overall system performance.
MuxServe's greedy placement algorithm prioritizes LLMs with large computation requirements, where the requirement reflects both model scale and popularity. It enumerates the possible GPU mesh groups and, for each group, greedily places the LLMs onto the group's meshes so as to maximize the throughput of each resulting LLM unit. Colocating LLMs with different popularity and resource needs in this way improves GPU utilization.
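The greedy placement described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not MuxServe's actual implementation: the names `LLM`, `Mesh`, `fits`, `estimate_throughput`, and `greedy_place` are hypothetical, computation demand is approximated as model size times request rate, and the throughput estimator and the per-GPU memory and compute constants are toy stand-ins.

```python
from dataclasses import dataclass, field

GPU_MEM_GB = 80.0     # assumed memory per GPU (hypothetical constant)
GPU_CAPACITY = 100.0  # assumed compute per GPU, in (billion params x requests/s)


@dataclass
class LLM:
    name: str
    params_b: float  # model scale, in billions of parameters
    rate: float      # popularity, as request arrival rate (requests/s)

    @property
    def compute_demand(self) -> float:
        # Assumption: demand grows with both model scale and popularity.
        return self.params_b * self.rate

    @property
    def weights_gb(self) -> float:
        # Assumption: 2 bytes per parameter (fp16 weights).
        return 2.0 * self.params_b


@dataclass
class Mesh:
    num_gpus: int
    llms: list = field(default_factory=list)


def fits(mesh: Mesh, llm: LLM) -> bool:
    """Check that the model weights still fit in the mesh's total memory."""
    used = sum(m.weights_gb for m in mesh.llms)
    return used + llm.weights_gb <= GPU_MEM_GB * mesh.num_gpus


def estimate_throughput(mesh: Mesh) -> float:
    """Toy estimator: served rate saturates once demand exceeds capacity."""
    demand = sum(m.compute_demand for m in mesh.llms)
    if demand == 0:
        return 0.0
    total_rate = sum(m.rate for m in mesh.llms)
    return total_rate * min(1.0, GPU_CAPACITY * mesh.num_gpus / demand)


def greedy_place(llms, mesh_groups):
    """Return (best_total_throughput, meshes) over all candidate mesh groups."""
    # Prioritize LLMs with the largest computation requirement.
    ordered = sorted(llms, key=lambda m: m.compute_demand, reverse=True)
    best_total, best_meshes = -1.0, None
    for group in mesh_groups:  # each group is a tuple of mesh sizes
        meshes = [Mesh(n) for n in group]
        feasible = True
        for llm in ordered:
            best_gain, best_mesh = None, None
            for mesh in meshes:
                if not fits(mesh, llm):
                    continue
                before = estimate_throughput(mesh)
                mesh.llms.append(llm)  # tentatively place, then undo
                gain = estimate_throughput(mesh) - before
                mesh.llms.pop()
                if best_gain is None or gain > best_gain:
                    best_gain, best_mesh = gain, mesh
            if best_mesh is None:  # this LLM fits on no mesh in the group
                feasible = False
                break
            best_mesh.llms.append(llm)
        if not feasible:
            continue
        total = sum(estimate_throughput(m) for m in meshes)
        if total > best_total:
            best_total, best_meshes = total, meshes
    return best_total, best_meshes


# Example: three models of different scale and popularity on 4 GPUs.
models = [LLM("llm-7b", 7, 10.0), LLM("llm-13b", 13, 2.0), LLM("llm-65b", 65, 1.0)]
total, placement = greedy_place(models, [(4,), (2, 2), (1, 1, 2)])
for mesh in placement:
    print(mesh.num_gpus, [m.name for m in mesh.llms])
```

In a real system the estimator would come from profiling or a cost model rather than the closed-form stand-in used here; the sketch only shows the control flow of sorting by demand, enumerating mesh groups, and placing each LLM where the estimated throughput gain is largest.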
MuxServe improves GPU utilization for LLM serving in two ways. It colocates LLMs according to their popularity to multiplex memory resources, and it exploits the differing characteristics of the prefill and decoding phases by separating them and flexibly colocating them to multiplex computation resources. A novel placement algorithm, an adaptive batch scheduling strategy, and a unified resource manager make this multiplexing efficient.