LoRA traditionally handles adapter parameters by freezing the pre-trained model weights and introducing a pair of trainable low-rank matrices, A and B. These matrices are learned for the specific downstream task, and their product approximates the weight update. This approach reduces the number of trainable parameters and the computational cost while maintaining model quality [4].
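To make the mechanics concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer. The class name `LoRALinear`, the rank `r`, and the scaling factor `alpha` are illustrative choices for this sketch, not part of any specific library's API.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen linear layer.

    The base weight W stays frozen; only the low-rank factors A (r x in) and
    B (out x r) are trained, so the effective weight is W + (alpha / r) * B @ A.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pre-trained weights
            p.requires_grad = False

        self.scaling = alpha / r
        # A starts with small random values, B with zeros, so the adapter
        # begins as a no-op (a common LoRA initialization convention).
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the low-rank update B @ A applied to the input.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16.0)
    out = layer(torch.randn(2, 768))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(out.shape, trainable)  # only the A and B factors are trainable
```

Only the A and B factors appear in the optimizer, which is where the reduction in trainable parameters comes from: for a 768x768 layer at rank 8, that is roughly 12K trainable values instead of about 590K.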
Deploying large language models (LLMs) presents several challenges, including provisioning significant computational resources, managing memory requirements, addressing potential security vulnerabilities, and ensuring compliance with data privacy regulations [6]. Additionally, optimizing inference latency and maintaining model quality while scaling are critical considerations for successful deployment.
Fusing adapter parameters in LoRA is costly for rapid switching because it modifies a large portion of the base model's weights: swapping adapters then requires undoing one merge and applying another, which incurs significant memory and latency overhead. LoRA therefore forces a trade-off, either losing the rapid switching capability (fused) or incurring up to 30% higher inference latency (unfused). Additionally, LoRA suffers from concept loss in multi-adapter settings, where different adapters overwrite each other's influence and degrade the model's performance.
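The trade-off can be seen in a short PyTorch sketch below. The helper names `fuse_lora` and `unfuse_lora` are hypothetical, not a specific library's API: merging B @ A into the base weight gives full-speed inference, but switching adapters means undoing the merge first, whereas leaving the adapter unfused keeps switching cheap at the cost of an extra low-rank matrix multiply on every forward pass.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_lora(base: nn.Linear, lora_A: torch.Tensor, lora_B: torch.Tensor,
              scaling: float) -> None:
    """Merge the low-rank update into the base weight in place (hypothetical helper).

    After fusing, inference runs at the base model's speed, but switching to a
    different adapter requires subtracting this update (or restoring a saved
    copy of the original weight) before adding the next one; that restore step
    is the switching cost described above.
    """
    base.weight += scaling * (lora_B @ lora_A)

@torch.no_grad()
def unfuse_lora(base: nn.Linear, lora_A: torch.Tensor, lora_B: torch.Tensor,
                scaling: float) -> None:
    """Undo the merge so a different adapter can be applied."""
    base.weight -= scaling * (lora_B @ lora_A)


if __name__ == "__main__":
    base = nn.Linear(768, 768)
    original = base.weight.clone()
    A = torch.randn(8, 768) * 0.01   # rank-8 adapter factors for illustration
    B = torch.randn(768, 8) * 0.01

    fuse_lora(base, A, B, scaling=2.0)    # fused: fast inference, slow to switch
    unfuse_lora(base, A, B, scaling=2.0)  # must undo before loading another adapter
    print(torch.allclose(base.weight, original, atol=1e-5))  # True, up to float error
```

Serving systems that host many adapters typically avoid fusing for exactly this reason, accepting the per-request latency overhead so that any adapter can be attached or detached without rewriting the base weights.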