YaFSDP is an open-source tool developed by Yandex that aims to revolutionize the training of large language models (LLMs) by significantly reducing GPU resource consumption and training time. It focuses on two fronts: optimizing memory consumption and eliminating communication bottlenecks. To achieve this, YaFSDP shards layers instead of individual parameters, which keeps communications efficient and avoids redundant operations, and it pre-allocates buffers for all required data so that memory usage stays predictable.
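To make the layer-granularity idea concrete, here is a minimal sketch using the standard PyTorch FSDP API rather than YaFSDP's own wrappers (whose interface differs); `TransformerBlock` is a placeholder for a real decoder-layer class:

```python
import functools
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Placeholder for a real decoder layer (e.g. LlamaDecoderLayer).
class TransformerBlock(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)

# Assumes torch.distributed.init_process_group("nccl") has already run
# (e.g. under torchrun) and the local CUDA device has been selected.
model = nn.Sequential(*[TransformerBlock() for _ in range(4)])

# Wrap at layer granularity: each TransformerBlock becomes one sharding
# unit, so its weights are gathered and re-sharded once per layer instead
# of once per individual parameter tensor.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},
)
sharded_model = FSDP(model, auto_wrap_policy=wrap_policy)
```

Coarser sharding units mean fewer, larger collective operations, which is generally friendlier to the interconnect than many small ones.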
On the memory side, YaFSDP shards weights, gradients, and optimizer states across multiple GPUs, eliminating the need to duplicate these components on every device. It also leverages activation checkpointing: during the forward pass only a small set of activations is stored, and the rest are recomputed during the backward pass. This significantly reduces the memory footprint without compromising the training process, enabling more efficient training of large language models.
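Activation checkpointing itself is easy to see in isolation. The sketch below uses PyTorch's built-in `torch.utils.checkpoint` rather than YaFSDP's internal machinery, with a toy `MLPBlock` standing in for a transformer layer:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class MLPBlock(nn.Module):
    """Toy block standing in for a transformer layer."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.net(x)

blocks = nn.ModuleList([MLPBlock() for _ in range(8)])
x = torch.randn(2, 16, 1024, requires_grad=True)

# Without checkpointing, the intermediate activations of all 8 blocks
# stay in memory until backward. With checkpoint(), only each block's
# input is kept; the activations inside the block are recomputed
# on the fly during the backward pass.
for block in blocks:
    x = checkpoint(block, x, use_reentrant=False)

x.sum().backward()
```

The trade is extra compute for memory: each block's forward runs twice, but the peak activation footprint drops from all layers to roughly one.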
On the communication side, YaFSDP ensures data is transferred between GPUs only when necessary and overlaps communication with computation using CUDA streams, so that while one layer is being computed, the weights for the next layer can already be gathered in the background.
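A minimal sketch of this overlap pattern, assuming an initialized NCCL process group; the weight shards, buffer sizes, and double-buffering scheme here are illustrative, not YaFSDP's actual internals:

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has been called (e.g. under
# torchrun) and torch.cuda.set_device(local_rank) selected this rank's GPU.
world_size = dist.get_world_size()
device = torch.device("cuda")

dim, num_layers = 1024, 4
# Each rank stores a 1/world_size row-shard of every layer's weight.
shards = [
    torch.randn(dim // world_size, dim, device=device)
    for _ in range(num_layers)
]
# Two pre-allocated gather buffers, reused across all layers.
buffers = [torch.empty(dim, dim, device=device) for _ in range(2)]

comm_stream = torch.cuda.Stream()
x = torch.randn(8, dim, device=device)

# Kick off the gather for layer 0 before the loop starts.
with torch.cuda.stream(comm_stream):
    dist.all_gather_into_tensor(buffers[0], shards[0])

for i in range(num_layers):
    # Don't compute with this buffer until its all-gather has finished.
    torch.cuda.current_stream().wait_stream(comm_stream)
    weight = buffers[i % 2]
    if i + 1 < num_layers:
        # The side stream must also wait for compute that still reads the
        # buffer it is about to overwrite (write-after-read hazard).
        comm_stream.wait_stream(torch.cuda.current_stream())
        # Prefetch the next layer's weights; this transfer overlaps with
        # the matmul below, which runs on the default stream.
        with torch.cuda.stream(comm_stream):
            dist.all_gather_into_tensor(buffers[(i + 1) % 2], shards[i + 1])
    x = x @ weight.t()
```

Pre-allocating the two gather buffers up front, instead of allocating per layer, mirrors the buffer-reuse idea mentioned above and avoids allocator churn.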
The resulting efficiency gains are substantial. In a pre-training scenario with a 70-billion-parameter model, YaFSDP freed up the equivalent of roughly 150 GPUs, which translates into potential monthly savings of $0.5 million to $1.5 million, depending on the virtual GPU provider or platform. It also reduced training time by up to 26% compared to existing methods such as FSDP.
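As a back-of-the-envelope check on that savings range (the hourly rates below are illustrative assumptions, not figures from Yandex):

```python
# Rough sanity check on the quoted $0.5M-$1.5M/month range.
gpus_saved = 150
hours_per_month = 24 * 30  # 720

# Hypothetical cloud GPU price points, chosen to bracket typical offers.
for rate in (5.0, 14.0):  # USD per GPU-hour
    monthly = gpus_saved * hours_per_month * rate
    print(f"${rate:.0f}/GPU-hour -> ${monthly / 1e6:.2f}M per month")
```

At $5 per GPU-hour this comes to about $0.54M a month, and at $14 about $1.51M, consistent with the quoted range.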
Yandex has made YaFSDP publicly available on GitHub, allowing ML engineers to improve the efficiency of their LLM training pipelines. By open-sourcing YaFSDP, Yandex aims to foster innovation and collaboration in the AI community, enabling developers to train models faster and more cost-effectively.