
The optimizers most commonly used to train deep neural networks are first-order methods such as stochastic gradient descent with momentum (SGDM) and AdamW, favored for their efficiency when training large-scale models.
Second-order optimizers such as K-FAC, Shampoo, AdaBK, and Sophia, in contrast, demonstrate superior convergence properties but often incur significant computational and memory costs; these costs hinder their adoption for training large models within limited memory budgets. Such optimizers must compute and store additional information, such as the Hessian matrix or an approximation of it, which increases both memory consumption and computational complexity.
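To make the overhead concrete, the following NumPy sketch follows the general Shampoo recipe for a single m x n weight matrix: it accumulates the two Kronecker-factored statistics L (m x m) and R (n x n) from the gradient and preconditions the gradient with their inverse fourth roots. This is an illustrative simplification with invented function names (shampoo_step, inv_p_root); damping schedules, grafting, and update intervals used in practice are omitted.

```python
import numpy as np

def inv_p_root(M, p, floor=1e-12):
    """M^(-1/p) for a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, floor, None) ** (-1.0 / p)) @ V.T

def shampoo_step(G, L, R, eps=1e-6):
    """One illustrative Shampoo-style step for a 2-D layer: accumulate the
    Kronecker-factored statistics and precondition the gradient with their
    inverse fourth roots."""
    L = L + G @ G.T                  # left statistic,  shape (m, m)
    R = R + G.T @ G                  # right statistic, shape (n, n)
    G_pre = (inv_p_root(L + eps * np.eye(len(L)), 4)
             @ G
             @ inv_p_root(R + eps * np.eye(len(R)), 4))
    return G_pre, L, R

m, n = 1024, 4096
G = np.random.randn(m, n).astype(np.float32)
L = np.zeros((m, m), dtype=np.float32)
R = np.zeros((n, n), dtype=np.float32)
G_pre, L, R = shampoo_step(G, L, R)

# The optimizer state adds two dense matrices on top of the gradient itself.
print("gradient:", G.nbytes / 2**20, "MiB")               # 16 MiB
print("L and R :", (L.nbytes + R.nbytes) / 2**20, "MiB")  # 68 MiB in fp32
```

Even for this single layer, the two fp32 statistics take several times more memory than the gradient, and the eigendecompositions needed for the inverse roots add a nontrivial compute cost on top of the forward and backward passes.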
To address the memory consumption issue, researchers have explored two main approaches: factorization and quantization. Factorization represents optimizer states with low-rank approximations, while quantization compresses the 32-bit optimizer states into low-bit representations. Quantization has been applied successfully to first-order optimizers, but adapting it to second-order optimizers poses a greater challenge due to the matrix operations these methods involve.
Recently, researchers have proposed the first 4-bit second-order optimizer, taking Shampoo as the test case, that maintains performance comparable to its 32-bit counterpart. The key contribution is to quantize the eigenvector matrix of the preconditioner rather than the preconditioner itself. Doing so preserves the preconditioner's small singular values, which are crucial for computing its inverse fourth root accurately, and thereby avoids performance degradation.
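This effect can be reproduced with a small numerical experiment. The NumPy sketch below uses a toy symmetric linear 4-bit quantizer (quantize_4bit, an invented stand-in for the block-wise scheme used in practice), builds an ill-conditioned preconditioner, and compares the inverse fourth root obtained after quantizing the preconditioner directly against the one obtained after quantizing only its eigenvector matrix while keeping the eigenvalues in full precision.

```python
import numpy as np

def quantize_4bit(x):
    """Toy symmetric linear 4-bit quantizer, for illustration only."""
    scale = np.abs(x).max() / 7.0
    return np.round(x / scale).clip(-8, 7) * scale

def inv_fourth_root(S, floor=1e-12):
    """S^(-1/4) via eigendecomposition, flooring eigenvalues to stay PSD."""
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, floor, None) ** -0.25) @ V.T

rng = np.random.default_rng(0)
n = 256
V, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal basis
w = np.logspace(-6, 0, n)                         # wide, ill-conditioned spectrum
P = (V * w) @ V.T                                 # preconditioner
P_root = (V * w ** -0.25) @ V.T                   # exact P^(-1/4)

# (a) quantize the preconditioner itself: its small eigenvalues are destroyed
root_a = inv_fourth_root(quantize_4bit(P))

# (b) quantize only the eigenvectors, keep the eigenvalues in full precision
Vq = quantize_4bit(V)
root_b = (Vq * w ** -0.25) @ Vq.T

rel = lambda X: np.linalg.norm(X - P_root) / np.linalg.norm(P_root)
print("quantize P directly  :", rel(root_a))   # very large error
print("quantize eigenvectors:", rel(root_b))   # much smaller error
```

The gap arises because the orthogonal eigenvector matrix has well-bounded entries that survive coarse quantization, whereas the preconditioner's smallest eigenvalues fall below a single quantization step and are effectively erased. Practical 4-bit schemes typically quantize block-wise and may also correct the orthogonality of the dequantized eigenvectors; this toy sketch omits both.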
Overall, while second-order optimizers offer superior convergence properties, their higher computational and memory costs have limited their widespread adoption. However, ongoing research aims to address these challenges and make second-order optimizers more practical for training large-scale deep neural networks.

Factorization and quantization reduce the memory consumed by optimizer states in DNN training, easing one of the main memory bottlenecks of training large-scale models [6].
Factorization represents optimizer states with low-rank approximations [6]. This strategy has been applied to both first-order and second-order optimizers [6]. Because only the dominant low-rank structure is retained, the memory required to store the optimizer states drops substantially [6].
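As a concrete illustration, the NumPy sketch below keeps a rank-1 factored estimate of the squared-gradient statistics in the spirit of Adafactor: one row accumulator and one column accumulator replace the full m x n second-moment matrix. The function name and hyperparameters are invented for the example, and bias correction and the parameter update itself are omitted.

```python
import numpy as np

def factored_second_moment(G2, r, c, beta2=0.999):
    """Rank-1 factored running estimate of the squared-gradient statistics:
    store one row vector and one column vector instead of the full m x n
    second-moment matrix (Adafactor-style sketch)."""
    r = beta2 * r + (1 - beta2) * G2.sum(axis=1)   # per-row accumulator, shape (m,)
    c = beta2 * c + (1 - beta2) * G2.sum(axis=0)   # per-column accumulator, shape (n,)
    V_hat = np.outer(r, c) / r.sum()               # rank-1 reconstruction, built on the fly
    return V_hat, r, c

m, n = 1024, 4096
G = np.random.randn(m, n).astype(np.float32)
r = np.zeros(m, dtype=np.float32)
c = np.zeros(n, dtype=np.float32)
V_hat, r, c = factored_second_moment(G ** 2, r, c)

# Persistent state shrinks from m*n floats to m + n floats.
print("full second moment:", m * n * 4 / 2**20, "MiB")    # 16 MiB
print("factored state    :", (m + n) * 4 / 2**10, "KiB")  # 20 KiB
```

Only the two vectors are kept between steps; the full matrix is reconstructed transiently when the update needs it, which is where the memory saving comes from.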
Quantization, on the other hand, compresses the 32-bit optimizer states into low-bit representations and has been applied successfully to first-order optimizers [6]. Adapting it to second-order optimizers is more challenging because of the matrix operations these methods rely on. As described above, the recently proposed 4-bit Shampoo addresses this by quantizing the eigenvector matrix of the preconditioner rather than the preconditioner itself, preserving the small singular values needed to compute the inverse fourth root accurately and thereby matching the performance of its 32-bit counterpart.
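For first-order states, quantization is comparatively straightforward because the states are read back entry by entry. The NumPy sketch below uses a simplified linear block-wise quantizer (invented names quantize_blockwise and dequantize_blockwise) to compress a momentum-like tensor to 4-bit codes plus one fp32 scale per block; production low-bit optimizers use nonlinear quantization maps and pack two 4-bit codes per byte, which this sketch does not do.

```python
import numpy as np

BLOCK = 128  # per-block scaling bounds the error caused by outliers

def quantize_blockwise(state, bits=4):
    """Compress a flat optimizer-state tensor to low-bit codes plus one
    fp32 scale per block (simplified linear scheme)."""
    levels = 2 ** (bits - 1) - 1
    x = state.reshape(-1, BLOCK)
    scales = np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-12) / levels
    codes = np.round(x / scales).astype(np.int8)   # a real 4-bit format packs 2 codes/byte
    return codes, scales

def dequantize_blockwise(codes, scales):
    """Recover an fp32 approximation of the state, entry by entry."""
    return (codes * scales).reshape(-1)

m1 = np.random.randn(1 << 20).astype(np.float32)   # e.g. a first-moment tensor
codes, scales = quantize_blockwise(m1)
m1_hat = dequantize_blockwise(codes, scales)

print("max abs error:", np.abs(m1 - m1_hat).max())
print("fp32 state:", m1.nbytes / 2**20, "MiB",
      "-> codes + scales:", (codes.nbytes + scales.nbytes) / 2**20, "MiB")
```

Because each entry is quantized, stored, and dequantized independently, the error introduced here stays local to that entry, which is why this recipe transfers directly to elementwise first-order states.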
By employing factorization and quantization, the memory consumption of optimizer states can be significantly reduced, enabling more efficient training of deep neural networks [6].

Adapting quantization techniques to second-order optimizers is harder than adapting them to first-order optimizers because of the matrix operations second-order methods depend on [1]. First-order optimizer states are used elementwise, so each entry can be quantized and dequantized independently; second-order optimizers instead feed their states through matrix operations such as inverse roots, where per-entry quantization errors interact [1]. This difference makes it more difficult to quantize second-order optimizer states while maintaining their effectiveness and accuracy.
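A small numerical contrast makes the difference visible. In the NumPy sketch below (toy_quant is an invented, illustrative quantizer), the error on an elementwise state stays bounded by half a quantization step per entry, whereas the same per-entry error on an ill-conditioned preconditioner couples through the spectrum and can push its smallest eigenvalues to zero or below, which is exactly what the inverse-root computation cannot tolerate.

```python
import numpy as np

rng = np.random.default_rng(1)

def toy_quant(x, levels=7):
    """Crude symmetric 4-bit-style quantizer, for illustration only."""
    s = np.abs(x).max() / levels
    return np.round(x / s) * s

# Elementwise state (e.g. Adam's moments): each entry is read back on its
# own, so the error is bounded by half a quantization step per entry.
v = rng.random(1_000_000).astype(np.float32)
print("max elementwise error:", np.abs(toy_quant(v) - v).max())

# Matrix state (e.g. a Shampoo preconditioner): the eigenvalues depend on
# all entries at once, so the same per-entry error can wipe out the small
# end of an ill-conditioned spectrum.
Q, _ = np.linalg.qr(rng.standard_normal((256, 256)))
w = np.logspace(-5, 0, 256)
P = (Q * w) @ Q.T
w_q = np.linalg.eigvalsh(toy_quant(P))
print("smallest true eigenvalue     :", w[0])      # 1e-5
print("smallest quantized eigenvalue:", w_q[0])    # typically near zero or negative
```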