Low-rank approximations reduce the computational cost of large-scale optimization by replacing high-dimensional weight matrices with lower-dimensional factorizations, which cuts parameter counts and FLOPs and can speed up training. However, the factorized parametrization also introduces additional symmetries into the loss landscape, which can create saddle points and degrade the optimizer's performance.
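As a concrete illustration, the NumPy sketch below (with hypothetical dimensions) compares the parameter count of a dense matrix with that of a rank-r factorization, and demonstrates the reparametrization symmetry U → UA, V → A⁻¹V that leaves the product unchanged and flattens the loss along whole manifolds of equivalent parameters:

```python
import numpy as np

d_in, d_out, r = 1024, 1024, 64
rng = np.random.default_rng(0)

# Dense weight: d_out * d_in parameters per matmul.
dense_params = d_out * d_in                        # ~1.0M

# Low-rank factorization W ~= U @ V: only r * (d_out + d_in) parameters.
U = rng.standard_normal((d_out, r))
V = rng.standard_normal((r, d_in))
low_rank_params = r * (d_out + d_in)               # ~0.13M
print(f"dense: {dense_params}, low-rank: {low_rank_params}")

# The factorization is only determined up to an invertible r x r matrix A:
# (U A)(A^-1 V) = U V. These continuous symmetries leave the loss unchanged
# along curved directions in parameter space, one way saddle points arise
# in the factorized problem.
A = rng.standard_normal((r, r)) + r * np.eye(r)    # well-conditioned example
same = np.allclose(U @ V, (U @ A) @ (np.linalg.inv(A) @ V))
print(f"same product after reparametrization: {same}")
```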
Self-guided training improves the training dynamics by introducing a dense matrix during the initial phase of training and gradually phasing it out so that the structured matrices take over. This hybrid schedule stabilizes training, accelerates convergence, and smooths the optimization dynamics, reducing the loss spikes and instability often seen when training large language models with structured parametrizations.
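A minimal PyTorch sketch of the idea follows; the linear decay schedule and the `alpha * dense + (1 - alpha) * structured` combination are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class SelfGuidedLinear(nn.Module):
    """Sketch of a self-guided layer: an auxiliary dense matrix dominates early
    training, and a schedule alpha(t) decays from 1 to 0 so the structured
    (here: low-rank) factors gradually take over."""

    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.dense = nn.Linear(d_in, d_out, bias=False)   # phased out over time
        self.down = nn.Linear(d_in, rank, bias=False)     # structured part: V
        self.up = nn.Linear(rank, d_out, bias=False)      # structured part: U

    def forward(self, x, alpha):
        # alpha = 1.0 -> purely dense; alpha = 0.0 -> purely structured.
        return alpha * self.dense(x) + (1.0 - alpha) * self.up(self.down(x))


def alpha_schedule(step, decay_steps):
    """Assumed schedule: linear decay from 1 to 0 over the first decay_steps steps."""
    return max(0.0, 1.0 - step / decay_steps)


layer = SelfGuidedLinear(d_in=512, d_out=512, rank=32)
x = torch.randn(8, 512)
for step in (0, 500, 1000):
    y = layer(x, alpha_schedule(step, decay_steps=1000))
    print(step, y.shape)
```

In this sketch, once alpha reaches zero the dense matrix no longer contributes to the output and could be discarded, leaving only the structured factors for the rest of training and for inference.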
The hybrid structure proposed by Google DeepMind and EPFL combines low-rank and block-diagonal matrices, together with self-guided training, to make the feedforward networks (FFNs) in Transformer architectures more efficient. Because the self-guided phase hands off from the dense matrix to the structured matrices, the approach sidesteps the optimization issues described above while preserving training stability and fast convergence.
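The sketch below shows one plausible way to assemble such a hybrid FFN in PyTorch, using a low-rank up-projection and a block-diagonal down-projection; the placement of the two structures and all dimensions are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockDiagonalLinear(nn.Module):
    """Block-diagonal weight: `blocks` independent sub-matrices on the diagonal."""

    def __init__(self, d_in, d_out, blocks):
        super().__init__()
        assert d_in % blocks == 0 and d_out % blocks == 0
        self.blocks = blocks
        self.weight = nn.Parameter(
            torch.randn(blocks, d_in // blocks, d_out // blocks)
            / (d_in // blocks) ** 0.5
        )

    def forward(self, x):
        b = self.blocks
        x = x.view(*x.shape[:-1], b, -1)                    # (..., blocks, d_in/blocks)
        y = torch.einsum("...bi,bio->...bo", x, self.weight)
        return y.reshape(*y.shape[:-2], -1)                 # (..., d_out)


class HybridFFN(nn.Module):
    """Illustrative Transformer FFN: low-rank up-projection, block-diagonal
    down-projection (the exact placement in the paper may differ)."""

    def __init__(self, d_model=512, d_ff=2048, rank=64, blocks=8):
        super().__init__()
        self.up = nn.Sequential(
            nn.Linear(d_model, rank, bias=False),           # low-rank factor V
            nn.Linear(rank, d_ff, bias=False),              # low-rank factor U
        )
        self.down = BlockDiagonalLinear(d_ff, d_model, blocks)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))


ffn = HybridFFN()
print(ffn(torch.randn(2, 16, 512)).shape)                   # torch.Size([2, 16, 512])
```

Both structured pieces multiply into the activations with far fewer parameters and FLOPs than a dense d_model × d_ff projection, which is where the efficiency gain in the FFN comes from.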