
The Straight-Through Estimator (STE) in FBI-LLM enables gradient propagation through the non-differentiable sign function during backpropagation, ensuring effective optimization. It makes training binarized models possible by approximating gradients for non-differentiable operations such as clip() and sign().
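To illustrate the idea, here is a minimal PyTorch sketch of sign binarization with a straight-through estimator. The class name `SignSTE` and the clipping threshold are illustrative assumptions, not FBI-LLM's actual implementation.

```python
import torch

class SignSTE(torch.autograd.Function):
    """Binarize with sign() in the forward pass; approximate its
    gradient in the backward pass with a clipped identity."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Straight-through: pass the gradient unchanged, but zero it
        # where |x| > 1, mirroring the clip() in the approximation.
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

# Usage: binarize a weight tensor while keeping it trainable.
w = torch.randn(8, 8, requires_grad=True)
w_bin = SignSTE.apply(w)     # forward: values in {-1, 0, +1}
w_bin.sum().backward()       # backward: gradients reach w via the STE
```

Without the custom backward pass, `torch.sign` would return zero gradients almost everywhere, so the latent full-precision weights would never update.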

The FBI-LLM models were trained on the Amber dataset, a mixture of RefinedWeb, StarCoder, and RedPajama-v1 totaling 1.26 trillion tokens. This large, diverse corpus made it possible to train the fully binarized large language models from scratch.
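For context, here is a hedged sketch of how such a multi-source mixture might be streamed during training with the Hugging Face `datasets` library. The file paths and mixing probabilities are hypothetical placeholders, not the actual Amber recipe.

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical shard locations: assume each source was pre-processed
# into JSONL files sharing a "text" field.
sources = [
    "data/refinedweb/*.jsonl",
    "data/starcoder/*.jsonl",
    "data/redpajama-v1/*.jsonl",
]
streams = [
    load_dataset("json", data_files=path, split="train", streaming=True)
    for path in sources
]

# Illustrative mixing weights only; the true Amber proportions differ.
mixture = interleave_datasets(streams, probabilities=[0.7, 0.1, 0.2], seed=42)

for example in mixture.take(3):
    print(example["text"][:80])
```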

Quantization, in the context of LLMs, is the process of reducing the precision of model weights, typically from 32-bit floating-point numbers to lower-precision representations such as 8-bit or 4-bit integers. This reduces the memory footprint and compute requirements of LLMs, making them more efficient and easier to deploy across a range of devices. Binarization, as in FBI-LLM, is the extreme case: each weight is reduced to a single bit.
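As a concrete reference point, here is a minimal sketch of symmetric per-tensor int8 quantization in PyTorch; it is a didactic example, not any particular library's deployment path.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Map float weights onto signed 8-bit integers with one scale."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover approximate float weights from the int8 representation."""
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)
q, scale = quantize_int8(w)
error = (w - dequantize(q, scale)).abs().max().item()
print(f"max quantization error: {error:.6f}")  # small relative to |w|
```

The int8 tensor needs a quarter of the memory of its float32 counterpart; binarization pushes the same idea to its limit with a single bit per weight.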