Vector Quantized Variational Autoencoders (VQ-VAEs) are used in image modeling to learn discrete latent representations of images: each continuous encoder output is mapped to the index of its nearest entry in a codebook of learnable vectors. This enables efficient image compression and generation, and VQ-VAE and its extensions have been successfully combined with autoregressive models to perform image synthesis.
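For intuition, the quantization step at the core of a VQ-VAE can be sketched as below. This is a minimal, illustrative example, not any paper's exact implementation: the codebook size and embedding dimension are placeholder choices, and the commitment and codebook loss terms are omitted.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal sketch of a VQ-VAE quantization layer (illustrative only)."""

    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        # Learnable codebook of num_codes discrete embedding vectors.
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z_e):
        # z_e: continuous encoder output, shape (batch, tokens, dim).
        # Squared distance from each token to every codebook vector.
        dists = (z_e.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        indices = dists.argmin(dim=-1)   # discrete token ids, (batch, tokens)
        z_q = self.codebook(indices)     # nearest codebook vectors
        # Straight-through estimator: gradients flow to the encoder as if
        # quantization were the identity function.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices
```

A downstream autoregressive model then operates on the discrete indices rather than on the continuous encoder output.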
The new technique from MIT CSAIL and partners improves model flexibility by eliminating the need for vector quantization in autoregressive image generation. Instead of discretizing, it uses a diffusion process to model each token's probability distribution directly in a continuous-valued space: a denoising network, conditioned on the autoregressive model's output, defines the distribution over the next token's continuous value. The model can therefore predict tokens without ever converting the data into discrete codes, improving generation quality and efficiency while preserving the information that quantization would discard.
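A rough sketch of this per-token diffusion loss follows, assuming a simplified DDPM-style noise-prediction objective and hypothetical module names; it illustrates the general idea rather than the authors' exact architecture or noise schedule.

```python
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Small denoising MLP modeling p(x | z) for one continuous token.
    Hypothetical sketch of the diffusion-loss idea, not the paper's code."""

    def __init__(self, token_dim=16, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, x_t, t, z):
        # Predict the noise added to the clean token, given the noised
        # token x_t, the timestep t, and the conditioning vector z
        # produced by the autoregressive backbone.
        t_emb = t.float().unsqueeze(-1) / 1000.0  # crude timestep embedding
        return self.net(torch.cat([x_t, t_emb, z], dim=-1))

def diffusion_loss(head, x0, z, num_steps=1000):
    """Noise-prediction loss that defines the per-token distribution."""
    t = torch.randint(0, num_steps, (x0.shape[0],), device=x0.device)
    # Simple linear alpha-bar schedule, for illustration only.
    alpha_bar = (1.0 - (t.float() + 1) / num_steps).clamp(min=1e-4).unsqueeze(-1)
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    pred = head(x_t, t, z)
    return ((pred - noise) ** 2).mean()
```

At sampling time, each token is drawn by running the reverse diffusion process conditioned on z, so no codebook lookup is needed at any point.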
Traditional autoregressive image models rely on vector-quantized representations, which bring two challenges: the finite codebook caps image reconstruction quality, and training a good tokenizer is computationally intensive. Because discrete codes can only approximate the complex distributions of continuous image data, the fidelity of the generated images suffers.