
Whisper WebGPU runs OpenAI's Whisper model directly in the browser to provide real-time speech recognition. It was developed by a Hugging Face engineer known by the nickname 'Xenova'. Because recognition happens in the browser rather than on a server, it improves privacy and keeps working even when the device is offline [6].

Under the hood, Whisper WebGPU runs the model with Hugging Face Transformers.js and ONNX Runtime Web. It uses Whisper-base, a 73-million-parameter speech recognition model optimized for web inference. At approximately 200 MB, the model is small enough to download and run in a browser while still delivering capable recognition.
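As a rough illustration of this setup, the following TypeScript sketch shows how a page might request a Whisper transcriber from Transformers.js. The `pipeline` factory is passed in as a parameter so the snippet stays self-contained; its type here is a simplified assumption, not the library's full signature. The model id `onnx-community/whisper-base`, the `wasm` fallback, and the helper names `pickDevice` and `transcribe` are likewise assumptions made for illustration.

```typescript
// Simplified, assumed shape of the parts of Transformers.js used in this sketch.
type Device = "webgpu" | "wasm";
type Transcriber = (audio: Float32Array) => Promise<{ text: string }>;
type PipelineFactory = (
  task: "automatic-speech-recognition",
  model: string,
  options: { device: Device },
) => Promise<Transcriber>;

// Assumed model id for ONNX-converted Whisper-base weights.
const MODEL_ID = "onnx-community/whisper-base";

// Use WebGPU when the browser exposes it, otherwise fall back to WebAssembly.
// Simplification: a real page may also need to confirm that a GPU adapter is available.
function pickDevice(hasWebGPU: boolean): Device {
  return hasWebGPU ? "webgpu" : "wasm";
}

// Build the transcriber and run it on mono audio samples
// (Whisper expects audio sampled at 16 kHz).
async function transcribe(
  pipeline: PipelineFactory,
  audio: Float32Array,
  hasWebGPU: boolean,
): Promise<string> {
  const transcriber = await pipeline("automatic-speech-recognition", MODEL_ID, {
    device: pickDevice(hasWebGPU),
  });
  const { text } = await transcriber(audio);
  return text.trim();
}
```

In a browser, `pipeline` would be the function of that name exported by `@huggingface/transformers`, and `hasWebGPU` could be computed as `"gpu" in navigator`. Injecting the factory also makes the function easy to exercise with a stub.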
Once the model is downloaded, it is cached, so later sessions can start without downloading it again. All computation is performed locally in the user's browser, and no audio is sent to a server. This protects privacy and means that users can disconnect from the internet after the initial model load and still use Whisper's speech recognition.
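The download-once, cache-afterwards behavior can be sketched as a cache-first lookup. This illustrates the general pattern only and is not Transformers.js's actual implementation: `ByteCache`, `MemoryCache`, and `loadWeights` are hypothetical names, and in a browser the cache would typically be backed by persistent storage such as the Cache API rather than an in-memory map.

```typescript
// Hypothetical minimal cache interface; a browser version could wrap the Cache API.
interface ByteCache {
  get(url: string): Promise<Uint8Array | undefined>;
  put(url: string, bytes: Uint8Array): Promise<void>;
}

// Function that downloads a file and returns its raw bytes.
type Fetcher = (url: string) => Promise<Uint8Array>;

// In-memory stand-in used for demonstration; it does not persist across page loads.
class MemoryCache implements ByteCache {
  private store = new Map<string, Uint8Array>();

  async get(url: string): Promise<Uint8Array | undefined> {
    return this.store.get(url);
  }

  async put(url: string, bytes: Uint8Array): Promise<void> {
    this.store.set(url, bytes);
  }
}

// Cache-first load: hit the network only when the weights are not cached yet.
async function loadWeights(
  url: string,
  cache: ByteCache,
  fetchBytes: Fetcher,
): Promise<{ bytes: Uint8Array; fromCache: boolean }> {
  const cached = await cache.get(url);
  if (cached !== undefined) {
    return { bytes: cached, fromCache: true };
  }
  const bytes = await fetchBytes(url);
  await cache.put(url, bytes);
  return { bytes, fromCache: false };
}
```

After the first successful call, `loadWeights` no longer touches the network for that URL, which is what allows recognition to continue once the user goes offline.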
The model is distributed as ONNX (Open Neural Network Exchange) weights. ONNX is an open-source format for AI models that allows models trained in different frameworks to be shared and run by a common runtime, in this case ONNX Runtime Web.
In summary, Whisper WebGPU achieves real-time speech recognition within a web browser by combining the optimized Whisper-base model, local computation through Transformers.js and ONNX Runtime Web, and ONNX weights that make the model portable across frameworks.