Resemble AI's Detect-2B model detects deepfake audio with an accuracy of around 94%. It combines a series of pre-trained sub-models with fine-tuning techniques to analyze audio clips and determine whether they were generated with AI. The model has been tested on a diverse dataset that includes unseen speakers and different languages, and it has demonstrated high performance [3].
Detect-2B identifies AI-generated audio clips using an ensemble of pre-trained sub-models and fine-tuning techniques. It analyzes short time slices across the duration of an input audio clip, predicting a fakeness score for each slice. These scores are aggregated and compared to a tuned threshold to make a final real vs. fake classification for the full audio clip. The model leverages pre-trained components and efficient fine-tuning techniques, making it fast to train and lightweight to deploy.
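The slice-score-aggregate-threshold flow described above can be sketched roughly as follows. This is only an illustration, not Resemble AI's implementation: the names `slice_clip`, `classify_clip`, and `score_slice` are hypothetical, the slice length, hop, and threshold values are placeholders, and the use of a plain mean for aggregation is an assumption, since the aggregation method is not specified here.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def slice_clip(samples: Sequence[float], slice_len: int, hop: int) -> List[Sequence[float]]:
    """Split a clip into fixed-length slices; a short clip yields one slice."""
    if not samples:
        raise ValueError("empty audio clip")
    if len(samples) <= slice_len:
        return [samples]
    return [samples[i:i + slice_len]
            for i in range(0, len(samples) - slice_len + 1, hop)]


def classify_clip(
    samples: Sequence[float],
    score_slice: Callable[[Sequence[float]], float],  # hypothetical per-slice model
    slice_len: int = 16000,   # placeholder: 1 s at 16 kHz
    hop: int = 8000,          # placeholder: 50% overlap
    threshold: float = 0.5,   # placeholder for the tuned threshold
) -> Tuple[str, float]:
    """Score each slice, aggregate the scores, and threshold the result."""
    scores = [score_slice(s) for s in slice_clip(samples, slice_len, hop)]
    clip_score = mean(scores)  # assumption: simple mean aggregation
    label = "fake" if clip_score >= threshold else "real"
    return label, clip_score
```

In practice `score_slice` would wrap the trained detector; here any function that maps a slice to a fakeness score between 0 and 1 can be plugged in, which keeps the aggregation logic easy to test on its own.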
The Detect-2B model architecture is an ensemble of pre-trained self-supervised audio representation models adapted with efficient fine-tuning techniques. It uses Mamba-SSM (state space models) for enhanced sequence modeling and captures the subtle artifacts that distinguish real audio from fake audio. The model is designed for parameter efficiency, allowing fast training and lightweight deployment.
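To give a feel for what a state space model does with a sequence, here is a minimal sketch of a generic discrete linear SSM with a scalar hidden state: h_t = a * h_(t-1) + b * x_t and y_t = c * h_t. It is not Mamba and not Detect-2B's architecture; Mamba additionally makes its parameters depend on the input, which this toy version omits. The function name `ssm_scan` and the parameter values are illustrative assumptions.

```python
from typing import List, Sequence


def ssm_scan(xs: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar linear state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)
    """
    h = 0.0  # initial hidden state
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


# An impulse at t=0 decays geometrically through the hidden state.
print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))  # [2.0, 1.0, 0.5]
```

The hidden state carries information from earlier time steps forward, which is the property that makes this family of models suited to long sequences such as audio frames.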