Taming Long Audio Sequences: Audio Mamba Achieves Transformer-Level Performance Without Self-Attention
What specific results did Audio Mamba achieve on benchmarks such as AudioSet, VGGSound, and VoxCeleb?

Audio Mamba demonstrated competitive performance across various benchmarks, including AudioSet, VGGSound, and VoxCeleb. On AudioSet, AuM achieved a mean average precision (mAP) of 32.43%, surpassing AST's 29.10%. On VGGSound, Audio Mamba outperformed AST by more than 5 percentage points in absolute accuracy, reaching 42.58% compared to AST's 37.25%.
What are the primary challenges associated with using transformers in audio classification?

The primary challenge in using transformers for audio classification is the computational cost of the self-attention mechanism, which scales quadratically with sequence length and therefore becomes inefficient for long audio sequences. This inefficiency is a significant hurdle in applications ranging from speech recognition to environmental sound classification, since it demands more compute and time. Addressing it is crucial for building models that can efficiently handle the increasing volume and complexity of audio data.
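To make the quadratic cost concrete, the toy snippet below (a sketch, not code from AST or AuM) materializes the attention score matrix for a few hypothetical token counts; the embedding dimension and sequence lengths are assumptions chosen for illustration. The score matrix alone has L × L entries, so memory and compute grow with the square of the sequence length.

```python
import torch

# Toy illustration: the attention score matrix grows quadratically
# with the number of tokens L (e.g., spectrogram patches).
d = 768                      # embedding dimension (assumed)
for L in (512, 1024, 2048):  # hypothetical token counts for longer audio clips
    q = torch.randn(L, d)
    k = torch.randn(L, d)
    scores = q @ k.T / d**0.5                # shape (L, L): O(L^2) memory and FLOPs
    print(L, tuple(scores.shape), scores.numel())  # 512 -> 262144, 1024 -> 1048576, ...
```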
What technological advancements have led to the evolution of audio classification methods?

The evolution of audio classification methods has been significantly influenced by the advancements in machine learning and deep learning techniques. Initially, Convolutional Neural Networks (CNNs) were the dominant approach in audio classification. However, the field has recently shifted towards transformer-based architectures, which offer improved performance and the ability to handle various tasks through a unified approach.
Transformers, with their self-attention mechanisms, have surpassed CNNs in performance, particularly in tasks that require extensive contextual understanding and diverse input data types, marking a paradigm shift in deep learning for audio classification.
One of the most prominent methods for audio classification is the Audio Spectrogram Transformer (AST). ASTs split a spectrogram into patch tokens and use self-attention to capture global context across them. However, they incur high computational costs because self-attention scales quadratically with the sequence length.
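For illustration, the sketch below shows one way a log-mel spectrogram might be split into patch tokens before the self-attention layers. The sizes (128 mel bins, 1024 frames, non-overlapping 16x16 patches, 768-dimensional embeddings) are assumptions for the example; the published AST additionally uses overlapping patches and ViT-pretrained weights.

```python
import torch
import torch.nn as nn

# Illustrative patch embedding for a spectrogram-transformer-style model.
mel_bins, frames, embed_dim = 128, 1024, 768
patch_embed = nn.Conv2d(1, embed_dim, kernel_size=16, stride=16)  # one token per 16x16 patch

spectrogram = torch.randn(1, 1, mel_bins, frames)   # (batch, channel, freq, time)
tokens = patch_embed(spectrogram)                   # (1, 768, 8, 64)
tokens = tokens.flatten(2).transpose(1, 2)          # (1, 512, 768): 512 patch tokens
print(tokens.shape)  # these tokens are what self-attention operates on
```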
To address this issue, researchers have explored alternatives such as state space models (SSMs). SSMs such as Mamba have shown promise in language and vision tasks by replacing self-attention with input-dependent, time-varying recurrence parameters that capture global context at linear cost in sequence length. Audio Mamba (AuM), a self-attention-free model built on state space models, extends this line of work to audio classification by processing audio spectrogram patches with a bidirectional approach.
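The snippet below is a deliberately simplified sketch of the bidirectional state-space idea: a plain linear recurrence run over the patch sequence forward and backward, so cost grows linearly with length. It is not the Mamba selective-scan kernel or the actual AuM architecture; in particular, Mamba makes the recurrence parameters input-dependent, which is omitted here for brevity, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class ToyBidirectionalSSM(nn.Module):
    """Simplified illustration of a bidirectional state-space layer.

    Uses a fixed linear recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t,
    run over the token sequence in both directions and summed. The real
    Mamba block instead uses input-dependent (selective) parameters and a
    hardware-efficient parallel scan.
    """

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_state, d_state) * 0.01)
        self.B = nn.Linear(d_model, d_state, bias=False)
        self.C = nn.Linear(d_state, d_model, bias=False)

    def scan(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model); sequential state update, O(L) in length
        h = x.new_zeros(x.shape[0], self.A.shape[0])
        outputs = []
        for t in range(x.shape[1]):
            h = h @ self.A.T + self.B(x[:, t])
            outputs.append(self.C(h))
        return torch.stack(outputs, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        forward_pass = self.scan(x)
        backward_pass = self.scan(x.flip(1)).flip(1)  # scan the reversed patch order
        return forward_pass + backward_pass           # combine both directions

# Usage: a batch of 1024 spectrogram-patch embeddings of width 768 (assumed sizes)
tokens = torch.randn(2, 1024, 768)
print(ToyBidirectionalSSM(768)(tokens).shape)  # torch.Size([2, 1024, 768])
```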
Overall, the evolution of audio classification methods has been driven by advancements in machine learning, deep learning, and the development of more efficient models like Audio Mamba.