Introduction
Audio-text attention connects audio signals (speech, music, sound effects) with text. It is essential for automatic speech recognition (ASR), audio captioning, and speech synthesis. Audio differs from text and images in that it is a continuous signal that must be converted into a sequence of feature frames before attention can be applied.
Audio Representation
1. Spectrogram Features
Raw audio → STFT (Short-Time Fourier Transform) → spectrogram
Shape: [time_frames, frequency_bins]
Treated as a "sequence" where each time frame is a token
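A minimal sketch of this pipeline using torch.stft; the sample rate, window, and hop sizes below are illustrative assumptions, not fixed requirements.

```python
import torch

waveform = torch.randn(16000)          # dummy 1 s mono clip at 16 kHz
n_fft, hop_length = 400, 160           # 25 ms window, 10 ms hop at 16 kHz (assumed)

stft = torch.stft(
    waveform,
    n_fft=n_fft,
    hop_length=hop_length,
    window=torch.hann_window(n_fft),
    return_complex=True,
)                                      # [frequency_bins, time_frames]
spectrogram = stft.abs() ** 2          # power spectrogram
spectrogram = spectrogram.transpose(0, 1)   # [time_frames, frequency_bins]
print(spectrogram.shape)               # e.g. torch.Size([101, 201])
```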
2. Mel-Spectrogram
A more perceptually motivated representation: frequencies are mapped onto the mel scale, which approximates human pitch perception
Shape: [T, F_mel], where F_mel is typically 80 or 128 bins
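A minimal sketch with torchaudio; n_mels=80 mirrors the typical F_mel above, and the other parameter values are assumptions.

```python
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,
    hop_length=160,
    n_mels=80,
)
waveform = torch.randn(1, 16000)            # dummy 1 s mono clip
mel = mel_transform(waveform)               # [1, n_mels, time_frames]
mel = mel.squeeze(0).transpose(0, 1)        # [T, F_mel] = [time_frames, 80]
```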
3. MFCC Features
Mel-frequency cepstral coefficients
Compact, low-dimensional representation derived from the mel-spectrogram, typically 13-40 coefficients per frame
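A minimal sketch using torchaudio's MFCC transform; 13 coefficients per frame is a common choice and is assumed here.

```python
import torch
import torchaudio

mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 80},
)
waveform = torch.randn(1, 16000)            # dummy 1 s mono clip
mfcc = mfcc_transform(waveform)             # [1, n_mfcc, time_frames]
mfcc = mfcc.squeeze(0).transpose(0, 1)      # [T, 13]
```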
Audio-Text Attention Architecture
Audio encoder: 1D CNN / Transformer → audio features
Text encoder: Standard transformer → text features
Cross-attention: text queries attend to audio features
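A minimal sketch of the text-queries-attend-to-audio pattern using torch.nn.MultiheadAttention; the batch size, sequence lengths, and model dimension are assumptions.

```python
import torch
import torch.nn as nn

d_model = 256
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

audio_feats = torch.randn(2, 500, d_model)   # [batch, audio_frames, d_model] from the audio encoder
text_feats = torch.randn(2, 20, d_model)     # [batch, text_tokens, d_model] from the text encoder

# Queries come from text, keys/values from audio: each text token gathers
# information from the audio frames it attends to.
fused, attn_weights = cross_attn(query=text_feats, key=audio_feats, value=audio_feats)
print(fused.shape)          # torch.Size([2, 20, 256])
print(attn_weights.shape)   # torch.Size([2, 20, 500])
```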
Applications
1. Automatic Speech Recognition (ASR)
Audio → encoder → cross-attend with text decoder
Text decoder generates transcription
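A minimal ASR-style sketch, assuming a precomputed audio encoder output ("memory"), a toy vocabulary, and a hypothetical start-of-sequence id; it shows greedy decoding where each step cross-attends to the audio, not a full model.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 256
embed = nn.Embedding(vocab_size, d_model)
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
to_vocab = nn.Linear(d_model, vocab_size)

audio_memory = torch.randn(1, 500, d_model)   # audio encoder output (assumed precomputed)
tokens = torch.tensor([[1]])                  # start-of-sequence id (assumed)

# Greedy decoding: at each step the text queries cross-attend to the audio memory.
for _ in range(10):
    tgt = embed(tokens)
    tgt_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
    out = decoder(tgt, audio_memory, tgt_mask=tgt_mask)   # self-attn over text + cross-attn to audio
    next_id = to_vocab(out[:, -1]).argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_id], dim=1)
```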
2. Audio Captioning
Audio → encoder → cross-attention → text decoder
Generate natural language description of audio
3. Speech-to-Text Translation
Audio in language A → cross-attend → text in language B
Challenges
- Variable length: the audio frame sequence is typically much longer than the corresponding text for the same content (see the sketch after this list)
- Alignment: the temporal alignment between words and audio frames is not given explicitly
- Feature extraction: performance depends heavily on good audio preprocessing (resampling, normalization, spectrogram parameters)
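One common mitigation for the length mismatch is to downsample the audio frame sequence with strided 1D convolutions before attention; the sketch below assumes 80-dim mel features and a 4x reduction, both illustrative choices.

```python
import torch
import torch.nn as nn

# Two stride-2 convolutions reduce the frame rate by 4x before the encoder/attention.
subsample = nn.Sequential(
    nn.Conv1d(80, 256, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)

mel = torch.randn(1, 80, 1000)        # [batch, F_mel, time_frames]
out = subsample(mel)                  # [1, 256, 250]: 4x fewer frames to attend over
```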