48. Audio-Text Attention

Introduction

Audio-text attention connects audio signals (speech, music, sound effects) with text. It is central to automatic speech recognition (ASR), audio captioning, and speech synthesis. Audio differs from text and images in that it is a continuous signal over time, so it needs its own encoding into a sequence of feature frames before attention can be applied.

Audio Representation

1. Spectrogram Features

Raw audio → STFT (Short-Time Fourier Transform) → spectrogram

Shape: [time_frames, frequency_bins]

Treated as a "sequence" where each time frame is a token
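
The pipeline above can be sketched in NumPy. This is a minimal illustration rather than a production front end: the function name `stft_spectrogram`, the 400-sample window, the 160-sample hop, and the 16 kHz sample rate are all assumptions chosen for the example.

```python
import numpy as np

def stft_spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram of shape [time_frames, frequency_bins]."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequencies: frame_len // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 16000                          # assumed sample rate in Hz
t = np.arange(sr) / sr              # one second of audio
tone = np.sin(2 * np.pi * 440 * t)  # 440 Hz test tone
spec = stft_spectrogram(tone)
print(spec.shape)                   # (98, 201)
```

Each of the 98 rows is one time frame, which is what a downstream attention layer would treat as a token.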

2. Mel-Spectrogram

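A more perceptually motivated representation: frequency bins are spaced on the mel scale, which approximates how humans perceive pitch
Shape: [T, F_mel], where F_mel is typically 80 or 128 bins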
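
As a rough sketch of how the [T, F_mel] shape arises, the example below builds triangular mel filters and applies them to a power spectrogram. The helper names, the random stand-in spectrogram, and the parameter values are assumptions for illustration. The Hz-to-mel formula used here is one common variant, and libraries differ in their choice of formula and normalization.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=400, sr=16000):
    """Triangular filters, shape [n_mels, n_fft // 2 + 1]."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    # n_mels + 2 edge points, equally spaced on the mel scale.
    edges = mel_to_hz(
        np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    )
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

rng = np.random.default_rng(0)
power_spec = rng.random((98, 201))   # stand-in for |STFT|^2, shape [T, F]
fb = mel_filterbank()
mel_spec = power_spec @ fb.T         # [T, F_mel]
log_mel = np.log(mel_spec + 1e-6)    # log compression, small offset avoids log(0)
print(log_mel.shape)                 # (98, 80)
```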

3. MFCC Features

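Mel-frequency cepstral coefficients (MFCCs)
Compact representation of audio: a small number of coefficients per frame (13 is a common choice) instead of 80 or 128 mel bins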
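
One common way to obtain MFCCs is to apply a discrete cosine transform (DCT-II) to each frame of a log mel-spectrogram and keep only the first few coefficients. The sketch below writes the DCT as an explicit matrix so that it depends on NumPy only. The function names, the random input, and `n_mfcc=13` are illustrative assumptions.

```python
import numpy as np

def dct_ii_matrix(n_out, n_in):
    """First n_out rows of the orthonormal DCT-II basis, shape [n_out, n_in]."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis *= np.sqrt(2.0 / n_in)
    basis[0] /= np.sqrt(2.0)   # rescale the constant row for orthonormality
    return basis

def mfcc_from_log_mel(log_mel, n_mfcc=13):
    """log_mel: [T, F_mel] -> MFCCs: [T, n_mfcc]."""
    return log_mel @ dct_ii_matrix(n_mfcc, log_mel.shape[1]).T

rng = np.random.default_rng(0)
log_mel = rng.standard_normal((98, 80))  # stand-in for a real log mel-spectrogram
mfcc = mfcc_from_log_mel(log_mel)
print(mfcc.shape)                        # (98, 13)
```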

Audio-Text Attention Architecture

Audio encoder: 1D CNN / Transformer → audio features
Text encoder: Standard transformer → text features

Cross-attention: text queries attend to audio features
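
The cross-attention step can be illustrated with a single-head sketch in NumPy: queries come from the text features, keys and values from the audio features. The dimensions, the random projection matrices, and the function names are assumptions made for the example. A real model would learn the projections and typically use several heads.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, audio_feats, W_q, W_k, W_v):
    Q = text_feats @ W_q     # [T_text, d]   queries from text
    K = audio_feats @ W_k    # [T_audio, d]  keys from audio
    V = audio_feats @ W_v    # [T_audio, d]  values from audio
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # [T_text, T_audio]
    weights = softmax(scores, axis=-1)        # each text position sums to 1 over audio frames
    return weights @ V, weights

rng = np.random.default_rng(0)
T_text, T_audio, d_text, d_audio, d = 5, 98, 32, 64, 16
text_feats = rng.standard_normal((T_text, d_text))
audio_feats = rng.standard_normal((T_audio, d_audio))
W_q = rng.standard_normal((d_text, d))
W_k = rng.standard_normal((d_audio, d))
W_v = rng.standard_normal((d_audio, d))

out, weights = cross_attention(text_feats, audio_feats, W_q, W_k, W_v)
print(out.shape, weights.shape)   # (5, 16) (5, 98)
```

Each row of `weights` shows how strongly one text position attends to each audio frame, which is a soft alignment between the two modalities.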

Applications

1. Automatic Speech Recognition (ASR)

Audio → encoder → cross-attend with text decoder

Text decoder generates transcription

2. Audio Captioning

Audio → encoder → cross-attention → text decoder

Generate natural language description of audio

3. Speech-to-Text Translation

Audio in language A → cross-attend → text in language B

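Challenges

Variable length and alignment: an audio sequence usually has many more frames than its text has tokens, and there is no fixed frame-to-token correspondence, so the alignment has to be learned.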

Test Your Understanding

Question 1: Audio is converted to a sequence via:

  • A) Direct text conversion
  • B) Spectrogram / Mel-spectrogram representation
  • C) Image conversion
  • D) No conversion

Question 2: STFT stands for:

  • A) Simple Time Fourier Process
  • B) Short-Time Fourier Transform
  • C) Signal Time Frequency Processing
  • D) Smooth Time Frequency Presentation

Question 3: In ASR, audio encoder outputs are used as:

  • A) Queries
  • B) Keys and Values for text decoder attention
  • C) Final output
  • D) No attention

Question 4: Mel-spectrogram typically has how many frequency bins?

  • A) 3
  • B) 10
  • C) 80 or 128
  • D) 10000

Question 5: Audio-text attention enables:

  • A) Text-only processing
  • B) Audio understanding with text alignment
  • C) Image processing
  • D) No cross-modality

Question 6: One challenge with audio-text attention is:

  • A) Audio is shorter than text
  • B) Variable length and alignment issues
  • C) No challenge
  • D) Text is too long

Question 7: Audio spectrogram shape is:

  • A) [batch]
  • B) [time_frames, frequency_bins]
  • C) [words]
  • D) [sentences]

Question 8: Audio captioning generates:

  • A) Speech
  • B) Image
  • C) Natural language description of audio
  • D) Music notes