sima_utils.transformer.preproc.whisper_preproc

Attributes

SAMPLE_RATE

N_FFT

HOP_LENGTH

CHUNK_LENGTH

N_SAMPLES

N_FRAMES

stft_window

Functions

get_ffmpeg_file(→ pathlib.Path)

load_audio(→ numpy.ndarray)

Open an audio file and read it as a mono waveform, resampling as necessary.

pad_or_trim(→ numpy.ndarray)

Pad or trim the audio array to N_SAMPLES, as expected by the encoder.

mel_filters(→ numpy.ndarray)

Load the mel filterbank matrix for projecting an STFT into a Mel spectrogram.

log_mel_spectrogram(audio[, n_mels, ...])

Compute the log-Mel spectrogram of the audio waveform.

stft_numpy(→ numpy.ndarray)

preprocess_audio(→ numpy.ndarray)

load_and_preprocess_numpy(→ numpy.ndarray)

Module Contents

sima_utils.transformer.preproc.whisper_preproc.SAMPLE_RATE = 16000
sima_utils.transformer.preproc.whisper_preproc.N_FFT = 400
sima_utils.transformer.preproc.whisper_preproc.HOP_LENGTH = 160
sima_utils.transformer.preproc.whisper_preproc.CHUNK_LENGTH = 30
sima_utils.transformer.preproc.whisper_preproc.N_SAMPLES = 480000
sima_utils.transformer.preproc.whisper_preproc.N_FRAMES = 3000
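
The constants above are consistent with one another. The sketch below restates them and shows how the last two can be derived from the first four. The values are taken from the listing; the derivation formulas and the reading of CHUNK_LENGTH as a duration in seconds are assumptions based on the arithmetic, not confirmed by the module source.

```python
# Values copied from the listing above.
SAMPLE_RATE = 16000   # samples per second
N_FFT = 400           # STFT window size in samples (25 ms at 16 kHz)
HOP_LENGTH = 160      # STFT hop in samples (10 ms at 16 kHz)
CHUNK_LENGTH = 30     # assumed to be a duration in seconds

# Assumed derivations; they reproduce the listed values exactly.
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE   # samples in one chunk
N_FRAMES = N_SAMPLES // HOP_LENGTH       # spectrogram frames in one chunk

print(N_SAMPLES, N_FRAMES)
```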
sima_utils.transformer.preproc.whisper_preproc.get_ffmpeg_file() → pathlib.Path
sima_utils.transformer.preproc.whisper_preproc.load_audio(file: str, sr: int = SAMPLE_RATE) → numpy.ndarray

Open an audio file and read it as a mono waveform, resampling as necessary.

Parameters:
  • file – The audio file to open.

  • sr – The sample rate to resample the audio if necessary.

Returns:

A NumPy array containing the audio waveform, in float32 dtype.
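
The listing does not say how the float32 waveform is produced. One common approach, used by the upstream Whisper reference code, is to have ffmpeg decode the file to 16-bit signed PCM and then scale the samples to the range [-1, 1). The sketch below shows only that final conversion step; whether load_audio works this way is an assumption, and pcm16_to_float32 is a hypothetical helper, not part of this module.

```python
import numpy as np


def pcm16_to_float32(raw: bytes) -> np.ndarray:
    """Hypothetical helper: convert raw 16-bit signed PCM bytes to float32 in [-1, 1)."""
    samples = np.frombuffer(raw, dtype=np.int16)
    return samples.astype(np.float32) / 32768.0


# Three synthetic samples standing in for decoder output.
raw = np.array([0, 16384, -32768], dtype=np.int16).tobytes()
waveform = pcm16_to_float32(raw)
print(waveform.dtype, waveform)
```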

sima_utils.transformer.preproc.whisper_preproc.pad_or_trim(array: numpy.ndarray, length: int = N_SAMPLES, *, axis: int = -1) → numpy.ndarray

Pad or trim the audio array along axis to length samples (N_SAMPLES by default), as expected by the encoder.
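
A minimal sketch of the pad-or-trim behavior with the same signature shape, assuming that padding is done with trailing zeros and trimming keeps the leading samples (the listing does not state either detail). pad_or_trim_sketch is a hypothetical name used for illustration only.

```python
import numpy as np


def pad_or_trim_sketch(array: np.ndarray, length: int = 480000, *, axis: int = -1) -> np.ndarray:
    """Hypothetical stand-in: force array.shape[axis] to equal length."""
    current = array.shape[axis]
    if current > length:
        # Keep the first `length` entries along the chosen axis.
        array = np.take(array, np.arange(length), axis=axis)
    elif current < length:
        # Assumed: pad with zeros at the end of the chosen axis only.
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - current)
        array = np.pad(array, pad_widths)
    return array


trimmed = pad_or_trim_sketch(np.arange(10, dtype=np.float32), length=4)
padded = pad_or_trim_sketch(np.ones(3, dtype=np.float32), length=5)
print(trimmed, padded)
```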

sima_utils.transformer.preproc.whisper_preproc.mel_filters(n_mels: int, hf_preprocessor_config_json_file: pathlib.Path | None = None) → numpy.ndarray

Load the mel filterbank matrix for projecting an STFT into a Mel spectrogram. Loading precomputed filters removes the librosa dependency; the filters were saved using:

    np.savez_compressed(
        "mel_filters.npz",
        mel_80=librosa.filters.mel(sr=16000, n_fft=400, n_mels=80),
        mel_128=librosa.filters.mel(sr=16000, n_fft=400, n_mels=128),
    )
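
The read side of that archive can be sketched as follows. The key names mel_80 and mel_128 come from the save call above; the helper load_mel_filters_sketch is hypothetical, and the zero-filled matrices are placeholders used so the example runs without librosa. The shape (n_mels, 201) matches what librosa.filters.mel returns for n_fft=400, that is 1 + n_fft // 2 frequency bins.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_mel_filters_sketch(npz_path: Path, n_mels: int) -> np.ndarray:
    """Hypothetical helper: read one filterbank from the archive by its key."""
    with np.load(npz_path) as archive:
        return archive[f"mel_{n_mels}"]


with tempfile.TemporaryDirectory() as tmp:
    npz_path = Path(tmp) / "mel_filters.npz"
    # Placeholder matrices standing in for the real librosa filterbanks.
    np.savez_compressed(
        npz_path,
        mel_80=np.zeros((80, 201), dtype=np.float32),
        mel_128=np.zeros((128, 201), dtype=np.float32),
    )
    filters = load_mel_filters_sketch(npz_path, 80)

print(filters.shape)
```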

sima_utils.transformer.preproc.whisper_preproc.stft_window
sima_utils.transformer.preproc.whisper_preproc.log_mel_spectrogram(audio: numpy.ndarray, n_mels: int = 80, hf_preprocessor_config_json_file: pathlib.Path | None = None, stft_style: str = 'scipy')

Compute the log-Mel spectrogram of the audio waveform.

Parameters:
  • audio – A NumPy array containing the audio waveform, sampled at 16 kHz.

  • n_mels – The number of Mel-frequency filters; only 80 is supported.

Returns:

A Tensor that contains the Mel spectrogram.
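
For orientation, the sketch below shows the log-Mel computation as done in the upstream Whisper reference code: power spectrum, Mel projection, log10 with a floor, dynamic-range clamp to 8 below the maximum, then rescaling. That this module follows the same steps is an assumption; log_mel_from_stft_sketch is a hypothetical name, and the random STFT and filterbank are placeholders so the example is self-contained.

```python
import numpy as np


def log_mel_from_stft_sketch(stft: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """Hypothetical helper mirroring the upstream Whisper post-STFT steps."""
    # Drop the last frame and take the power spectrum.
    magnitudes = np.abs(stft[..., :-1]) ** 2
    # Project frequency bins onto Mel bands: (n_mels, n_freq) @ (n_freq, n_frames).
    mel_spec = filters @ magnitudes
    # Floor before the log to avoid log10(0).
    log_spec = np.log10(np.clip(mel_spec, 1e-10, None))
    # Clamp the dynamic range to 8 log10 units below the peak.
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    # Rescale as in the upstream code.
    return (log_spec + 4.0) / 4.0


rng = np.random.default_rng(0)
stft = rng.standard_normal((201, 3001)) + 1j * rng.standard_normal((201, 3001))
filters = rng.random((80, 201)).astype(np.float32)  # placeholder filterbank
log_mel = log_mel_from_stft_sketch(stft, filters)
print(log_mel.shape)
```

With 480000 input samples and a hop of 160, a centered STFT yields 3001 frames, so dropping the last one gives the N_FRAMES = 3000 columns listed above.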

sima_utils.transformer.preproc.whisper_preproc.stft_numpy(signal: numpy.ndarray, window_size: int = 400, hop_size: int = 160, pad_mode: str = 'reflect', window_type: str = 'hann', center: bool = True) → numpy.ndarray
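
stft_numpy has no description in this listing; its defaults match N_FFT and HOP_LENGTH. The sketch below is one way to write a centered, Hann-windowed STFT in plain NumPy with those defaults. It is not the module's implementation: stft_sketch is a hypothetical name, and the periodic Hann window and the (frequency, frame) output layout are assumptions.

```python
import numpy as np


def stft_sketch(
    signal: np.ndarray,
    window_size: int = 400,
    hop_size: int = 160,
    pad_mode: str = "reflect",
    center: bool = True,
) -> np.ndarray:
    """Hypothetical STFT: returns a complex array of shape (1 + window_size // 2, n_frames)."""
    if center:
        # Pad half a window on each side so frame k is centered on sample k * hop_size.
        signal = np.pad(signal, window_size // 2, mode=pad_mode)
    # Periodic Hann window (assumed); np.hanning alone would give the symmetric variant.
    window = np.hanning(window_size + 1)[:-1]
    # All length-window_size views of the signal, subsampled every hop_size samples.
    frames = np.lib.stride_tricks.sliding_window_view(signal, window_size)[::hop_size]
    # Real FFT of each windowed frame, then transpose to (frequency, frame).
    return np.fft.rfft(frames * window, axis=-1).T


# A 1 kHz tone at 16 kHz falls in bin 1000 * 400 / 16000 = 25.
sr = 16000
t = np.arange(1600) / sr
tone = np.sin(2 * np.pi * 1000 * t).astype(np.float32)
spec = stft_sketch(tone)
print(spec.shape, int(np.argmax(np.abs(spec[:, 5]))))
```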
sima_utils.transformer.preproc.whisper_preproc.preprocess_audio(audio: numpy.ndarray, hf_preprocessor_config_json_file: pathlib.Path | None) → numpy.ndarray
sima_utils.transformer.preproc.whisper_preproc.load_and_preprocess_numpy(audio_file: pathlib.Path, hf_preprocessor_config_json_file: pathlib.Path | None, out_dtype: type | str = np.float32) → numpy.ndarray