sima_utils.transformer.preproc.whisper_preproc

Attributes

`SAMPLE_RATE`
`N_FFT`
`HOP_LENGTH`
`CHUNK_LENGTH`
`N_SAMPLES`
`N_FRAMES`
`stft_window`

Functions

`get_ffmpeg_file`(→ pathlib.Path)
`load_audio`(→ numpy.ndarray)	Open an audio file and read it as a mono waveform, resampling as necessary.
`pad_or_trim`(→ numpy.ndarray)	Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
`mel_filters`(→ numpy.ndarray)	Load the mel filterbank matrix for projecting an STFT into a Mel spectrogram.
`log_mel_spectrogram`(audio[, n_mels, ...])	Compute the log-Mel spectrogram of the input audio.
`stft_numpy`(→ numpy.ndarray)
`preprocess_audio`(→ numpy.ndarray)
`load_and_preprocess_numpy`(→ numpy.ndarray)

Module Contents

sima_utils.transformer.preproc.whisper_preproc.SAMPLE_RATE = 16000

sima_utils.transformer.preproc.whisper_preproc.N_FFT = 400

sima_utils.transformer.preproc.whisper_preproc.HOP_LENGTH = 160

sima_utils.transformer.preproc.whisper_preproc.CHUNK_LENGTH = 30

sima_utils.transformer.preproc.whisper_preproc.N_SAMPLES = 480000

sima_utils.transformer.preproc.whisper_preproc.N_FRAMES = 3000
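
The constants above are related by simple arithmetic, which the following sketch checks. The values are copied from this page; the derivation formulas are an assumption based on how the names and numbers line up (30-second chunks at 16 kHz, one frame per hop) and are not stated by the module documentation itself.

```python
# Values copied from the module attributes listed above.
SAMPLE_RATE = 16000   # samples per second
N_FFT = 400           # STFT window size in samples
HOP_LENGTH = 160      # samples between successive frames
CHUNK_LENGTH = 30     # seconds per chunk

# Assumed derivations (they reproduce the documented values).
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE   # 480000 samples in a 30-second chunk
N_FRAMES = N_SAMPLES // HOP_LENGTH       # 3000 frames per chunk

# The same settings expressed in milliseconds.
window_ms = 1000 * N_FFT / SAMPLE_RATE     # 25.0 ms analysis window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE   # 10.0 ms frame step
```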

sima_utils.transformer.preproc.whisper_preproc.get_ffmpeg_file() → pathlib.Path

sima_utils.transformer.preproc.whisper_preproc.load_audio(file: str, sr: int = SAMPLE_RATE) → numpy.ndarray

Open an audio file and read it as a mono waveform, resampling as necessary.

Parameters:

file – The audio file to open.
sr – The target sample rate; the audio is resampled to this rate if necessary.

Returns: A NumPy array containing the audio waveform, in float32 dtype.

sima_utils.transformer.preproc.whisper_preproc.pad_or_trim(array: numpy.ndarray, length: int = N_SAMPLES, *, axis: int = -1) → numpy.ndarray

Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
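
A minimal NumPy sketch of the pad-or-trim behavior, for illustration only. The function name `pad_or_trim_sketch` is hypothetical, and zero-padding at the end of the axis is an assumption; this page does not say which padding value or side the module uses.

```python
import numpy as np

N_SAMPLES = 480000  # default length, as documented on this page


def pad_or_trim_sketch(array: np.ndarray, length: int = N_SAMPLES, *, axis: int = -1) -> np.ndarray:
    """Hypothetical stand-in: trim or zero-pad `array` along `axis` to `length`."""
    current = array.shape[axis]
    if current > length:
        # Keep only the first `length` entries along the chosen axis.
        array = np.take(array, np.arange(length), axis=axis)
    elif current < length:
        # Assumption: pad with zeros at the end of the chosen axis only.
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - current)
        array = np.pad(array, pad_widths)
    return array


short = pad_or_trim_sketch(np.ones(5, dtype=np.float32), length=8)      # padded
long = pad_or_trim_sketch(np.arange(10, dtype=np.float32), length=8)    # trimmed
```

Both results have exactly 8 entries along the last axis, so the encoder always sees a fixed-size input regardless of the clip's duration.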

sima_utils.transformer.preproc.whisper_preproc.mel_filters(n_mels: int, hf_preprocessor_config_json_file: pathlib.Path | None = None) → numpy.ndarray

Load the mel filterbank matrix for projecting an STFT into a Mel spectrogram. The filterbank is stored precomputed so that the module does not depend on librosa; it was saved using:

    np.savez_compressed(
        "mel_filters.npz",
        mel_80=librosa.filters.mel(sr=16000, n_fft=400, n_mels=80),
        mel_128=librosa.filters.mel(sr=16000, n_fft=400, n_mels=128),
    )
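
The loading side of that archive can be sketched as follows. This is not the module's implementation: `load_mel_filters_sketch` is a hypothetical helper, it ignores the `hf_preprocessor_config_json_file` option, and the archive written here holds zero-filled placeholders rather than real librosa filters. The only details taken from the snippet above are the file name and the `mel_80` / `mel_128` keys.

```python
import os
import tempfile

import numpy as np


def load_mel_filters_sketch(npz_path: str, n_mels: int) -> np.ndarray:
    """Hypothetical loader: read the `mel_<n_mels>` array from an .npz archive."""
    key = f"mel_{n_mels}"
    with np.load(npz_path) as archive:
        if key not in archive.files:
            raise KeyError(f"{key} not found; available keys: {archive.files}")
        return archive[key]


# Placeholder matrices with the shape a Mel filterbank has for n_fft=400:
# (n_mels, n_fft // 2 + 1) = (n_mels, 201). The zeros are NOT real filters.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mel_filters.npz")
    np.savez_compressed(path, mel_80=np.zeros((80, 201)), mel_128=np.zeros((128, 201)))
    filters_80 = load_mel_filters_sketch(path, 80)
    filters_128 = load_mel_filters_sketch(path, 128)
```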

sima_utils.transformer.preproc.whisper_preproc.stft_window

sima_utils.transformer.preproc.whisper_preproc.log_mel_spectrogram(audio: numpy.ndarray, n_mels: int = 80, hf_preprocessor_config_json_file: pathlib.Path | None = None, stft_style: str = 'scipy')

Compute the log-Mel spectrogram of the input audio.

Parameters:

audio – A NumPy array containing the audio waveform, sampled at 16 kHz.
n_mels – The number of Mel-frequency filters; only 80 is supported.

Returns:

A Tensor that contains the Mel spectrogram.
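
To make the "log-Mel" step concrete, here is a hedged NumPy sketch of turning a power spectrogram into a scaled log-Mel spectrogram. It assumes the clamping and rescaling used by the reference Whisper implementation (floor at 1e-10, 8-decade dynamic range, then `(x + 4) / 4`); this page does not confirm that the module does the same. The helper name and the random stand-in inputs are hypothetical.

```python
import numpy as np


def log_mel_from_power_sketch(power: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """Hypothetical helper: (n_freqs, n_frames) power -> (n_mels, n_frames) log-Mel."""
    mel = filters @ power                                   # project onto Mel bands
    log_spec = np.log10(np.maximum(mel, 1e-10))             # floor avoids log10(0)
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)   # cap range at 8 decades
    return (log_spec + 4.0) / 4.0                           # shift and scale


rng = np.random.default_rng(0)
power = rng.random((201, 3000))    # stand-in for |STFT|^2 of one 30-second chunk
filters = rng.random((80, 201))    # stand-in for a real 80-band Mel filterbank
log_mel = log_mel_from_power_sketch(power, filters)
```

With 80 Mel bands and N_FRAMES = 3000, the result has shape (80, 3000), and the range cap guarantees the output spans at most 2.0 units after scaling.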

sima_utils.transformer.preproc.whisper_preproc.stft_numpy(signal: numpy.ndarray, window_size: int = 400, hop_size: int = 160, pad_mode: str = 'reflect', window_type: str = 'hann', center: bool = True) → numpy.ndarray
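
The parameters in this signature map onto a standard short-time Fourier transform. The sketch below is an illustrative NumPy version, not the module's code: `stft_sketch` is a hypothetical name, only the Hann window is handled, and the periodic (rather than symmetric) Hann definition and the `(n_freqs, n_frames)` output layout are assumptions.

```python
import numpy as np


def stft_sketch(signal: np.ndarray, window_size: int = 400, hop_size: int = 160,
                pad_mode: str = "reflect", center: bool = True) -> np.ndarray:
    """Hypothetical STFT: returns a complex array of shape (n_freqs, n_frames)."""
    if center:
        # Pad half a window on each side so frame t is centered at t * hop_size.
        signal = np.pad(signal, window_size // 2, mode=pad_mode)
    # Assumption: periodic Hann window.
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(window_size) / window_size)
    n_frames = 1 + (len(signal) - window_size) // hop_size
    # Row t of `idx` holds the sample indices belonging to frame t.
    idx = np.arange(window_size)[None, :] + hop_size * np.arange(n_frames)[:, None]
    frames = signal[idx] * window
    return np.fft.rfft(frames, axis=-1).T


# One second of a 400 Hz tone at 16 kHz; each frequency bin is 16000 / 400 = 40 Hz wide.
t = np.arange(16000) / 16000
spec = stft_sketch(np.sin(2 * np.pi * 400 * t))
peak_bin = int(np.argmax(np.abs(spec[:, 50])))  # strongest bin in a middle frame
```

With `center=True`, a 480000-sample input yields 3001 frames under this scheme, one more than N_FRAMES, so a caller that needs exactly 3000 frames would have to drop one.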

sima_utils.transformer.preproc.whisper_preproc.preprocess_audio(audio: numpy.ndarray, hf_preprocessor_config_json_file: pathlib.Path | None) → numpy.ndarray

sima_utils.transformer.preproc.whisper_preproc.load_and_preprocess_numpy(audio_file: pathlib.Path, hf_preprocessor_config_json_file: pathlib.Path | None, out_dtype: type | str = np.float32) → numpy.ndarray