sima_utils.transformer.tokenizer.whisper_tokenizer

Attributes

LANGUAGES

TO_LANGUAGE_CODE

Classes

Tokenizer

A thin wrapper around tiktoken providing quick access to special tokens

Functions

get_byte_decoder()

Returns a list of utf-8 bytes and a mapping to unicode strings, avoiding whitespace/control characters the bpe code barfs on.

get_encoding([name, num_languages, hf_tokenizer_json_file])

get_tokenizer(multilingual, *[, num_languages, language, task, hf_tokenizer_json_file]) → Tokenizer

Module Contents

sima_utils.transformer.tokenizer.whisper_tokenizer.LANGUAGES
sima_utils.transformer.tokenizer.whisper_tokenizer.TO_LANGUAGE_CODE
class sima_utils.transformer.tokenizer.whisper_tokenizer.Tokenizer

A thin wrapper around tiktoken providing quick access to special tokens

encoding: tiktoken.Encoding
num_languages: int
language: str | None = None
task: str | None = None
sot_sequence: tuple[int] = ()
special_tokens: dict[str, int]
encode(text, **kwargs)
decode(token_ids: list[int], **kwargs) → str
decode_with_timestamps(token_ids: list[int], **kwargs) → str

Timestamp tokens are above other special tokens' id range and are ignored by decode(). This method decodes the given tokens with timestamp tokens rendered inline, e.g. "<|1.08|>" (see the usage sketch after the class listing).

decode_without_special_tokens(token_ids: list[int], **kwargs) → str
property eot: int
property transcribe: int
property translate: int
property sot: int
property sot_lm: int
property sot_prev: int
property no_speech: int
property no_timestamps: int
property timestamp_begin: int
property language_token: int

Returns the token id corresponding to the value of the language field

to_language_token(language: str) → int
property all_language_tokens: tuple[int]
property all_language_codes: tuple[str]
property sot_sequence_including_notimestamps: tuple[int]
property non_speech_tokens: tuple[int]

Returns the list of tokens to suppress in order to avoid any speaker tags or non-speech annotations, to prevent sampling texts that are not actually spoken in the audio, e.g.

  • ♪♪♪

  • ( SPEAKING FOREIGN LANGUAGE )

  • [DAVID] Hey there,

keeping basic punctuation like commas, periods, question marks, exclamation points, etc.

split_to_word_tokens(tokens: list[int])
split_tokens_on_unicode(tokens: list[int])
split_tokens_on_spaces(tokens: list[int])
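
A Tokenizer is normally obtained via get_tokenizer() below rather than constructed directly. The following is a minimal usage sketch based only on the signatures documented on this page; the concrete token id values, and the idea of prepending timestamp_begin to show timestamp decoding, are illustrative assumptions rather than behavior guaranteed by this module.

from sima_utils.transformer.tokenizer.whisper_tokenizer import get_tokenizer

# Multilingual tokenizer configured for English transcription.
tokenizer = get_tokenizer(multilingual=True, language="en", task="transcribe")

# encode()/decode() round-trip ordinary text through the tiktoken encoding.
token_ids = tokenizer.encode("Hello world")
print(tokenizer.decode(token_ids))

# Special tokens are exposed as int properties.
print(tokenizer.sot, tokenizer.eot, tokenizer.language_token)

# decode() drops timestamp tokens (they sit above the special-token id range);
# decode_with_timestamps() renders them inline, e.g. "<|0.00|>Hello world".
print(tokenizer.decode_with_timestamps([tokenizer.timestamp_begin] + token_ids))

# Token ids to suppress during sampling to avoid speaker tags and
# non-speech annotations such as "♪♪♪" or "[DAVID]".
suppress = set(tokenizer.non_speech_tokens)
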
sima_utils.transformer.tokenizer.whisper_tokenizer.get_byte_decoder()

Returns a list of utf-8 bytes and a mapping to unicode strings. We specifically avoid mapping to whitespace/control characters the bpe code barfs on.

The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab if you want to avoid UNKs. When you’re at something like a 10B token dataset you end up needing around 5K for decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
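
For illustration, the sketch below shows the standard GPT-2-style lookup-table construction that this docstring describes. It is a generic rendition of the technique, not necessarily what get_byte_decoder() returns byte-for-byte; as the name suggests, the function may well return the inverse (unicode-to-byte) mapping.

def bytes_to_unicode_table() -> dict[int, str]:
    # Bytes that already map to printable characters the bpe code can handle.
    safe = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = list(safe)
    code_points = list(safe)
    # Remap the remaining bytes (whitespace/control characters) onto unused
    # code points above 255 so that every byte has a printable stand-in.
    n = 0
    for b in range(256):
        if b not in safe:
            byte_values.append(b)
            code_points.append(256 + n)
            n += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}

byte_to_unicode = bytes_to_unicode_table()
# The "byte decoder" is the inverse table: printable stand-in -> raw byte.
unicode_to_byte = {ch: b for b, ch in byte_to_unicode.items()}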

sima_utils.transformer.tokenizer.whisper_tokenizer.get_encoding(name: str = 'gpt2', num_languages: int = 99, hf_tokenizer_json_file: pathlib.Path | None = None)
sima_utils.transformer.tokenizer.whisper_tokenizer.get_tokenizer(multilingual: bool, *, num_languages: int = 99, language: str | None = None, task: str | None = None, hf_tokenizer_json_file: pathlib.Path | None = None) → Tokenizer
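
A hedged sketch of how the two factory functions fit together, based on the signatures above. The tokenizer.json path is a hypothetical placeholder, and the assumption that get_encoding() can be driven by either the named vocabulary or a Hugging Face tokenizer.json export alone is inferred from the parameters rather than stated on this page.

from pathlib import Path

from sima_utils.transformer.tokenizer.whisper_tokenizer import (
    get_encoding,
    get_tokenizer,
)

# English-only model: no language/task tokens in the start-of-transcript sequence.
en_tokenizer = get_tokenizer(multilingual=False)

# Multilingual model set up to translate German audio into English text.
de_tokenizer = get_tokenizer(multilingual=True, language="de", task="translate")
print(de_tokenizer.sot_sequence)

# get_encoding() builds the underlying encoding, presumably either from the
# named vocabulary or from a Hugging Face tokenizer.json export.
encoding = get_encoding(name="gpt2", num_languages=99)
custom = get_encoding(hf_tokenizer_json_file=Path("tokenizer.json"))  # hypothetical path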