sima_utils.transformer.tokenizer.whisper_tokenizer

Attributes

LANGUAGES

TO_LANGUAGE_CODE

Classes

Tokenizer

A thin wrapper around tiktoken providing quick access to special tokens

Functions

get_byte_decoder()

Returns a list of utf-8 bytes and a mapping to unicode strings, avoiding whitespace/control characters the bpe code barfs on.

get_encoding([name, num_languages, hf_tokenizer_json_file])

get_tokenizer(multilingual, *[, num_languages, language, task, hf_tokenizer_json_file]) → Tokenizer

Module Contents

sima_utils.transformer.tokenizer.whisper_tokenizer.LANGUAGES
sima_utils.transformer.tokenizer.whisper_tokenizer.TO_LANGUAGE_CODE
class sima_utils.transformer.tokenizer.whisper_tokenizer.Tokenizer

A thin wrapper around tiktoken providing quick access to special tokens

encoding: tiktoken.Encoding
num_languages: int
language: str | None = None
task: str | None = None
sot_sequence: tuple[int] = ()
special_tokens: dict[str, int]
encode(text, **kwargs)
decode(token_ids: list[int], **kwargs) → str
decode_with_timestamps(token_ids: list[int], **kwargs) → str

Timestamp tokens are above other special tokens' id range and are ignored by decode(). This method decodes the given tokens with timestamp tokens rendered inline, e.g. "<|1.08|>" (see the usage sketch after the class listing).

decode_without_special_tokens(token_ids: list[int], **kwargs) → str
property eot: int
property transcribe: int
property translate: int
property sot: int
property sot_lm: int
property sot_prev: int
property no_speech: int
property no_timestamps: int
property timestamp_begin: int
property language_token: int

Returns the token id corresponding to the value of the language field

to_language_token(language: str) → int
property all_language_tokens: tuple[int]
property all_language_codes: tuple[str]
property sot_sequence_including_notimestamps: tuple[int]
property non_speech_tokens: tuple[int]

Returns the list of tokens to suppress in order to avoid any speaker tags or non-speech annotations, to prevent sampling texts that are not actually spoken in the audio, e.g.

  • ♪♪♪

  • ( SPEAKING FOREIGN LANGUAGE )

  • [DAVID] Hey there,

keeping basic punctuation like commas, periods, question marks, exclamation points, etc.

split_to_word_tokens(tokens: list[int])
split_tokens_on_unicode(tokens: list[int])
split_tokens_on_spaces(tokens: list[int])
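
A Tokenizer is normally obtained via get_tokenizer() below rather than constructed directly. The following is a minimal usage sketch based only on the signatures documented on this page; the concrete token id values, and the idea of prepending timestamp_begin to show timestamp decoding, are illustrative assumptions rather than behavior guaranteed by this module.

from sima_utils.transformer.tokenizer.whisper_tokenizer import get_tokenizer

# Multilingual tokenizer configured for English transcription.
tokenizer = get_tokenizer(multilingual=True, language="en", task="transcribe")

# encode()/decode() round-trip ordinary text through the tiktoken encoding.
token_ids = tokenizer.encode("Hello world")
print(tokenizer.decode(token_ids))

# Special tokens are exposed as int properties.
print(tokenizer.sot, tokenizer.eot, tokenizer.language_token)

# decode() drops timestamp tokens (they sit above the special-token id range);
# decode_with_timestamps() renders them inline, e.g. "<|0.00|>Hello world".
print(tokenizer.decode_with_timestamps([tokenizer.timestamp_begin] + token_ids))

# Token ids to suppress during sampling to avoid speaker tags and
# non-speech annotations such as "♪♪♪" or "[DAVID]".
suppress = set(tokenizer.non_speech_tokens)
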
sima_utils.transformer.tokenizer.whisper_tokenizer.get_byte_decoder()

Returns a list of utf-8 bytes and a mapping to unicode strings. We specifically avoid mapping to whitespace/control characters the bpe code barfs on.

The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab if you want to avoid UNKs. When you’re at something like a 10B token dataset you end up needing around 5K for decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
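
For illustration, the sketch below shows the standard GPT-2-style lookup-table construction that this docstring describes. It is a generic rendition of the technique, not necessarily what get_byte_decoder() returns byte-for-byte; as the name suggests, the function may well return the inverse (unicode-to-byte) mapping.

def bytes_to_unicode_table() -> dict[int, str]:
    # Bytes that already map to printable characters the bpe code can handle.
    safe = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = list(safe)
    code_points = list(safe)
    # Remap the remaining bytes (whitespace/control characters) onto unused
    # code points above 255 so that every byte has a printable stand-in.
    n = 0
    for b in range(256):
        if b not in safe:
            byte_values.append(b)
            code_points.append(256 + n)
            n += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}

byte_to_unicode = bytes_to_unicode_table()
# The "byte decoder" is the inverse table: printable stand-in -> raw byte.
unicode_to_byte = {ch: b for b, ch in byte_to_unicode.items()}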

sima_utils.transformer.tokenizer.whisper_tokenizer.get_encoding(name: str = 'gpt2', num_languages: int = 99, hf_tokenizer_json_file: pathlib.Path | None = None)
sima_utils.transformer.tokenizer.whisper_tokenizer.get_tokenizer(multilingual: bool, *, num_languages: int = 99, language: str | None = None, task: str | None = None, hf_tokenizer_json_file: pathlib.Path | None = None) → Tokenizer
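
A hedged sketch of how the two factory functions fit together, based on the signatures above. The tokenizer.json path is a hypothetical placeholder, and the assumption that get_encoding() can be driven by either the named vocabulary or a Hugging Face tokenizer.json export alone is inferred from the parameters rather than stated on this page.

from pathlib import Path

from sima_utils.transformer.tokenizer.whisper_tokenizer import (
    get_encoding,
    get_tokenizer,
)

# English-only model: no language/task tokens in the start-of-transcript sequence.
en_tokenizer = get_tokenizer(multilingual=False)

# Multilingual model set up to translate German audio into English text.
de_tokenizer = get_tokenizer(multilingual=True, language="de", task="translate")
print(de_tokenizer.sot_sequence)

# get_encoding() builds the underlying encoding, presumably either from the
# named vocabulary or from a Hugging Face tokenizer.json export.
encoding = get_encoding(name="gpt2", num_languages=99)
custom = get_encoding(hf_tokenizer_json_file=Path("tokenizer.json"))  # hypothetical path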