sima_utils.transformer.tokenizer.whisper_tokenizer
Attributes

- LANGUAGES
- TO_LANGUAGE_CODE

Classes

- Tokenizer: A thin wrapper around tiktoken providing quick access to special tokens.

Functions

- get_byte_decoder: Returns a list of utf-8 bytes and a mapping to unicode strings, specifically avoiding whitespace/control characters that the BPE code barfs on.
- get_encoding
Module Contents
- sima_utils.transformer.tokenizer.whisper_tokenizer.LANGUAGES
- sima_utils.transformer.tokenizer.whisper_tokenizer.TO_LANGUAGE_CODE
- class sima_utils.transformer.tokenizer.whisper_tokenizer.Tokenizer
A thin wrapper around tiktoken providing quick access to special tokens. A usage sketch follows the class listing below.
- encoding: tiktoken.Encoding
- num_languages: int
- language: str | None = None
- task: str | None = None
- sot_sequence: tuple[int] = ()
- special_tokens: dict[str, int]
- encode(text, **kwargs)
- decode(token_ids: list[int], **kwargs) → str
- decode_with_timestamps(token_ids: list[int], **kwargs) → str
Timestamp tokens are above the other special tokens' id range and are ignored by decode(). This method decodes the given tokens with timestamp tokens annotated, e.g. "<|1.08|>".
- decode_without_special_tokens(token_ids: list[int], **kwargs) → str
- property eot: int
- property transcribe: int
- property translate: int
- property sot: int
- property sot_lm: int
- property sot_prev: int
- property no_speech: int
- property no_timestamps: int
- property timestamp_begin: int
- property language_token: int
Returns the token id corresponding to the value of the language field.
- to_language_token(language: str) → int
- property all_language_tokens: tuple[int]
- property all_language_codes: tuple[str]
- property sot_sequence_including_notimestamps: tuple[int]
- property non_speech_tokens: tuple[int]
Returns the list of tokens to suppress in order to avoid any speaker tags or non-speech annotations, to prevent sampling texts that are not actually spoken in the audio, e.g.

- ♪♪♪
- ( SPEAKING FOREIGN LANGUAGE )
- [DAVID] Hey there,

keeping basic punctuation like commas, periods, question marks, exclamation points, etc.
- split_to_word_tokens(tokens: list[int])
- split_tokens_on_unicode(tokens: list[int])
- split_tokens_on_spaces(tokens: list[int])
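A minimal usage sketch for Tokenizer. It assumes the class is constructed directly from the fields documented above (as in openai-whisper's dataclass-style Tokenizer) and that get_encoding's defaults produce a compatible tiktoken.Encoding; the sample text and printed values are illustrative only.

```python
from sima_utils.transformer.tokenizer.whisper_tokenizer import (
    Tokenizer,
    get_encoding,
)

# Assumption: Tokenizer is constructed directly from the fields listed
# above, mirroring openai-whisper's dataclass-style Tokenizer.
tokenizer = Tokenizer(
    encoding=get_encoding(),  # defaults: name='gpt2', num_languages=99
    num_languages=99,
    language="en",
    task="transcribe",
)

ids = tokenizer.encode("Hello world.")
print(tokenizer.decode(ids))         # round-trips the text
print(tokenizer.sot, tokenizer.eot)  # special-token ids
print(tokenizer.language_token)      # id of the <|en|> language token

# decode() ignores timestamp tokens (they sit above the special-token id
# range); decode_with_timestamps() annotates any that are present, e.g.
# "<|1.08|>" (none in this plain-text example).
print(tokenizer.decode_with_timestamps(ids))

# Token ids worth suppressing during sampling to avoid speaker tags and
# non-speech annotations.
print(tokenizer.non_speech_tokens[:10])
```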
- sima_utils.transformer.tokenizer.whisper_tokenizer.get_byte_decoder()
Returns a list of utf-8 bytes and a mapping to unicode strings. We specifically avoid mapping to whitespace/control characters that the BPE code barfs on.
The reversible BPE codes work on unicode strings. This means you need a large number of unicode characters in your vocab if you want to avoid UNKs. When you're at something like a 10B-token dataset you end up needing around 5K for decent coverage. This is a significant percentage of your normal, say, 32K BPE vocab. To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
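To make the idea concrete, here is the standard GPT-2 byte-to-unicode table this docstring describes, with the decoder as its inverse. This is an illustrative reconstruction of the well-known technique, not necessarily the module's exact implementation.

```python
def bytes_to_unicode() -> dict[int, str]:
    # Printable bytes map to themselves; every remaining byte is shifted
    # into unused code points starting at 256, so no byte ends up mapped
    # to a whitespace/control character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# A byte decoder is just the inverse table: unicode character -> byte.
byte_decoder = {ch: b for b, ch in bytes_to_unicode().items()}
```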
- sima_utils.transformer.tokenizer.whisper_tokenizer.get_encoding(name: str = 'gpt2', num_languages: int = 99, hf_tokenizer_json_file: pathlib.Path | None = None)
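A call sketch based only on the signature above; the tokenizer.json path is hypothetical.

```python
from pathlib import Path

from sima_utils.transformer.tokenizer.whisper_tokenizer import get_encoding

# Default GPT-2 vocabulary with special tokens for 99 languages.
enc = get_encoding()

# Or build the encoding from a Hugging Face tokenizer.json export
# (hypothetical path).
enc_hf = get_encoding(hf_tokenizer_json_file=Path("tokenizer.json"))
```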