sima_utils.transformer.tokenizer.whisper_tokenizer
==================================================

.. py:module:: sima_utils.transformer.tokenizer.whisper_tokenizer


Attributes
----------

.. autoapisummary::

   sima_utils.transformer.tokenizer.whisper_tokenizer.LANGUAGES
   sima_utils.transformer.tokenizer.whisper_tokenizer.TO_LANGUAGE_CODE


Classes
-------

.. autoapisummary::

   sima_utils.transformer.tokenizer.whisper_tokenizer.Tokenizer


Functions
---------

.. autoapisummary::

   sima_utils.transformer.tokenizer.whisper_tokenizer.get_byte_decoder
   sima_utils.transformer.tokenizer.whisper_tokenizer.get_encoding
   sima_utils.transformer.tokenizer.whisper_tokenizer.get_tokenizer


Module Contents
---------------

.. py:data:: LANGUAGES

.. py:data:: TO_LANGUAGE_CODE

.. py:class:: Tokenizer

   A thin wrapper around `tiktoken` providing quick access to special tokens.

   .. py:attribute:: encoding
      :type: tiktoken.Encoding

   .. py:attribute:: num_languages
      :type: int

   .. py:attribute:: language
      :type: str | None
      :value: None

   .. py:attribute:: task
      :type: str | None
      :value: None

   .. py:attribute:: sot_sequence
      :type: tuple[int]
      :value: ()

   .. py:attribute:: special_tokens
      :type: dict[str, int]

   .. py:method:: encode(text, **kwargs)

   .. py:method:: decode(token_ids: list[int], **kwargs) -> str

   .. py:method:: decode_with_timestamps(token_ids: list[int], **kwargs) -> str

      Timestamp tokens sit above the id range of the other special tokens and are ignored by `decode()`. This method decodes the given tokens with timestamp tokens annotated, e.g. "<|1.08|>".

   .. py:method:: decode_without_special_tokens(token_ids: list[int], **kwargs) -> str

   .. py:property:: eot
      :type: int

   .. py:property:: transcribe
      :type: int

   .. py:property:: translate
      :type: int

   .. py:property:: sot
      :type: int

   .. py:property:: sot_lm
      :type: int

   .. py:property:: sot_prev
      :type: int

   .. py:property:: no_speech
      :type: int

   .. py:property:: no_timestamps
      :type: int

   .. py:property:: timestamp_begin
      :type: int

   .. py:property:: language_token
      :type: int

      Returns the token id corresponding to the value of the `language` field.

   .. py:method:: to_language_token(language: str) -> int

   .. py:property:: all_language_tokens
      :type: tuple[int]

   .. py:property:: all_language_codes
      :type: tuple[str]

   .. py:property:: sot_sequence_including_notimestamps
      :type: tuple[int]

   .. py:property:: non_speech_tokens
      :type: tuple[int]

      Returns the list of tokens to suppress in order to avoid any speaker tags or non-speech annotations and to prevent sampling text that is not actually spoken in the audio, e.g.

      - ♪♪♪
      - ( SPEAKING FOREIGN LANGUAGE )
      - [DAVID] Hey there,

      while keeping basic punctuation such as commas, periods, question marks, and exclamation points.

   .. py:method:: split_to_word_tokens(tokens: list[int])

   .. py:method:: split_tokens_on_unicode(tokens: list[int])

   .. py:method:: split_tokens_on_spaces(tokens: list[int])

.. py:function:: get_byte_decoder()

   Returns a mapping between UTF-8 bytes and unicode strings. It specifically avoids mapping to whitespace/control characters that the BPE code barfs on.

   The reversible BPE codes work on unicode strings. This means you need a large number of unicode characters in your vocab if you want to avoid UNKs. At something like a 10B-token dataset you end up needing around 5K characters for decent coverage, which is a significant percentage of a normal, say, 32K BPE vocab. To avoid that, we want lookup tables between UTF-8 bytes and unicode strings.

.. py:function:: get_encoding(name: str = 'gpt2', num_languages: int = 99, hf_tokenizer_json_file: pathlib.Path | None = None)

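The byte/unicode lookup tables that `get_byte_decoder` refers to follow the usual GPT-2 byte-level BPE convention. The sketch below illustrates that convention only; it is not this module's implementation, and the names `bytes_to_unicode_sketch`, `byte_encoder`, and `byte_decoder` are purely illustrative.

.. code-block:: python

   def bytes_to_unicode_sketch() -> dict[int, str]:
       """Map every byte 0..255 to a printable unicode character.

       Printable bytes map to themselves; whitespace/control bytes are
       shifted into an unused unicode range so the BPE merges never see
       characters they cannot handle.
       """
       bs = (list(range(ord("!"), ord("~") + 1))
             + list(range(ord("¡"), ord("¬") + 1))
             + list(range(ord("®"), ord("ÿ") + 1)))
       cs = bs[:]
       n = 0
       for b in range(256):
           if b not in bs:
               bs.append(b)
               cs.append(256 + n)  # remap problematic bytes past Latin-1
               n += 1
       return dict(zip(bs, map(chr, cs)))

   # A byte decoder is simply the inverse lookup: unicode character -> byte.
   byte_encoder = bytes_to_unicode_sketch()
   byte_decoder = {ch: b for b, ch in byte_encoder.items()}
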
.. py:function:: get_tokenizer(multilingual: bool, *, num_languages: int = 99, language: str | None = None, task: str | None = None, hf_tokenizer_json_file: pathlib.Path | None = None) -> Tokenizer

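A minimal usage sketch, assuming the default multilingual encoding can be built without supplying `hf_tokenizer_json_file`; the sample text and variable names are illustrative only.

.. code-block:: python

   from sima_utils.transformer.tokenizer.whisper_tokenizer import get_tokenizer

   # Build a multilingual tokenizer configured for English transcription.
   tokenizer = get_tokenizer(True, language="en", task="transcribe")

   # `sot_sequence` holds the start-of-transcript special tokens implied by
   # the chosen language and task; it is typically used as the decoder prompt.
   prompt = list(tokenizer.sot_sequence)

   # Round-trip ordinary text.
   token_ids = tokenizer.encode("hello world")
   text = tokenizer.decode(token_ids)

   # Token sequences containing timestamp tokens (ids at or above
   # `tokenizer.timestamp_begin`) can be rendered with annotations such as
   # "<|1.08|>" via decode_with_timestamps().
   annotated = tokenizer.decode_with_timestamps(token_ids)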