sima_utils.transformer.tokenizer.whisper_tokenizer
==================================================

.. py:module:: sima_utils.transformer.tokenizer.whisper_tokenizer


Attributes
----------

.. autoapisummary::

   sima_utils.transformer.tokenizer.whisper_tokenizer.LANGUAGES
   sima_utils.transformer.tokenizer.whisper_tokenizer.TO_LANGUAGE_CODE


Classes
-------

.. autoapisummary::

   sima_utils.transformer.tokenizer.whisper_tokenizer.Tokenizer


Functions
---------

.. autoapisummary::

   sima_utils.transformer.tokenizer.whisper_tokenizer.get_byte_decoder
   sima_utils.transformer.tokenizer.whisper_tokenizer.get_encoding
   sima_utils.transformer.tokenizer.whisper_tokenizer.get_tokenizer


Module Contents
---------------

.. py:data:: LANGUAGES

.. py:data:: TO_LANGUAGE_CODE

.. py:class:: Tokenizer

   A thin wrapper around `tiktoken` providing quick access to special tokens.

   .. py:attribute:: encoding
      :type: tiktoken.Encoding

   .. py:attribute:: num_languages
      :type: int

   .. py:attribute:: language
      :type: str | None
      :value: None

   .. py:attribute:: task
      :type: str | None
      :value: None

   .. py:attribute:: sot_sequence
      :type: tuple[int]
      :value: ()

   .. py:attribute:: special_tokens
      :type: dict[str, int]

   .. py:method:: encode(text, **kwargs)

   .. py:method:: decode(token_ids: list[int], **kwargs) -> str

   .. py:method:: decode_with_timestamps(token_ids: list[int], **kwargs) -> str

      Timestamp tokens sit above the id range of the other special tokens and are ignored by `decode()`. This method decodes the given tokens with timestamp tokens annotated, e.g. "<|1.08|>".

   .. py:method:: decode_without_special_tokens(token_ids: list[int], **kwargs) -> str

   .. py:property:: eot
      :type: int

   .. py:property:: transcribe
      :type: int

   .. py:property:: translate
      :type: int

   .. py:property:: sot
      :type: int

   .. py:property:: sot_lm
      :type: int

   .. py:property:: sot_prev
      :type: int

   .. py:property:: no_speech
      :type: int

   .. py:property:: no_timestamps
      :type: int

   .. py:property:: timestamp_begin
      :type: int

   .. py:property:: language_token
      :type: int

      Returns the token id corresponding to the value of the `language` field.

   .. py:method:: to_language_token(language: str) -> int

   .. py:property:: all_language_tokens
      :type: tuple[int]

   .. py:property:: all_language_codes
      :type: tuple[str]

   .. py:property:: sot_sequence_including_notimestamps
      :type: tuple[int]

   .. py:property:: non_speech_tokens
      :type: tuple[int]

      Returns the list of tokens to suppress in order to avoid any speaker tags or non-speech annotations and to prevent sampling text that is not actually spoken in the audio, e.g.

      - ♪♪♪
      - ( SPEAKING FOREIGN LANGUAGE )
      - [DAVID] Hey there,

      while keeping basic punctuation such as commas, periods, question marks, and exclamation points.

   .. py:method:: split_to_word_tokens(tokens: list[int])

   .. py:method:: split_tokens_on_unicode(tokens: list[int])

   .. py:method:: split_tokens_on_spaces(tokens: list[int])

.. py:function:: get_byte_decoder()

   Returns a mapping between UTF-8 bytes and unicode strings. It specifically avoids mapping to whitespace/control characters that the BPE code barfs on.

   The reversible BPE codes work on unicode strings. This means you need a large number of unicode characters in your vocab if you want to avoid UNKs. At something like a 10B-token dataset you end up needing around 5K characters for decent coverage, which is a significant percentage of a normal, say, 32K BPE vocab. To avoid that, we want lookup tables between UTF-8 bytes and unicode strings.

.. py:function:: get_encoding(name: str = 'gpt2', num_languages: int = 99, hf_tokenizer_json_file: pathlib.Path | None = None)

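The byte/unicode lookup tables that `get_byte_decoder` refers to follow the usual GPT-2 byte-level BPE convention. The sketch below illustrates that convention only; it is not this module's implementation, and the names `bytes_to_unicode_sketch`, `byte_encoder`, and `byte_decoder` are purely illustrative.

.. code-block:: python

   def bytes_to_unicode_sketch() -> dict[int, str]:
       """Map every byte 0..255 to a printable unicode character.

       Printable bytes map to themselves; whitespace/control bytes are
       shifted into an unused unicode range so the BPE merges never see
       characters they cannot handle.
       """
       bs = (list(range(ord("!"), ord("~") + 1))
             + list(range(ord("¡"), ord("¬") + 1))
             + list(range(ord("®"), ord("ÿ") + 1)))
       cs = bs[:]
       n = 0
       for b in range(256):
           if b not in bs:
               bs.append(b)
               cs.append(256 + n)  # remap problematic bytes past Latin-1
               n += 1
       return dict(zip(bs, map(chr, cs)))

   # A byte decoder is simply the inverse lookup: unicode character -> byte.
   byte_encoder = bytes_to_unicode_sketch()
   byte_decoder = {ch: b for b, ch in byte_encoder.items()}
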
.. py:function:: get_tokenizer(multilingual: bool, *, num_languages: int = 99, language: str | None = None, task: str | None = None, hf_tokenizer_json_file: pathlib.Path | None = None) -> Tokenizer

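A minimal usage sketch, assuming the default multilingual encoding can be built without supplying `hf_tokenizer_json_file`; the sample text and variable names are illustrative only.

.. code-block:: python

   from sima_utils.transformer.tokenizer.whisper_tokenizer import get_tokenizer

   # Build a multilingual tokenizer configured for English transcription.
   tokenizer = get_tokenizer(True, language="en", task="transcribe")

   # `sot_sequence` holds the start-of-transcript special tokens implied by
   # the chosen language and task; it is typically used as the decoder prompt.
   prompt = list(tokenizer.sot_sequence)

   # Round-trip ordinary text.
   token_ids = tokenizer.encode("hello world")
   text = tokenizer.decode(token_ids)

   # Token sequences containing timestamp tokens (ids at or above
   # `tokenizer.timestamp_begin`) can be rendered with annotations such as
   # "<|1.08|>" via decode_with_timestamps().
   annotated = tokenizer.decode_with_timestamps(token_ids)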