sima_utils.transformer.llm_tokenizer
====================================

.. py:module:: sima_utils.transformer.llm_tokenizer


Attributes
----------

.. autoapisummary::

   sima_utils.transformer.llm_tokenizer.SENTENCEPIECE_CONFIG
   sima_utils.transformer.llm_tokenizer.TIKTOKEN_CONFIG
   sima_utils.transformer.llm_tokenizer.model_path


Classes
-------

.. autoapisummary::

   sima_utils.transformer.llm_tokenizer.TokenizerType
   sima_utils.transformer.llm_tokenizer.LlmTokenizer


Functions
---------

.. autoapisummary::

   sima_utils.transformer.llm_tokenizer.spm_with_added_tokens


Module Contents
---------------

.. py:data:: SENTENCEPIECE_CONFIG

.. py:data:: TIKTOKEN_CONFIG

.. py:class:: TokenizerType

   Supported tokenizer types:

   * ``spiece``: used by llama-2 and gemma-1/2/3.
   * ``tiktoken``: used by llama-3.1 and llama-3.2.

   .. py:attribute:: SPIECE
      :value: 'spiece'

   .. py:attribute:: TIKTOKEN
      :value: 'tiktoken'


.. py:function:: spm_with_added_tokens(model_path: str | pathlib.Path, added_tokens: dict[str, int]) -> sentencepiece.SentencePieceProcessor

   Add extra tokens to a binary tokenizer model.

   :param model_path: The path to the binary model file.
   :param added_tokens: A dict mapping each token to the token ID to be added.
   :returns: A loaded SentencePiece model with the extended vocabulary.


.. py:class:: LlmTokenizer(model_path: str | pathlib.Path, model_type: str, *, vlm_type: str, special_tokens: dict = {})

   LLM tokenizer.

   .. attribute:: model_type

      Type of tokenizer model.

   .. attribute:: vlm_type

      Type of vision language model for which the tokenizer is used.

   .. attribute:: model

      The loaded tokenizer model, either SentencePiece or TikToken.

   .. attribute:: vocab_size

      Size of the vocabulary.

   .. attribute:: bos_id

      Token ID for beginning-of-sequence.

   .. attribute:: eos_id

      Token ID for end-of-sequence.

   .. attribute:: pad_id

      Token ID for padding.

   .. attribute:: stop_tokens

      A list of token IDs that signal the end of a chat.

   .. attribute:: special_tokens

      A dict of special tokens for the vision language model.

   .. py:attribute:: model_type
      :type: str

   .. py:attribute:: vlm_type
      :type: str

   .. py:attribute:: model
      :type: sentencepiece.SentencePieceProcessor | tiktoken.core.Encoding

   .. py:attribute:: vocab_size
      :type: int

   .. py:attribute:: bos_id
      :type: int

   .. py:attribute:: eos_id
      :type: int

   .. py:attribute:: pad_id
      :type: int

   .. py:attribute:: stop_tokens
      :type: list[int]

   .. py:attribute:: special_tokens
      :type: dict

   .. py:method:: get_vocab_size() -> int

   .. py:method:: get_bos_id() -> int

   .. py:method:: get_eos_id() -> int

   .. py:method:: get_stop_tokens() -> list[int]

   .. py:method:: get_pad_id() -> int

   .. py:method:: encode(s: str, *, bos: bool, eos: bool) -> list[int]

      Encodes a string into a list of token IDs.

      :param s: Input string to be encoded.
      :param bos: Flag to prepend the beginning-of-sequence token.
      :param eos: Flag to append the end-of-sequence token.
      :returns: A list of token IDs.
      :rtype: list[int]

   .. py:method:: decode(t: list[int], remove_special_tokens: bool = True) -> str

      Decodes a list of token IDs into a string.

      :param t: List of token IDs to be decoded.
      :param remove_special_tokens: Whether to remove special tokens from the output.
      :returns: The decoded string.
      :rtype: str

   .. py:method:: decode_single_token(t: int, remove_special_tokens: bool = True) -> str | bytes


.. py:data:: model_path
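

Examples
--------

A minimal sketch of extending a tokenizer's vocabulary with
``spm_with_added_tokens``. The model file path and the added token below are
illustrative placeholders, not values shipped with this module.

.. code-block:: python

   from sima_utils.transformer.llm_tokenizer import spm_with_added_tokens

   # "tokenizer.model" is a hypothetical SentencePiece binary model file.
   sp = spm_with_added_tokens(
       "tokenizer.model",
       added_tokens={"<image>": 32000},  # illustrative token -> token_id pair
   )
   print(sp.vocab_size())  # reflects the extended vocabulary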
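A hedged round-trip sketch with ``LlmTokenizer``. The ``model_type`` string
follows the ``TokenizerType`` values documented above; the model path and the
``vlm_type`` string are assumptions, since this page does not list the
accepted ``vlm_type`` values.

.. code-block:: python

   from sima_utils.transformer.llm_tokenizer import LlmTokenizer

   # "tokenizer.model" is a placeholder path; "gemma" is an assumed vlm_type.
   tok = LlmTokenizer("tokenizer.model", model_type="spiece", vlm_type="gemma")

   ids = tok.encode("Hello, world!", bos=True, eos=False)
   text = tok.decode(ids)  # special tokens are removed by default
   print(ids, text)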