sima_utils.transformer.llm_tokenizer
====================================

.. py:module:: sima_utils.transformer.llm_tokenizer


Attributes
----------

.. autoapisummary::

   sima_utils.transformer.llm_tokenizer.SENTENCEPIECE_CONFIG
   sima_utils.transformer.llm_tokenizer.TIKTOKEN_CONFIG
   sima_utils.transformer.llm_tokenizer.model_path


Classes
-------

.. autoapisummary::

   sima_utils.transformer.llm_tokenizer.TokenizerType
   sima_utils.transformer.llm_tokenizer.LlmTokenizer


Functions
---------

.. autoapisummary::

   sima_utils.transformer.llm_tokenizer.spm_with_added_tokens


Module Contents
---------------

.. py:data:: SENTENCEPIECE_CONFIG

.. py:data:: TIKTOKEN_CONFIG

.. py:class:: TokenizerType

   Supported tokenizer types:

   * ``spiece``: used by llama-2 and gemma-1/2/3.
   * ``tiktoken``: used by llama-3.1 and llama-3.2.

   .. py:attribute:: SPIECE
      :value: 'spiece'

   .. py:attribute:: TIKTOKEN
      :value: 'tiktoken'


.. py:function:: spm_with_added_tokens(model_path: str | pathlib.Path, added_tokens: dict[str, int]) -> sentencepiece.SentencePieceProcessor

   Add extra tokens to a binary tokenizer model.

   :param model_path: The path to the binary model file.
   :param added_tokens: A dict mapping each token to the token ID to be added.
   :returns: A loaded SentencePiece model with the extended vocabulary.


.. py:class:: LlmTokenizer(model_path: str | pathlib.Path, model_type: str, *, vlm_type: str, special_tokens: dict = {})

   LLM tokenizer.

   .. attribute:: model_type

      Type of tokenizer model.

   .. attribute:: vlm_type

      Type of vision language model for which the tokenizer is used.

   .. attribute:: model

      The loaded tokenizer model, either SentencePiece or TikToken.

   .. attribute:: vocab_size

      Size of the vocabulary.

   .. attribute:: bos_id

      Token ID for beginning-of-sequence.

   .. attribute:: eos_id

      Token ID for end-of-sequence.

   .. attribute:: pad_id

      Token ID for padding.

   .. attribute:: stop_tokens

      A list of token IDs that signal the end of a chat.

   .. attribute:: special_tokens

      A dict of special tokens for the vision language model.

   .. py:attribute:: model_type
      :type: str

   .. py:attribute:: vlm_type
      :type: str

   .. py:attribute:: model
      :type: sentencepiece.SentencePieceProcessor | tiktoken.core.Encoding

   .. py:attribute:: vocab_size
      :type: int

   .. py:attribute:: bos_id
      :type: int

   .. py:attribute:: eos_id
      :type: int

   .. py:attribute:: pad_id
      :type: int

   .. py:attribute:: stop_tokens
      :type: list[int]

   .. py:attribute:: special_tokens
      :type: dict

   .. py:method:: get_vocab_size() -> int

   .. py:method:: get_bos_id() -> int

   .. py:method:: get_eos_id() -> int

   .. py:method:: get_stop_tokens() -> list[int]

   .. py:method:: get_pad_id() -> int

   .. py:method:: encode(s: str, *, bos: bool, eos: bool) -> list[int]

      Encodes a string into a list of token IDs.

      :param s: Input string to be encoded.
      :param bos: Flag to prepend the beginning-of-sequence token.
      :param eos: Flag to append the end-of-sequence token.
      :returns: A list of token IDs.
      :rtype: list[int]

   .. py:method:: decode(t: list[int], remove_special_tokens: bool = True) -> str

      Decodes a list of token IDs into a string.

      :param t: List of token IDs to be decoded.
      :param remove_special_tokens: Whether to remove special tokens from the output.
      :returns: The decoded string.
      :rtype: str

   .. py:method:: decode_single_token(t: int, remove_special_tokens: bool = True) -> str | bytes


.. py:data:: model_path
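

Examples
--------

A minimal sketch of extending a tokenizer's vocabulary with
``spm_with_added_tokens``. The model file path and the added token below are
illustrative placeholders, not values shipped with this module.

.. code-block:: python

   from sima_utils.transformer.llm_tokenizer import spm_with_added_tokens

   # "tokenizer.model" is a hypothetical SentencePiece binary model file.
   sp = spm_with_added_tokens(
       "tokenizer.model",
       added_tokens={"<image>": 32000},  # illustrative token -> token_id pair
   )
   print(sp.vocab_size())  # reflects the extended vocabulary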
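A hedged round-trip sketch with ``LlmTokenizer``. The ``model_type`` string
follows the ``TokenizerType`` values documented above; the model path and the
``vlm_type`` string are assumptions, since this page does not list the
accepted ``vlm_type`` values.

.. code-block:: python

   from sima_utils.transformer.llm_tokenizer import LlmTokenizer

   # "tokenizer.model" is a placeholder path; "gemma" is an assumed vlm_type.
   tok = LlmTokenizer("tokenizer.model", model_type="spiece", vlm_type="gemma")

   ids = tok.encode("Hello, world!", bos=True, eos=False)
   text = tok.decode(ids)  # special tokens are removed by default
   print(ids, text)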