sima_utils.transformer.llm_tokenizer
Attributes
- SENTENCEPIECE_CONFIG
- TIKTOKEN_CONFIG
Classes
- TokenizerType: Supported tokenizer type.
- LlmTokenizer: LLM tokenizer.
Functions
- spm_with_added_tokens: Add extra tokens to a binary tokenizer model.
Module Contents
- sima_utils.transformer.llm_tokenizer.SENTENCEPIECE_CONFIG
- sima_utils.transformer.llm_tokenizer.TIKTOKEN_CONFIG
- class sima_utils.transformer.llm_tokenizer.TokenizerType
Supported tokenizer type.
- spiece: used by llama-2 and gemma-1/2/3
- tiktoken: used by llama-3.1 and llama-3.2
- SPIECE = 'spiece'
- TIKTOKEN = 'tiktoken'
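The two members above can be sketched as a standard Python Enum (a minimal reconstruction based on the documented values, not the exact source):

```python
from enum import Enum

class TokenizerType(Enum):
    # spiece: used by llama-2 and gemma-1/2/3
    SPIECE = "spiece"
    # tiktoken: used by llama-3.1 and llama-3.2
    TIKTOKEN = "tiktoken"

print(TokenizerType("spiece"))  # TokenizerType.SPIECE
```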
- sima_utils.transformer.llm_tokenizer.spm_with_added_tokens(model_path: str | pathlib.Path, added_tokens: dict[str, int]) → sentencepiece.SentencePieceProcessor
Add extra tokens to a binary tokenizer model.
- Parameters:
model_path – The path to the binary model file.
added_tokens – A dict mapping each token to the token ID to be added.
- Returns:
A loaded SentencePiece model with the extended vocabulary.
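The effect of extending a vocabulary can be illustrated with a minimal pure-Python sketch; the real function modifies the binary SentencePiece model itself, and all names and token IDs below are illustrative only:

```python
class ExtendedVocab:
    """Sketch of a base vocabulary extended with added tokens."""

    def __init__(self, base_vocab: dict[str, int], added_tokens: dict[str, int]):
        # Added tokens are layered on top of the base vocabulary;
        # their pieces must not collide with existing pieces.
        assert not (set(base_vocab) & set(added_tokens))
        self.vocab = {**base_vocab, **added_tokens}

    def piece_to_id(self, piece: str) -> int:
        return self.vocab[piece]

base = {"<s>": 1, "</s>": 2, "hello": 3}
extended = ExtendedVocab(base, {"<image>": 4})
print(extended.piece_to_id("<image>"))  # 4
```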
- class sima_utils.transformer.llm_tokenizer.LlmTokenizer(model_path: str | pathlib.Path, model_type: str, *, vlm_type: str, special_tokens: dict = {})
LLM tokenizer.
- model_type
Type of tokenizer model.
- vlm_type
Type of vision language model for which the tokenizer is used.
- model
The loaded tokenizer model, either SentencePiece or TikToken.
- vocab_size
Size of the vocabulary.
- bos_id
Token ID for beginning-of-sequence.
- eos_id
Token ID for end-of-sequence.
- pad_id
Token ID for padding.
- stop_tokens
A list of token IDs that indicate the end of a chat turn.
- special_tokens
A dict of special tokens for the vision language model.
- model_type: str
- vlm_type: str
- model: sentencepiece.SentencePieceProcessor | tiktoken.core.Encoding
- vocab_size: int
- bos_id: int
- eos_id: int
- pad_id: int
- stop_tokens: list[int]
- special_tokens: dict
- get_vocab_size() → int
- get_bos_id() → int
- get_eos_id() → int
- get_stop_tokens() → list[int]
- get_pad_id() → int
- encode(s: str, *, bos: bool, eos: bool) → list[int]
Encodes a string into a list of token IDs.
- Parameters:
s – Input string to be encoded.
bos – Flag to prepend the beginning-of-sequence token.
eos – Flag to append the end-of-sequence token.
- Returns:
A list of token IDs.
- Return type:
list[int]
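The bos/eos handling can be sketched in pure Python with illustrative token IDs; the actual string-to-ID encoding is delegated to the underlying SentencePiece or TikToken model:

```python
BOS_ID, EOS_ID = 1, 2  # illustrative IDs

def encode_ids(body_ids: list[int], *, bos: bool, eos: bool) -> list[int]:
    """Wrap already-tokenized IDs with optional BOS/EOS markers."""
    out = list(body_ids)
    if bos:
        out.insert(0, BOS_ID)
    if eos:
        out.append(EOS_ID)
    return out

print(encode_ids([5, 6], bos=True, eos=True))  # [1, 5, 6, 2]
```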
- decode(t: list[int], remove_special_tokens: bool = True) → str
Decodes a list of token IDs into a string.
- Parameters:
t – List of token IDs to be decoded.
remove_special_tokens – Whether to remove special tokens from the decoded output.
- Returns:
The decoded string.
- Return type:
str
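The remove_special_tokens behaviour can be sketched similarly (illustrative IDs and vocabulary; the real decoding is handled by the loaded tokenizer model):

```python
SPECIAL_IDS = {1, 2}  # illustrative bos/eos IDs
ID_TO_PIECE = {1: "<s>", 2: "</s>", 5: "hello", 6: "world"}

def decode_ids(ids: list[int], remove_special_tokens: bool = True) -> str:
    """Map token IDs back to text, optionally dropping special tokens."""
    if remove_special_tokens:
        ids = [i for i in ids if i not in SPECIAL_IDS]
    return " ".join(ID_TO_PIECE[i] for i in ids)

print(decode_ids([1, 5, 6, 2]))  # hello world
```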
- decode_single_token(t: int, remove_special_tokens: bool = True) → str | bytes
- sima_utils.transformer.llm_tokenizer.model_path