sima_utils.transformer.llm_tokenizer

Attributes

SENTENCEPIECE_CONFIG

TIKTOKEN_CONFIG

model_path

Classes

TokenizerType

Supported tokenizer type.

LlmTokenizer

LLM tokenizer.

Functions

spm_with_added_tokens(...)

Add extra tokens to a binary tokenizer model.

Module Contents

sima_utils.transformer.llm_tokenizer.SENTENCEPIECE_CONFIG
sima_utils.transformer.llm_tokenizer.TIKTOKEN_CONFIG
class sima_utils.transformer.llm_tokenizer.TokenizerType

Supported tokenizer types.
  • spiece – used by llama-2, gemma-1, gemma-2, gemma-3.

  • tiktoken – used by llama-3.1, llama-3.2.

SPIECE = 'spiece'
TIKTOKEN = 'tiktoken'
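
The enum above can be sketched in plain Python; the model-family mapping below is purely illustrative, inferred from the docstring (it is not part of the module's API):

```python
from enum import Enum

class TokenizerType(Enum):
    """Supported tokenizer types (mirrors the enum documented above)."""
    SPIECE = "spiece"      # used by llama-2, gemma-1/2/3
    TIKTOKEN = "tiktoken"  # used by llama-3.1, 3.2

# Hypothetical lookup from model family to tokenizer type.
MODEL_TOKENIZERS = {
    "llama-2": TokenizerType.SPIECE,
    "gemma-2": TokenizerType.SPIECE,
    "llama-3.1": TokenizerType.TIKTOKEN,
}
```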
sima_utils.transformer.llm_tokenizer.spm_with_added_tokens(model_path: str | pathlib.Path, added_tokens: dict[str, int]) → sentencepiece.SentencePieceProcessor

Add extra tokens to a binary tokenizer model.

Parameters:
  • model_path – The path to the binary model file.

  • added_tokens – A dict mapping each token to the token ID to be added.

Returns:

A loaded SentencePiece model with the extended vocabulary.
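
A minimal sketch of the vocabulary-extension idea, using a plain dict in place of the binary SentencePiece model (the real function loads and extends a serialized model; the helper name and dict representation here are assumptions for illustration only):

```python
def with_added_tokens(vocab: dict[str, int],
                      added_tokens: dict[str, int]) -> dict[str, int]:
    """Return a copy of vocab extended with the given token -> id entries.

    Mirrors the intent of spm_with_added_tokens, which performs the same
    extension on a binary SentencePiece model file.
    """
    extended = dict(vocab)
    for token, token_id in added_tokens.items():
        # Refuse to silently remap a token that already exists.
        if token in extended and extended[token] != token_id:
            raise ValueError(f"token {token!r} already mapped to {extended[token]}")
        extended[token] = token_id
    return extended

base = {"<s>": 1, "</s>": 2, "hello": 3}
extended = with_added_tokens(base, {"<image>": 4})
```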

class sima_utils.transformer.llm_tokenizer.LlmTokenizer(model_path: str | pathlib.Path, model_type: str, *, vlm_type: str, special_tokens: dict = {})

LLM tokenizer.

model_type

Type of tokenizer model.

vlm_type

Type of vision language model for which the tokenizer is used.

model

The loaded tokenizer model, either SentencePiece or TikToken.

vocab_size

Size of the vocabulary.

bos_id

Token ID for beginning-of-sequence.

eos_id

Token ID for end-of-sequence.

pad_id

Token ID for padding.

stop_tokens

A list of token IDs that indicate the end of a chat.

special_tokens

A dict of special tokens for the vision language model.

model_type: str
vlm_type: str
model: sentencepiece.SentencePieceProcessor | tiktoken.core.Encoding
vocab_size: int
bos_id: int
eos_id: int
pad_id: int
stop_tokens: list[int]
special_tokens: dict
get_vocab_size() → int
get_bos_id() → int
get_eos_id() → int
get_stop_tokens() → list[int]
get_pad_id() → int
encode(s: str, *, bos: bool, eos: bool) → list[int]

Encodes a string into a list of token IDs.

Parameters:
  • s – Input string to be encoded.

  • bos – Flag to prepend the beginning-of-sequence token.

  • eos – Flag to append the end-of-sequence token.

Returns:

A list of token IDs.

Return type:

list[int]
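
The effect of the bos and eos flags can be sketched with a toy whitespace tokenizer; the vocabulary, default IDs, and splitting rule below are illustrative stand-ins, not the library's actual behavior:

```python
def encode(s: str, vocab: dict[str, int], *, bos: bool, eos: bool,
           bos_id: int = 1, eos_id: int = 2) -> list[int]:
    """Tokenize s on whitespace and optionally wrap with BOS/EOS ids."""
    ids = [vocab[piece] for piece in s.split()]
    if bos:
        ids = [bos_id] + ids   # prepend beginning-of-sequence
    if eos:
        ids = ids + [eos_id]   # append end-of-sequence
    return ids

vocab = {"hello": 10, "world": 11}
encode("hello world", vocab, bos=True, eos=True)  # → [1, 10, 11, 2]
```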

decode(t: list[int], remove_special_tokens: bool = True) → str

Decodes a list of token IDs into a string.

Parameters:
  • t – List of token IDs to be decoded.

  • remove_special_tokens – If True, special tokens are removed from the decoded output.

Returns:

The decoded string.

Return type:

str
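
The inverse operation can be sketched the same way; the inverse vocabulary and the set of special-token IDs below are illustrative assumptions mirroring the encode sketch, not the library's real data:

```python
def decode(t: list[int], inv_vocab: dict[int, str],
           special_ids: set[int], remove_special_tokens: bool = True) -> str:
    """Map token ids back to strings, optionally dropping special tokens."""
    if remove_special_tokens:
        t = [i for i in t if i not in special_ids]
    return " ".join(inv_vocab[i] for i in t)

inv_vocab = {1: "<s>", 2: "</s>", 10: "hello", 11: "world"}
decode([1, 10, 11, 2], inv_vocab, special_ids={1, 2})  # → 'hello world'
```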

decode_single_token(t: int, remove_special_tokens: bool = True) → str | bytes
sima_utils.transformer.llm_tokenizer.model_path