sima_utils.transformer.llm_tokenizer

Attributes

SENTENCEPIECE_CONFIG

TIKTOKEN_CONFIG

model_path

Classes

TokenizerType

Supported tokenizer type.

LlmTokenizer

LLM tokenizer.

Functions

spm_with_added_tokens(...)

Add extra tokens to a binary tokenizer model.

Module Contents

sima_utils.transformer.llm_tokenizer.SENTENCEPIECE_CONFIG
sima_utils.transformer.llm_tokenizer.TIKTOKEN_CONFIG
class sima_utils.transformer.llm_tokenizer.TokenizerType

Supported tokenizer types.
  • spiece – used by llama-2, gemma-1, gemma-2, gemma-3.

  • tiktoken – used by llama-3.1, llama-3.2.

SPIECE = 'spiece'
TIKTOKEN = 'tiktoken'
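
The enum above can be sketched in plain Python; the model-family mapping below is purely illustrative, inferred from the docstring (it is not part of the module's API):

```python
from enum import Enum

class TokenizerType(Enum):
    """Supported tokenizer types (mirrors the enum documented above)."""
    SPIECE = "spiece"      # used by llama-2, gemma-1/2/3
    TIKTOKEN = "tiktoken"  # used by llama-3.1, 3.2

# Hypothetical lookup from model family to tokenizer type.
MODEL_TOKENIZERS = {
    "llama-2": TokenizerType.SPIECE,
    "gemma-2": TokenizerType.SPIECE,
    "llama-3.1": TokenizerType.TIKTOKEN,
}
```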
sima_utils.transformer.llm_tokenizer.spm_with_added_tokens(model_path: str | pathlib.Path, added_tokens: dict[str, int]) → sentencepiece.SentencePieceProcessor

Add extra tokens to a binary tokenizer model.

Parameters:
  • model_path – The path to the binary model file.

  • added_tokens – A dict mapping each token to the token ID to be added.

Returns:

A loaded SentencePiece model with the extended vocabulary.
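
A minimal sketch of the vocabulary-extension idea, using a plain dict in place of the binary SentencePiece model (the real function loads and extends a serialized model; the helper name and dict representation here are assumptions for illustration only):

```python
def with_added_tokens(vocab: dict[str, int],
                      added_tokens: dict[str, int]) -> dict[str, int]:
    """Return a copy of vocab extended with the given token -> id entries.

    Mirrors the intent of spm_with_added_tokens, which performs the same
    extension on a binary SentencePiece model file.
    """
    extended = dict(vocab)
    for token, token_id in added_tokens.items():
        # Refuse to silently remap a token that already exists.
        if token in extended and extended[token] != token_id:
            raise ValueError(f"token {token!r} already mapped to {extended[token]}")
        extended[token] = token_id
    return extended

base = {"<s>": 1, "</s>": 2, "hello": 3}
extended = with_added_tokens(base, {"<image>": 4})
```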

class sima_utils.transformer.llm_tokenizer.LlmTokenizer(model_path: str | pathlib.Path, model_type: str, *, vlm_type: str, special_tokens: dict = {})

LLM tokenizer.

model_type

Type of tokenizer model.

vlm_type

Type of vision language model for which the tokenizer is used.

model

The loaded tokenizer model, either SentencePiece or TikToken.

vocab_size

Size of the vocabulary.

bos_id

Token ID for beginning-of-sequence.

eos_id

Token ID for end-of-sequence.

pad_id

Token ID for padding.

stop_tokens

A list of token IDs that indicate the end of a chat.

special_tokens

A dict of special tokens for the vision language model.

model_type: str
vlm_type: str
model: sentencepiece.SentencePieceProcessor | tiktoken.core.Encoding
vocab_size: int
bos_id: int
eos_id: int
pad_id: int
stop_tokens: list[int]
special_tokens: dict
get_vocab_size() → int
get_bos_id() → int
get_eos_id() → int
get_stop_tokens() → list[int]
get_pad_id() → int
encode(s: str, *, bos: bool, eos: bool) → list[int]

Encodes a string into a list of token IDs.

Parameters:
  • s – Input string to be encoded.

  • bos – Flag to prepend the beginning-of-sequence token.

  • eos – Flag to append the end-of-sequence token.

Returns:

A list of token IDs.

Return type:

list[int]
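
The effect of the bos and eos flags can be sketched with a toy whitespace tokenizer; the vocabulary, default IDs, and splitting rule below are illustrative stand-ins, not the library's actual behavior:

```python
def encode(s: str, vocab: dict[str, int], *, bos: bool, eos: bool,
           bos_id: int = 1, eos_id: int = 2) -> list[int]:
    """Tokenize s on whitespace and optionally wrap with BOS/EOS ids."""
    ids = [vocab[piece] for piece in s.split()]
    if bos:
        ids = [bos_id] + ids   # prepend beginning-of-sequence
    if eos:
        ids = ids + [eos_id]   # append end-of-sequence
    return ids

vocab = {"hello": 10, "world": 11}
encode("hello world", vocab, bos=True, eos=True)  # → [1, 10, 11, 2]
```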

decode(t: list[int], remove_special_tokens: bool = True) → str

Decodes a list of token IDs into a string.

Parameters:
  • t – List of token IDs to be decoded.

  • remove_special_tokens – If True, special tokens are removed from the decoded output.

Returns:

The decoded string.

Return type:

str
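
The inverse operation can be sketched the same way; the inverse vocabulary and the set of special-token IDs below are illustrative assumptions mirroring the encode sketch, not the library's real data:

```python
def decode(t: list[int], inv_vocab: dict[int, str],
           special_ids: set[int], remove_special_tokens: bool = True) -> str:
    """Map token ids back to strings, optionally dropping special tokens."""
    if remove_special_tokens:
        t = [i for i in t if i not in special_ids]
    return " ".join(inv_vocab[i] for i in t)

inv_vocab = {1: "<s>", 2: "</s>", 10: "hello", 11: "world"}
decode([1, 10, 11, 2], inv_vocab, special_ids={1, 2})  # → 'hello world'
```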

decode_single_token(t: int, remove_special_tokens: bool = True) → str | bytes
sima_utils.transformer.llm_tokenizer.model_path