sima_utils.transformer.vlm_config

Attributes

MLA_CONSTRAINTS

SPECIAL_TOKENS

input_path

Classes

ExtendedEnum

Add class methods to Enum.

VlmArchType

VLM architecture type.

VisionArchType

Vision architecture type.

LlmArchType

LLM architecture type.

LlmArchVersion

LLM architecture version.

LlmDataType

LLM data type.

GgufDataType

LLM data type format as defined by GGUF.

BaseConfig

Base configuration with an update method.

VisionModelConfig

Configuration of Vision Encoder.

MMConnectionConfig

Configuration of Multi-Modal Connection.

TokenEmbedConfig

Configuration of tokenizer and embedding.

RopeScalingConfig

Configuration of RoPE Scaling.

RoPEConfig

Configuration of Rotary Position Embedding.

AttentionBlockConfig

Configuration of attention block.

MlpBlockConfig

Configuration of MLP block.

LanguageModelConfig

Configuration of LLM.

PipelineConfig

Configuration of VLM pipeline.

VlmConfig

Configuration of Vision Language Model.

VlmHelper

VLM helper class with processors.

Functions

get_model_arch_gen(→ tuple[VisionArchType | None, ...)

Derive VLM architecture and version from the model arch and type.

get_model_version_from_intermediate_size(→ LlmArchVersion)

Get the version of a model by matching intermediate_size.

get_model_size_from_hidden_parameters(→ str)

Get the size of a model by matching hidden_size and number of layers.

llm_parameter_count(→ int)

Calculate parameter count of an LLM model.

apply_mla_constraint(→ None)

Apply MLA constraints.

Module Contents

class sima_utils.transformer.vlm_config.ExtendedEnum

Add class methods to Enum.

classmethod values()
classmethod names()
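The listing above only names the two class methods; the following is a minimal sketch of how such an Enum mixin is commonly written, assuming values() and names() return lists of member values and member names (hypothetical names ExtendedEnumSketch and VisionArchSketch, not the module's own code):

    from enum import Enum

    class ExtendedEnumSketch(Enum):
        # Mixin-style Enum base: subclasses gain list accessors for values and names.
        @classmethod
        def values(cls) -> list:
            return [member.value for member in cls]

        @classmethod
        def names(cls) -> list:
            return [member.name for member in cls]

    class VisionArchSketch(ExtendedEnumSketch):
        CLIP = 'clip'
        SIGLIP = 'siglip'

    assert VisionArchSketch.values() == ['clip', 'siglip']
    assert VisionArchSketch.names() == ['CLIP', 'SIGLIP']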
class sima_utils.transformer.vlm_config.VlmArchType

VLM architecture type.

VLM_LLAVA = 'vlm-llava'
VLM_PALIGEMMA = 'vlm-paligemma'
VLM_GEMMA3 = 'vlm-gemma3'
VLM_CUSTOM = 'vlm-custom'
LLM_LLAMA2 = 'llm-llama2'
LLM_LLAMA3_1 = 'llm-llama3.1'
LLM_LLAMA3_2 = 'llm-llama3.2'
LLM_GEMMA1 = 'llm-gemma1'
LLM_GEMMA2 = 'llm-gemma2'
LLM_GEMMA3 = 'llm-gemma3'
LLM_PHI3_5 = 'llm-phi3.5'
class sima_utils.transformer.vlm_config.VisionArchType

Vision architecture type.

CLIP = 'clip'
SIGLIP = 'siglip'
class sima_utils.transformer.vlm_config.LlmArchType

LLM architecture type.

LLAMA = 'llama'
GEMMA = 'gemma'
PHI = 'phi'
class sima_utils.transformer.vlm_config.LlmArchVersion

LLM architecture version.

GEN_1 = '1'
GEN_2 = '2'
GEN_3 = '3'
GEN_3_1 = '3.1'
GEN_3_2 = '3.2'
GEN_3_5 = '3.5'
class sima_utils.transformer.vlm_config.LlmDataType

LLM data type.

F32 = 'float32'
F16 = 'float16'
BF16 = 'bfloat16'
class sima_utils.transformer.vlm_config.GgufDataType

LLM data type format as defined by GGUF.

F32 = 0
F16 = 1
sima_utils.transformer.vlm_config.MLA_CONSTRAINTS: dict
class sima_utils.transformer.vlm_config.BaseConfig

Base configuration with an update method.

update(cfg)
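The semantics of update are not spelled out above; a plausible sketch, assuming it copies matching keys from a dict onto dataclass fields and ignores unknown keys (hypothetical names, not the module source):

    import dataclasses

    @dataclasses.dataclass
    class BaseConfigSketch:
        def update(self, cfg: dict) -> None:
            # Copy only keys that correspond to declared fields; ignore the rest.
            for field in dataclasses.fields(self):
                if field.name in cfg:
                    setattr(self, field.name, cfg[field.name])

    @dataclasses.dataclass
    class TinyConfig(BaseConfigSketch):
        hidden_size: int = 0
        num_hidden_layers: int = 0

    cfg = TinyConfig()
    cfg.update({'hidden_size': 4096, 'unknown_key': 1})
    assert cfg.hidden_size == 4096 and cfg.num_hidden_layers == 0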
class sima_utils.transformer.vlm_config.VisionModelConfig

Configuration of Vision Encoder.

model_type

The type of the model.

image_size

The resolution of input images.

patch_size

The patch size to divide an image.

hidden_size

The dimension of embedding.

intermediate_size

The dimension of MLP layer.

num_attention_heads

The number of attention heads.

num_hidden_layers

The number of transformer blocks.

hidden_act

The type of activation in MLP.

layer_norm_eps

The small value to prevent division by zero.

model_type: str = ''
arch: VisionArchType
image_size: int = 0
patch_size: int = 0
cls_embed: bool = False
hidden_size: int = 0
intermediate_size: int = 0
num_attention_heads: int = 0
num_hidden_layers: int = 0
hidden_act: str = 'gelu_pytorch_tanh'
layer_norm_eps: float = 1e-06
set_default_config(arch: VisionArchType)
property num_patches: int
property seq_len: int
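The two properties are not defined above; under the usual ViT convention they would be derived as below. The CLIP ViT-L/14-336 numbers are illustrative and assumed rather than taken from the module:

    image_size = 336
    patch_size = 14
    cls_embed = True                                   # CLIP prepends a [CLS] token; SigLIP does not

    num_patches = (image_size // patch_size) ** 2      # 24 * 24 = 576 patch tokens
    seq_len = num_patches + (1 if cls_embed else 0)    # 577 with the [CLS] token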
class sima_utils.transformer.vlm_config.MMConnectionConfig

Configuration of Multi-Modal Connection.

The MM connection consists of 1 or 2 linear layers.

num_layers

The number of linear layers in the connection.

hidden_act

The type of activation if num_layers is 2.

mm_tokens_per_image

The number of tokens projected for each image. If mm_tokens_per_image is less than num_patches, AvgPool is inserted.

proj_dim

The output dimension of the projection (typically the LLM hidden size).

num_layers: int = 2
hidden_act: str = 'gelu'
mm_tokens_per_image: int = 0
proj_dim: int = 0
set_default_config(vm_arch: VisionArchType, num_patches: int, lm_hidden_size: int)
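To illustrate the pooling note on mm_tokens_per_image, the sketch below works out the average-pooling window for a 576-patch encoder projected to 144 image tokens; the numbers and the square-grid assumption are illustrative, not module defaults:

    num_patches = 576                         # 24 x 24 grid from the vision encoder
    mm_tokens_per_image = 144                 # 12 x 12 after pooling

    grid = int(num_patches ** 0.5)            # 24
    target = int(mm_tokens_per_image ** 0.5)  # 12
    pool_window = grid // target              # 2 x 2 average-pooling window
    assert (grid // pool_window) ** 2 == mm_tokens_per_image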
sima_utils.transformer.vlm_config.SPECIAL_TOKENS = ['ignore_index', 'image_token_index', 'boi_token_index', 'eoi_token_index', 'bos_token_id',...
class sima_utils.transformer.vlm_config.TokenEmbedConfig

Configuration of tokenizer and embedding.

tokenizer_type

The type of tokenizer.

tokenizer_path

The path of the tokenizer model.

vocab_size

The size of vocabulary of the tokenizer.

special_tokens

The dict of special tokens.

tokenizer_type: str = ''
tokenizer_path: str = ''
vocab_size: int = 0
special_tokens: dict
add_special_token(name, value)
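A usage sketch for add_special_token, assuming TokenEmbedConfig can be constructed with its defaults; the tokenizer type, vocabulary size, and token id are illustrative values, and 'image_token_index' is one of the names listed in SPECIAL_TOKENS above:

    from sima_utils.transformer.vlm_config import TokenEmbedConfig

    token_cfg = TokenEmbedConfig()
    token_cfg.tokenizer_type = 'sentencepiece'         # illustrative tokenizer type
    token_cfg.vocab_size = 32000                       # illustrative vocabulary size
    token_cfg.add_special_token('image_token_index', 32000)
    print(token_cfg.special_tokens)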
class sima_utils.transformer.vlm_config.RopeScalingConfig

Configuration of RoPE Scaling.

factor

The scaling factor.

low_freq_factor

The low frequency factor (llama3).

high_freq_factor

The high frequency factor (llama3).

original_max_position_embeddings

The original context length used in model training with the given RoPE settings.

long_factor

List of scaling factors for long context (longrope).

short_factor

List of scaling factors for short context (longrope).

rope_type

The type of RoPE scaling method. Supported types are "linear" or "default", "llama3", and "longrope".

factor: float = 1.0
low_freq_factor: float = 0
high_freq_factor: float = 0
original_max_position_embeddings: int = 0
long_factor: list[float] | None = None
short_factor: list[float] | None = None
rope_type: str = 'default'
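For reference, a llama3-style scaling block mapped onto these fields; the numbers mirror what Llama 3.1 checkpoints commonly ship in config.json and are an example, not defaults of this class:

    rope_scaling_example = {
        'rope_type': 'llama3',
        'factor': 8.0,
        'low_freq_factor': 1.0,
        'high_freq_factor': 4.0,
        'original_max_position_embeddings': 8192,
    }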
class sima_utils.transformer.vlm_config.RoPEConfig

Configuration of Rotary Position Embedding.

rope_theta

The theta for RoPE.

rope_local_base_freq

The local base frequency.

rope_scaling

The settings for RoPE scaling.

rope_theta: float = 10000
rope_local_base_freq: float = 10000
rope_scaling: RopeScalingConfig
class sima_utils.transformer.vlm_config.AttentionBlockConfig

Configuration of attention block.

num_attention_heads

The number of attention heads.

num_key_value_heads

The number of key/value heads.

head_dim

The dimension of query, key, and value heads.

swa_enable

The flag to turn on sliding window attention.

swa_ratio

The ratio of SWA over global attention.

sliding_window

The size of sliding window for SWA.

attention_bias

Reserved for future use.

attention_dropout

Reserved for future use.

query_pre_attn_scalar

Reserved for future use.

num_attention_heads: int = 0
num_key_value_heads: int = 0
head_dim: int = 0
swa_enable: bool = False
swa_ratio: int = 0
sliding_window: int = 0
attention_bias: bool = False
attention_dropout: float = 0.0
query_pre_attn_scalar: int = 0
property q_size: int
property kv_size: int
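The q_size and kv_size properties are not defined above; under the conventional grouped-query-attention layout they would be derived as below (illustrative Llama-style numbers, assumed rather than taken from the module):

    num_attention_heads = 32
    num_key_value_heads = 8                     # grouped-query attention: 4 query heads per KV head
    head_dim = 128

    q_size = num_attention_heads * head_dim     # 4096: width of the query projection
    kv_size = num_key_value_heads * head_dim    # 1024: width of each key/value projection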
class sima_utils.transformer.vlm_config.MlpBlockConfig

Configuration of MLP block.

intermediate_size

The dimension of MLP layer.

act

The type of activation.

num_layers

The number of layers in MLP.

mlp_bias

Reserved for future use.

intermediate_size: int = 0
act: str = 'silu'
num_layers: int = 3
mlp_bias: bool = False
class sima_utils.transformer.vlm_config.LanguageModelConfig

Configuration of LLM.

model_type

The type of LLM.

data_type

The data type.

arch

The architecture of the LLM.

gen

The generation version of the LLM.

size

The size designation of the LLM in billions of parameters (e.g., '7b').

token_cfg

The settings of tokenizer.

rope_cfg

The settings of RoPE.

attn_cfg

The settings of attention block.

mlp_cfg

The settings of MLP block.

hidden_size

The dimension of the embedding.

num_hidden_layers

The number of transformer blocks.

max_position_embeddings

The context length.

rms_norm_eps

The small value in RMS norm to prevent division by zero.

layer_norms

Types of RMS norm used in a transformer block.

attn_logit_softcapping

Gemma 2 attention logit soft capping.

final_logit_softcapping

Gemma 2 final logit soft capping.

model_type: str = ''
data_type: LlmDataType
arch: LlmArchType
gen: LlmArchVersion
size: str = '7b'
token_cfg: TokenEmbedConfig
rope_cfg: RoPEConfig
attn_cfg: AttentionBlockConfig
mlp_cfg: MlpBlockConfig
hidden_size: int = 0
num_hidden_layers: int = 0
max_position_embeddings: int = 0
rms_norm_eps: float = 1e-05
layer_norms: list[str] = []
attn_logit_softcapping: float | None = None
final_logit_softcapping: float | None = None
lm_head_num_splits: int = 1
lm_head_split_dim: int = 0
static load(cfg: dict) LanguageModelConfig
update(cfg: dict)
class sima_utils.transformer.vlm_config.PipelineConfig

Configuration of VLM pipeline.

system_prompt

The system prompt.

max_num_tokens

The max number of tokens including the input and generated tokens.

input_token_group_size

The group size of input tokens.

input_token_group_offsets

The group offsets of input tokens.

system_prompt: str | None = None
max_num_tokens: int = 1024
input_token_group_size: int = 1
input_token_group_offsets: list[int] | None = None
future_token_mask_size: int = 1
set_system_prompt(prompt: str | None)
set_max_num_tokens(max_num_tokens: int)
set_group_size(size: int)
set_group_offsets(offsets: list[int])
set_future_token_mask_size(mask_size: int)
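A usage sketch for the setters, assuming PipelineConfig can be constructed with its defaults; the prompt and limits below are illustrative values:

    from sima_utils.transformer.vlm_config import PipelineConfig

    pipeline_cfg = PipelineConfig()
    pipeline_cfg.set_system_prompt('You are a helpful assistant.')
    pipeline_cfg.set_max_num_tokens(2048)
    pipeline_cfg.set_group_size(16)
    pipeline_cfg.set_future_token_mask_size(4)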
class sima_utils.transformer.vlm_config.VlmConfig

Configuration of Vision Language Model.

model_name

The name of the model.

Type:

str

model_type

The type of the model.

Type:

str

vm_cfg

The settings of vision model.

Type:

VisionModelConfig | None

mm_cfg

The settings of multi-modal connection.

Type:

MMConnectionConfig | None

lm_cfg

The settings of language model.

Type:

LanguageModelConfig

pipeline_cfg

The settings of application pipeline.

Type:

PipelineConfig

model_name: str = ''
model_type: VlmArchType | None = None
vm_cfg: VisionModelConfig | None = None
mm_cfg: MMConnectionConfig | None = None
lm_cfg: LanguageModelConfig
pipeline_cfg: PipelineConfig
static load(vlm_cfg: dict) VlmConfig
set_default_config(dtype: LlmDataType, vm_arch: VisionArchType | None, lm_arch: LlmArchType, gen: LlmArchVersion, b_size: str)
set_tokenizer_path(tokenizer_path: pathlib.Path)
static from_hf_config(model_path: pathlib.Path, model_cfg: dict) VlmConfig
Generate SiMa's configuration for a VLM from a HuggingFace config dict and MLA constraints.

Parameters:
  • model_path – The path of the source model.

  • model_cfg – The config dict of the source model.

Returns:

VlmConfig for the model.

property is_multimodal
update_special_tokens(cfg: dict)
update_vision_model_params(cfg: dict)
update_mm_connection_params(cfg: dict)
update_language_model_params(cfg: dict)
config_pipeline(system_prompt: str | None, max_num_tokens: int, tokenizer: sima_utils.transformer.llm_tokenizer.LlmTokenizer, estimated_max_num_query_tokens: int = 100)
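A usage sketch for from_hf_config, per the signature documented above; the checkpoint path is a placeholder, and config.json is assumed to be the HuggingFace config of the downloaded model:

    import json
    import pathlib

    from sima_utils.transformer.vlm_config import VlmConfig

    model_path = pathlib.Path('/path/to/hf/model')             # placeholder checkpoint directory
    model_cfg = json.loads((model_path / 'config.json').read_text())
    vlm_cfg = VlmConfig.from_hf_config(model_path, model_cfg)
    print(vlm_cfg.model_name, vlm_cfg.is_multimodal)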
class sima_utils.transformer.vlm_config.VlmHelper(vlm_cfg: VlmConfig, system_prompt: str | None = None)

VLM helper class with processors.

tokenizer: sima_utils.transformer.llm_tokenizer.LlmTokenizer
prompt_formatter: sima_utils.transformer.prompt_template.PromptFormatter
image_preprocessor: sima_utils.transformer.vision_preprocessor.ImageProcessor | None
preprocess(query: str, image: pathlib.Path | str | numpy.ndarray | None) tuple[str, numpy.ndarray, numpy.ndarray | None]

Preprocess the input query and the image.

Parameters:
  • query – Input query string.

  • image – Path to the image or the loaded image in numpy array. Set to None if no image.

Returns:

Tuple of formatted prompt, tokenized input query and preprocessed image.

postprocess(output_tokens: numpy.ndarray | list[int]) str
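A usage sketch tying the helper's preprocess and postprocess around an inference step; vlm_cfg is assumed to come from VlmConfig.from_hf_config (see above), the query and image path are placeholders, and run_model is a hypothetical stand-in for whatever executes the VLM:

    from sima_utils.transformer.vlm_config import VlmHelper

    helper = VlmHelper(vlm_cfg, system_prompt='Describe the image.')
    prompt, input_tokens, pixel_values = helper.preprocess(
        query='What is shown in this picture?',
        image='/path/to/image.jpg',
    )
    output_tokens = run_model(input_tokens, pixel_values)      # hypothetical inference call
    answer = helper.postprocess(output_tokens)
    print(answer)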
sima_utils.transformer.vlm_config.get_model_arch_gen(model_arch: str, model_type: str, text_type: str) tuple[VisionArchType | None, LlmArchType, LlmArchVersion | None]

Derive VLM architecture and version from the model arch and type.

The model_arch will contain "ForConditionalGeneration" for a VLM and "ForCausalLM" for an LLM. The model_type may contain version information.

Parameters:
  • model_arch – The architecture of a model.

  • model_type – The type of a model.

  • text_type – The type of language model.

Returns:

Tuple of vision architecture, LLM architecture and version.
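A usage sketch with HuggingFace-shaped inputs; 'LlavaForConditionalGeneration', 'llava', and 'llama' are illustrative values, and the actual return values depend on the module's mapping tables:

    from sima_utils.transformer.vlm_config import get_model_arch_gen

    vm_arch, lm_arch, lm_gen = get_model_arch_gen(
        model_arch='LlavaForConditionalGeneration',
        model_type='llava',
        text_type='llama',
    )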

sima_utils.transformer.vlm_config.get_model_version_from_intermediate_size(arch: LlmArchType, cfg: dict) LlmArchVersion

Get the version of a model by matching intermediate_size.

Parameters:
  • arch – The architecture type of a model.

  • cfg – The configuration dictionary of the model.

Returns:

The version of the model in its architecture family.

sima_utils.transformer.vlm_config.get_model_size_from_hidden_parameters(arch: LlmArchType, gen: LlmArchVersion, cfg: dict) str

Get the size of a model by matching hidden_size and number of layers.

Parameters:
  • arch – The architecture type of a model.

  • gen – The version of the model within the architecture family.

  • cfg – The configuration dictionary of the model.

Returns:

The size of the model as string in billion-designation.

Raises:

ValueError – If no matching hidden_size and number of layers are found.

sima_utils.transformer.vlm_config.llm_parameter_count(arch: LlmArchType, gen: LlmArchVersion, size_b: int, cfg: dict) int

Calculate parameter count of an LLM model.

Parameters:
  • arch (LlmArchType) – The architecture type of a model.

  • gen (LlmArchVersion) – The version of the model within the architecture family.

  • size_b (int) – The billion-size designation of the model.

  • cfg (dict) – The configuration dictionary of the model.

Returns:

The size of the model in terms of the parameter count.
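For intuition, a generic decoder-only parameter count with Llama-2-7B-like dimensions; this is a rough estimate that ignores norms and biases, not necessarily the module's exact formula:

    vocab_size, hidden, inter, layers = 32000, 4096, 11008, 32
    n_heads = n_kv_heads = 32
    head_dim = hidden // n_heads

    embed = vocab_size * hidden                                   # input embedding table
    attn = layers * (hidden * n_heads * head_dim                  # Q projection
                     + 2 * hidden * n_kv_heads * head_dim         # K and V projections
                     + n_heads * head_dim * hidden)               # output projection
    mlp = layers * 3 * hidden * inter                             # gate, up, and down projections
    total = embed + attn + mlp + embed                            # second embed term: untied LM head
    print(f'{total / 1e9:.2f} B parameters')                      # about 6.7 B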

sima_utils.transformer.vlm_config.apply_mla_constraint(vlm_cfg: VlmConfig) None

Apply MLA constraints.

TODO: modify RoPE if context_length (max_position_embeddings) is changed.

Parameters:

vlm_cfg (VlmConfig) – The configuration of VLM model.

Returns:

None. Change is made in place.

sima_utils.transformer.vlm_config.input_path