sima_utils.transformer.vlm_config

Attributes

MLA_CONSTRAINTS

SPECIAL_TOKENS

input_path

Classes

ExtendedEnum

Add class methods to Enum.

VlmArchType

VLM architecture type.

VisionArchType

Vision architecture type.

LlmArchType

LLM architecture type.

LlmArchVersion

LLM architecture version.

LlmDataType

LLM data type.

GgufDataType

LLM data type format as defined by GGUF.

BaseConfig

Base configuration with an update method.

VisionModelConfig

Configuration of Vision Encoder.

MMConnectionConfig

Configuration of Multi-Modal Connection.

TokenEmbedConfig

Configuration of tokenizer and embedding.

RopeScalingConfig

Configuration of RoPE Scaling.

RoPEConfig

Configuration of Rotary Position Embedding.

AttentionBlockConfig

Configuration of attention block.

MlpBlockConfig

Configuration of MLP block.

LanguageModelConfig

Configuration of LLM.

PipelineConfig

Configuration of VLM pipeline.

VlmConfig

Configuration of Vision Language Model.

VlmHelper

VLM helper class with processors.

Functions

get_model_arch_gen(→ tuple[VisionArchType | None, ...)

Derive VLM architecture and version from the model arch and type.

get_model_version_from_intermediate_size(→ LlmArchVersion)

Get the version of a model by matching intermediate_size.

get_model_size_from_hidden_parameters(→ str)

Get the size of a model by matching hidden_size and number of layers.

llm_parameter_count(→ int)

Calculate parameter count of an LLM model.

apply_mla_constraint(→ None)

Apply MLA constraints.

Module Contents

class sima_utils.transformer.vlm_config.ExtendedEnum

Add class methods to Enum.

classmethod values()
classmethod names()
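The listing above only names the two class methods; the following is a minimal sketch of how such an Enum mixin is commonly written, assuming values() and names() return lists of member values and member names (hypothetical names ExtendedEnumSketch and VisionArchSketch, not the module's own code):

    from enum import Enum

    class ExtendedEnumSketch(Enum):
        # Mixin-style Enum base: subclasses gain list accessors for values and names.
        @classmethod
        def values(cls) -> list:
            return [member.value for member in cls]

        @classmethod
        def names(cls) -> list:
            return [member.name for member in cls]

    class VisionArchSketch(ExtendedEnumSketch):
        CLIP = 'clip'
        SIGLIP = 'siglip'

    assert VisionArchSketch.values() == ['clip', 'siglip']
    assert VisionArchSketch.names() == ['CLIP', 'SIGLIP']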
class sima_utils.transformer.vlm_config.VlmArchType

VLM architecture type.

VLM_LLAVA = 'vlm-llava'
VLM_PALIGEMMA = 'vlm-paligemma'
VLM_GEMMA3 = 'vlm-gemma3'
VLM_CUSTOM = 'vlm-custom'
LLM_LLAMA2 = 'llm-llama2'
LLM_LLAMA3_1 = 'llm-llama3.1'
LLM_LLAMA3_2 = 'llm-llama3.2'
LLM_GEMMA1 = 'llm-gemma1'
LLM_GEMMA2 = 'llm-gemma2'
LLM_GEMMA3 = 'llm-gemma3'
LLM_PHI3_5 = 'llm-phi3.5'
class sima_utils.transformer.vlm_config.VisionArchType

Vision architecture type.

CLIP = 'clip'
SIGLIP = 'siglip'
class sima_utils.transformer.vlm_config.LlmArchType

LLM architecture type.

LLAMA = 'llama'
GEMMA = 'gemma'
PHI = 'phi'
class sima_utils.transformer.vlm_config.LlmArchVersion

LLM architecture version.

GEN_1 = '1'
GEN_2 = '2'
GEN_3 = '3'
GEN_3_1 = '3.1'
GEN_3_2 = '3.2'
GEN_3_5 = '3.5'
class sima_utils.transformer.vlm_config.LlmDataType

LLM data type.

F32 = 'float32'
F16 = 'float16'
BF16 = 'bfloat16'
class sima_utils.transformer.vlm_config.GgufDataType

LLM data type format as defined by GGUF.

F32 = 0
F16 = 1
sima_utils.transformer.vlm_config.MLA_CONSTRAINTS: dict
class sima_utils.transformer.vlm_config.BaseConfig

Base configuration with an update method.

update(cfg)
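The semantics of update are not spelled out above; a plausible sketch, assuming it copies matching keys from a dict onto dataclass fields and ignores unknown keys (hypothetical names, not the module source):

    import dataclasses

    @dataclasses.dataclass
    class BaseConfigSketch:
        def update(self, cfg: dict) -> None:
            # Copy only keys that correspond to declared fields; ignore the rest.
            for field in dataclasses.fields(self):
                if field.name in cfg:
                    setattr(self, field.name, cfg[field.name])

    @dataclasses.dataclass
    class TinyConfig(BaseConfigSketch):
        hidden_size: int = 0
        num_hidden_layers: int = 0

    cfg = TinyConfig()
    cfg.update({'hidden_size': 4096, 'unknown_key': 1})
    assert cfg.hidden_size == 4096 and cfg.num_hidden_layers == 0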
class sima_utils.transformer.vlm_config.VisionModelConfig

Configuration of Vision Encoder.

model_type

The type of the model.

image_size

The resolution of input images.

patch_size

The patch size to divide an image.

hidden_size

The dimension of embedding.

intermediate_size

The dimension of MLP layer.

num_attention_heads

The number of attention heads.

num_hidden_layers

The number of transformer blocks.

hidden_act

The type of activation in MLP.

layer_norm_eps

The small value to prevent division by zero.

model_type: str = ''
arch: VisionArchType
image_size: int = 0
patch_size: int = 0
cls_embed: bool = False
hidden_size: int = 0
intermediate_size: int = 0
num_attention_heads: int = 0
num_hidden_layers: int = 0
hidden_act: str = 'gelu_pytorch_tanh'
layer_norm_eps: float = 1e-06
set_default_config(arch: VisionArchType)
property num_patches: int
property seq_len: int
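The two properties are not defined above; under the usual ViT convention they would be derived as below. The CLIP ViT-L/14-336 numbers are illustrative and assumed rather than taken from the module:

    image_size = 336
    patch_size = 14
    cls_embed = True                                   # CLIP prepends a [CLS] token; SigLIP does not

    num_patches = (image_size // patch_size) ** 2      # 24 * 24 = 576 patch tokens
    seq_len = num_patches + (1 if cls_embed else 0)    # 577 with the [CLS] token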
class sima_utils.transformer.vlm_config.MMConnectionConfig

Configuration of Multi-Modal Connection.

The MM connection consists of 1 or 2 linear layers.

num_layers

The number of linear layers in the connection.

hidden_act

The type of activation if num_layers is 2.

mm_tokens_per_image

The number of tokens projected for each image. If mm_tokens_per_image is less than num_patches, AvgPool is inserted.

proj_dim

The output dimension of the projection (typically the LLM hidden size).

num_layers: int = 2
hidden_act: str = 'gelu'
mm_tokens_per_image: int = 0
proj_dim: int = 0
set_default_config(vm_arch: VisionArchType, num_patches: int, lm_hidden_size: int)
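To illustrate the pooling note on mm_tokens_per_image, the sketch below works out the average-pooling window for a 576-patch encoder projected to 144 image tokens; the numbers and the square-grid assumption are illustrative, not module defaults:

    num_patches = 576                         # 24 x 24 grid from the vision encoder
    mm_tokens_per_image = 144                 # 12 x 12 after pooling

    grid = int(num_patches ** 0.5)            # 24
    target = int(mm_tokens_per_image ** 0.5)  # 12
    pool_window = grid // target              # 2 x 2 average-pooling window
    assert (grid // pool_window) ** 2 == mm_tokens_per_image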
sima_utils.transformer.vlm_config.SPECIAL_TOKENS = ['ignore_index', 'image_token_index', 'boi_token_index', 'eoi_token_index', 'bos_token_id',...
class sima_utils.transformer.vlm_config.TokenEmbedConfig

Configuration of tokenizer and embedding.

tokenizer_type

The type of tokenizer.

tokenizer_path

The path of the tokenizer model.

vocab_size

The size of vocabulary of the tokenizer.

special_tokens

The dict of special tokens.

tokenizer_type: str = ''
tokenizer_path: str = ''
vocab_size: int = 0
special_tokens: dict
add_special_token(name, value)
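A usage sketch for add_special_token, assuming TokenEmbedConfig can be constructed with its defaults; the tokenizer type, vocabulary size, and token id are illustrative values, and 'image_token_index' is one of the names listed in SPECIAL_TOKENS above:

    from sima_utils.transformer.vlm_config import TokenEmbedConfig

    token_cfg = TokenEmbedConfig()
    token_cfg.tokenizer_type = 'sentencepiece'         # illustrative tokenizer type
    token_cfg.vocab_size = 32000                       # illustrative vocabulary size
    token_cfg.add_special_token('image_token_index', 32000)
    print(token_cfg.special_tokens)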
class sima_utils.transformer.vlm_config.RopeScalingConfig

Configuration of RoPE Scaling.

factor

The scaling factor.

low_freq_factor

The low frequency factor (llama3).

high_freq_factor

The high frequency factor (llama3).

original_max_position_embeddings

The original context length used in model training with the given RoPE settings.

long_factor

List of scaling factors for long context (longrope).

short_factor

List of scaling factors for short context (longrope).

rope_type

The type of RoPE scaling method. Supported types are "linear" or "default", "llama3", and "longrope".

factor: float = 1.0
low_freq_factor: float = 0
high_freq_factor: float = 0
original_max_position_embeddings: int = 0
long_factor: list[float] | None = None
short_factor: list[float] | None = None
rope_type: str = 'default'
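For reference, a llama3-style scaling block mapped onto these fields; the numbers mirror what Llama 3.1 checkpoints commonly ship in config.json and are an example, not defaults of this class:

    rope_scaling_example = {
        'rope_type': 'llama3',
        'factor': 8.0,
        'low_freq_factor': 1.0,
        'high_freq_factor': 4.0,
        'original_max_position_embeddings': 8192,
    }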
class sima_utils.transformer.vlm_config.RoPEConfig

Configuration of Rotary Position Embedding.

rope_theta

The theta for RoPE.

rope_local_base_freq

The local base frequency.

rope_scaling

The settings for RoPE scaling.

rope_theta: float = 10000
rope_local_base_freq: float = 10000
rope_scaling: RopeScalingConfig
class sima_utils.transformer.vlm_config.AttentionBlockConfig

Configuration of attention block.

num_attention_heads

The number of attention heads.

num_key_value_heads

The number of key/value heads.

head_dim

The dimension of query, key, and value heads.

swa_enable

The flag to turn on sliding window attention.

swa_ratio

The ratio of SWA over global attention.

sliding_window

The size of sliding window for SWA.

attention_bias

Reserved for future use.

attention_dropout

Reserved for future use.

query_pre_attn_scalar

Reserved for future use.

num_attention_heads: int = 0
num_key_value_heads: int = 0
head_dim: int = 0
swa_enable: bool = False
swa_ratio: int = 0
sliding_window: int = 0
attention_bias: bool = False
attention_dropout: float = 0.0
query_pre_attn_scalar: int = 0
property q_size: int
property kv_size: int
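The q_size and kv_size properties are not defined above; under the conventional grouped-query-attention layout they would be derived as below (illustrative Llama-style numbers, assumed rather than taken from the module):

    num_attention_heads = 32
    num_key_value_heads = 8                     # grouped-query attention: 4 query heads per KV head
    head_dim = 128

    q_size = num_attention_heads * head_dim     # 4096: width of the query projection
    kv_size = num_key_value_heads * head_dim    # 1024: width of each key/value projection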
class sima_utils.transformer.vlm_config.MlpBlockConfig

Configuration of MLP block.

intermediate_size

The dimension of MLP layer.

act

The type of activation.

num_layers

The number of layers in MLP.

mlp_bias

Reserved for future use.

intermediate_size: int = 0
act: str = 'silu'
num_layers: int = 3
mlp_bias: bool = False
class sima_utils.transformer.vlm_config.LanguageModelConfig

Configuration of LLM.

model_type

The type of LLM.

data_type

The data type.

arch

The architecture of the LLM.

gen

The generation version of the LLM.

size

The size designation of the LLM in billions of parameters (e.g., '7b').

token_cfg

The settings of tokenizer.

rope_cfg

The settings of RoPE.

attn_cfg

The settings of attention block.

mlp_cfg

The settings of MLP block.

hidden_size

The dimension of the embedding.

num_hidden_layers

The number of transformer blocks.

max_position_embeddings

The context length.

rms_norm_eps

The small value in RMS norm to prevent division by zero.

layer_norms

Types of RMS norm used in a transformer block.

attn_logit_softcapping

Gemma 2 attention logit soft capping.

final_logit_softcapping

Gemma 2 final logit soft capping.

model_type: str = ''
data_type: LlmDataType
arch: LlmArchType
gen: LlmArchVersion
size: str = '7b'
token_cfg: TokenEmbedConfig
rope_cfg: RoPEConfig
attn_cfg: AttentionBlockConfig
mlp_cfg: MlpBlockConfig
hidden_size: int = 0
num_hidden_layers: int = 0
max_position_embeddings: int = 0
rms_norm_eps: float = 1e-05
layer_norms: list[str] = []
attn_logit_softcapping: float | None = None
final_logit_softcapping: float | None = None
lm_head_num_splits: int = 1
lm_head_split_dim: int = 0
static load(cfg: dict) LanguageModelConfig
update(cfg: dict)
class sima_utils.transformer.vlm_config.PipelineConfig

Configuration of VLM pipeline.

system_prompt

The system prompt.

max_num_tokens

The max number of tokens including the input and generated tokens.

input_token_group_size

The group size of input tokens.

input_token_group_offsets

The group offsets of input tokens.

system_prompt: str | None = None
max_num_tokens: int = 1024
input_token_group_size: int = 1
input_token_group_offsets: list[int] | None = None
future_token_mask_size: int = 1
set_system_prompt(prompt: str | None)
set_max_num_tokens(max_num_tokens: int)
set_group_size(size: int)
set_group_offsets(offsets: list[int])
set_future_token_mask_size(mask_size: int)
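A usage sketch for the setters, assuming PipelineConfig can be constructed with its defaults; the prompt and limits below are illustrative values:

    from sima_utils.transformer.vlm_config import PipelineConfig

    pipeline_cfg = PipelineConfig()
    pipeline_cfg.set_system_prompt('You are a helpful assistant.')
    pipeline_cfg.set_max_num_tokens(2048)
    pipeline_cfg.set_group_size(16)
    pipeline_cfg.set_future_token_mask_size(4)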
class sima_utils.transformer.vlm_config.VlmConfig

Configuration of Vision Language Model.

model_name

The name of the model.

Type:

str

model_type

The type of the model.

Type:

str

vm_cfg

The settings of vision model.

Type:

VisionModelConfig | None

mm_cfg

The settings of multi-modal connection.

Type:

MMConnectionConfig | None

lm_cfg

The settings of language model.

Type:

LanguageModelConfig

pipeline_cfg

The settings of application pipeline.

Type:

PipelineConfig

model_name: str = ''
model_type: VlmArchType | None = None
vm_cfg: VisionModelConfig | None = None
mm_cfg: MMConnectionConfig | None = None
lm_cfg: LanguageModelConfig
pipeline_cfg: PipelineConfig
static load(vlm_cfg: dict) VlmConfig
set_default_config(dtype: LlmDataType, vm_arch: VisionArchType | None, lm_arch: LlmArchType, gen: LlmArchVersion, b_size: str)
set_tokenizer_path(tokenizer_path: pathlib.Path)
static from_hf_config(model_path: pathlib.Path, model_cfg: dict) VlmConfig
Generate SiMa's configuration for a VLM from a HuggingFace config dict and MLA constraints.

Parameters:
  • model_path – The path of the source model.

  • model_cfg – The config dict of the source model.

Returns:

VlmConfig for the model.

property is_multimodal
update_special_tokens(cfg: dict)
update_vision_model_params(cfg: dict)
update_mm_connection_params(cfg: dict)
update_language_model_params(cfg: dict)
config_pipeline(system_prompt: str | None, max_num_tokens: int, tokenizer: sima_utils.transformer.llm_tokenizer.LlmTokenizer, estimated_max_num_query_tokens: int = 100)
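A usage sketch for from_hf_config, per the signature documented above; the checkpoint path is a placeholder, and config.json is assumed to be the HuggingFace config of the downloaded model:

    import json
    import pathlib

    from sima_utils.transformer.vlm_config import VlmConfig

    model_path = pathlib.Path('/path/to/hf/model')             # placeholder checkpoint directory
    model_cfg = json.loads((model_path / 'config.json').read_text())
    vlm_cfg = VlmConfig.from_hf_config(model_path, model_cfg)
    print(vlm_cfg.model_name, vlm_cfg.is_multimodal)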
class sima_utils.transformer.vlm_config.VlmHelper(vlm_cfg: VlmConfig, system_prompt: str | None = None)

VLM helper class with processors.

tokenizer: sima_utils.transformer.llm_tokenizer.LlmTokenizer
prompt_formatter: sima_utils.transformer.prompt_template.PromptFormatter
image_preprocessor: sima_utils.transformer.vision_preprocessor.ImageProcessor | None
preprocess(query: str, image: pathlib.Path | str | numpy.ndarray | None) tuple[str, numpy.ndarray, numpy.ndarray | None]

Preprocess the input query and the image.

Parameters:
  • query – Input query string.

  • image – Path to the image or the loaded image in numpy array. Set to None if no image.

Returns:

Tuple of formatted prompt, tokenized input query and preprocessed image.

postprocess(output_tokens: numpy.ndarray | list[int]) str
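A usage sketch tying the helper's preprocess and postprocess around an inference step; vlm_cfg is assumed to come from VlmConfig.from_hf_config (see above), the query and image path are placeholders, and run_model is a hypothetical stand-in for whatever executes the VLM:

    from sima_utils.transformer.vlm_config import VlmHelper

    helper = VlmHelper(vlm_cfg, system_prompt='Describe the image.')
    prompt, input_tokens, pixel_values = helper.preprocess(
        query='What is shown in this picture?',
        image='/path/to/image.jpg',
    )
    output_tokens = run_model(input_tokens, pixel_values)      # hypothetical inference call
    answer = helper.postprocess(output_tokens)
    print(answer)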
sima_utils.transformer.vlm_config.get_model_arch_gen(model_arch: str, model_type: str, text_type: str) tuple[VisionArchType | None, LlmArchType, LlmArchVersion | None]

Derive VLM architecture and version from the model arch and type.

The model_arch will contain "ForConditionalGeneration" for a VLM and "ForCausalLM" for an LLM. The model_type may contain version information.

Parameters:
  • model_arch – The architecture of a model.

  • model_type – The type of a model.

  • text_type – The type of language model.

Returns:

Tuple of vision architecture, LLM architecture and version.
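A usage sketch with HuggingFace-shaped inputs; 'LlavaForConditionalGeneration', 'llava', and 'llama' are illustrative values, and the actual return values depend on the module's mapping tables:

    from sima_utils.transformer.vlm_config import get_model_arch_gen

    vm_arch, lm_arch, lm_gen = get_model_arch_gen(
        model_arch='LlavaForConditionalGeneration',
        model_type='llava',
        text_type='llama',
    )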

sima_utils.transformer.vlm_config.get_model_version_from_intermediate_size(arch: LlmArchType, cfg: dict) LlmArchVersion

Get the version of a model by matching intermediate_size.

Parameters:
  • arch – The architecture type of a model.

  • cfg – The configuration dictionary of the model.

Returns:

The version of the model in its architecture family.

sima_utils.transformer.vlm_config.get_model_size_from_hidden_parameters(arch: LlmArchType, gen: LlmArchVersion, cfg: dict) str

Get the size of a model by matching hidden_size and number of layers.

Parameters:
  • arch – The architecture type of a model.

  • gen – The version of the model within the architecture family.

  • cfg – The configuration dictionary of the model.

Returns:

The size of the model as string in billion-designation.

Raises:

ValueError – If no matching hidden_size and number of layers are found.

sima_utils.transformer.vlm_config.llm_parameter_count(arch: LlmArchType, gen: LlmArchVersion, size_b: int, cfg: dict) int

Calculate parameter count of an LLM model.

Parameters:
  • arch (LlmArchType) – The architecture type of a model.

  • gen (LlmArchVersion) – The version of the model within the architecture family.

  • size_b (int) – The billion-size designation of the model.

  • cfg (dict) – The configuration dictionary of the model.

Returns:

The size of the model in terms of the parameter count.
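For intuition, a generic decoder-only parameter count with Llama-2-7B-like dimensions; this is a rough estimate that ignores norms and biases, not necessarily the module's exact formula:

    vocab_size, hidden, inter, layers = 32000, 4096, 11008, 32
    n_heads = n_kv_heads = 32
    head_dim = hidden // n_heads

    embed = vocab_size * hidden                                   # input embedding table
    attn = layers * (hidden * n_heads * head_dim                  # Q projection
                     + 2 * hidden * n_kv_heads * head_dim         # K and V projections
                     + n_heads * head_dim * hidden)               # output projection
    mlp = layers * 3 * hidden * inter                             # gate, up, and down projections
    total = embed + attn + mlp + embed                            # second embed term: untied LM head
    print(f'{total / 1e9:.2f} B parameters')                      # about 6.7 B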

sima_utils.transformer.vlm_config.apply_mla_constraint(vlm_cfg: VlmConfig) None

Apply MLA constraints.

TODO: modify RoPE if context_length (max_position_embeddings) is changed.

Parameters:

vlm_cfg (VlmConfig) – The configuration of VLM model.

Returns:

None. Change is made in place.

sima_utils.transformer.vlm_config.input_path