sima_utils.transformer.vlm_config
Attributes
- MLA_CONSTRAINTS
- SPECIAL_TOKENS
- input_path
Classes
- ExtendedEnum – Add class methods to Enum.
- VlmArchType – VLM architecture type.
- VisionArchType – Vision architecture type.
- LlmArchType – LLM architecture type.
- LlmArchVersion – LLM architecture version.
- LlmDataType – LLM data type.
- GgufDataType – LLM data type format as defined by GGUF.
- BaseConfig – Base configuration with an update method.
- VisionModelConfig – Configuration of the vision encoder.
- MMConnectionConfig – Configuration of the multi-modal connection.
- TokenEmbedConfig – Configuration of the tokenizer and embedding.
- RopeScalingConfig – Configuration of RoPE scaling.
- RoPEConfig – Configuration of rotary position embedding (RoPE).
- AttentionBlockConfig – Configuration of the attention block.
- MlpBlockConfig – Configuration of the MLP block.
- LanguageModelConfig – Configuration of the LLM.
- PipelineConfig – Configuration of the VLM pipeline.
- VlmConfig – Configuration of a vision language model.
- VlmHelper – VLM helper class with processors.
Functions
- get_model_arch_gen – Derive the VLM architecture and version from the model arch and type.
- get_model_version_from_intermediate_size – Get the version of a model by matching intermediate_size.
- Get the size of a model by matching hidden_size and number of layers.
- llm_parameter_count – Calculate the parameter count of an LLM model.
- apply_mla_constraint – Apply MLA constraints.
Module Contents
- class sima_utils.transformer.vlm_config.ExtendedEnum
Add class methods to Enum.
- classmethod values()
- classmethod names()
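A hedged usage sketch, assuming the enums below (e.g. VlmArchType) subclass ExtendedEnum so that names() and values() enumerate the members:

    from sima_utils.transformer.vlm_config import VlmArchType

    # Assumption: VlmArchType derives from ExtendedEnum, so these classmethods
    # return the member names and values as lists.
    print(VlmArchType.names())   # e.g. ['VLM_LLAVA', 'VLM_PALIGEMMA', ...]
    print(VlmArchType.values())  # e.g. ['vlm-llava', 'vlm-paligemma', ...]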
- class sima_utils.transformer.vlm_config.VlmArchType
VLM architecture type.
- VLM_LLAVA = 'vlm-llava'
- VLM_PALIGEMMA = 'vlm-paligemma'
- VLM_GEMMA3 = 'vlm-gemma3'
- VLM_CUSTOM = 'vlm-custom'
- LLM_LLAMA2 = 'llm-llama2'
- LLM_LLAMA3_1 = 'llm-llama3.1'
- LLM_LLAMA3_2 = 'llm-llama3.2'
- LLM_GEMMA1 = 'llm-gemma1'
- LLM_GEMMA2 = 'llm-gemma2'
- LLM_GEMMA3 = 'llm-gemma3'
- LLM_PHI3_5 = 'llm-phi3.5'
- class sima_utils.transformer.vlm_config.VisionArchType
Vision architecture type.
- CLIP = 'clip'
- SIGLIP = 'siglip'
- class sima_utils.transformer.vlm_config.LlmArchType
LLM architecture type.
- LLAMA = 'llama'
- GEMMA = 'gemma'
- PHI = 'phi'
- class sima_utils.transformer.vlm_config.LlmArchVersion
LLM architecture version.
- GEN_1 = '1'
- GEN_2 = '2'
- GEN_3 = '3'
- GEN_3_1 = '3.1'
- GEN_3_2 = '3.2'
- GEN_3_5 = '3.5'
- class sima_utils.transformer.vlm_config.LlmDataType
LLM data type.
- F32 = 'float32'
- F16 = 'float16'
- BF16 = 'bfloat16'
- class sima_utils.transformer.vlm_config.GgufDataType
LLM data type format as defined by GGUF.
- F32 = 0
- F16 = 1
- sima_utils.transformer.vlm_config.MLA_CONSTRAINTS: dict
- class sima_utils.transformer.vlm_config.BaseConfig
Base configuration with an update method.
- update(cfg)
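A minimal sketch of the update pattern, assuming a dataclass-style config that copies matching keys from a dict and ignores the rest; the actual implementation may differ:

    from dataclasses import dataclass, fields

    @dataclass
    class DemoConfig:
        # Stand-in for a BaseConfig subclass; the field names are hypothetical.
        hidden_size: int = 0
        num_hidden_layers: int = 0

        def update(self, cfg: dict):
            # Copy only keys that correspond to declared fields.
            for f in fields(self):
                if f.name in cfg:
                    setattr(self, f.name, cfg[f.name])

    demo = DemoConfig()
    demo.update({"hidden_size": 4096, "num_hidden_layers": 32, "ignored": 1})
    assert demo.hidden_size == 4096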
- class sima_utils.transformer.vlm_config.VisionModelConfig
Configuration of the vision encoder.
- model_type
The type of the model.
- image_size
The resolution of input images.
- patch_size
The patch size used to divide an image.
- hidden_size
The dimension of the embedding.
- intermediate_size
The dimension of the MLP layer.
- num_attention_heads
The number of attention heads.
- num_hidden_layers
The number of transformer blocks.
- hidden_act
The type of activation in the MLP.
- layer_norm_eps
The small value to prevent division by zero.
- model_type: str = ''
- arch: VisionArchType
- image_size: int = 0
- patch_size: int = 0
- cls_embed: bool = False
- hidden_size: int = 0
- intermediate_size: int = 0
- num_attention_heads: int = 0
- num_hidden_layers: int = 0
- hidden_act: str = 'gelu_pytorch_tanh'
- layer_norm_eps: float = 1e-06
- set_default_config(arch: VisionArchType)
- property num_patches: int
- property seq_len: int
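The two properties presumably follow the standard ViT patch arithmetic; a hedged illustration, with example values rather than library defaults:

    # Standard ViT patch arithmetic; whether seq_len adds a CLS token is
    # governed by cls_embed in this config.
    image_size, patch_size, cls_embed = 224, 14, False
    num_patches = (image_size // patch_size) ** 2      # 16 * 16 = 256
    seq_len = num_patches + (1 if cls_embed else 0)    # 256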
- class sima_utils.transformer.vlm_config.MMConnectionConfig
Configuration of the multi-modal connection.
The MM connection consists of 1 or 2 linear layers.
- num_layers
The number of linear layers in the connection.
- hidden_act
The type of activation if num_layers is 2.
- mm_tokens_per_image
The number of tokens projected for each image. If mm_tokens_per_image is less than num_patches, an AvgPool is inserted (see the sketch after this class entry).
- proj_dim
The dimension of the projected tokens for an image.
- num_layers: int = 2
- hidden_act: str = 'gelu'
- mm_tokens_per_image: int = 0
- proj_dim: int = 0
- set_default_config(vm_arch: VisionArchType, num_patches: int, lm_hidden_size: int)
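A hedged sketch of the token-reduction arithmetic implied above; the numbers are illustrative, not read from the library:

    # If the connection must emit fewer tokens than the vision encoder
    # produces, an average pool over the patch grid bridges the gap.
    num_patches = 4096          # e.g. a 64 x 64 patch grid
    mm_tokens_per_image = 256   # desired token count after the connection
    pool_ratio = int((num_patches / mm_tokens_per_image) ** 0.5)
    print(pool_ratio)           # 4, i.e. a 4x4 AvgPool window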
- sima_utils.transformer.vlm_config.SPECIAL_TOKENS = ['ignore_index', 'image_token_index', 'boi_token_index', 'eoi_token_index', 'bos_token_id',...
- class sima_utils.transformer.vlm_config.TokenEmbedConfig
Configuration of the tokenizer and embedding.
- tokenizer_type
The type of the tokenizer.
- tokenizer_path
The path of the tokenizer model.
- vocab_size
The size of the vocabulary of the tokenizer.
- special_tokens
The dict of special tokens.
- tokenizer_type: str = ''
- tokenizer_path: str = ''
- vocab_size: int = 0
- special_tokens: dict
- add_special_token(name, value)
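For example, assuming add_special_token records the name/value pair in the special_tokens dict (the token id here is hypothetical):

    from sima_utils.transformer.vlm_config import TokenEmbedConfig

    token_cfg = TokenEmbedConfig()
    token_cfg.add_special_token("image_token_index", 257152)  # hypothetical id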
- class sima_utils.transformer.vlm_config.RopeScalingConfig
Configuration of RoPE scaling.
- factor
The scaling factor.
- low_freq_factor
The low frequency factor (llama3).
- high_freq_factor
The high frequency factor (llama3).
- original_max_position_embeddings
The original context length used in model training with the given RoPE settings.
- long_factor
List of scaling factors for long context (longrope).
- short_factor
List of scaling factors for short context (longrope).
- rope_type
The type of RoPE scaling method. Supported types are 'linear' or 'default', 'llama3', and 'longrope'.
- factor: float = 1.0
- low_freq_factor: float = 0
- high_freq_factor: float = 0
- original_max_position_embeddings: int = 0
- long_factor: list[float] | None = None
- short_factor: list[float] | None = None
- rope_type: str = 'default'
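As an illustration, a llama3-style scaling block could be populated as follows; the numeric values mirror the rope_scaling entry commonly published in Llama 3.1 HuggingFace configs and are not taken from this library:

    from sima_utils.transformer.vlm_config import RopeScalingConfig

    scaling = RopeScalingConfig()
    scaling.rope_type = "llama3"
    scaling.factor = 8.0                             # overall scaling factor
    scaling.low_freq_factor = 1.0                    # llama3-specific knobs
    scaling.high_freq_factor = 4.0
    scaling.original_max_position_embeddings = 8192  # pre-extension context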
- class sima_utils.transformer.vlm_config.RoPEConfig
Configuration of rotary position embedding (RoPE).
- rope_theta
The theta for RoPE.
- rope_local_base_freq
The local base frequency.
- rope_scaling
The settings for RoPE scaling.
- rope_theta: float = 10000
- rope_local_base_freq: float = 10000
- rope_scaling: RopeScalingConfig
- class sima_utils.transformer.vlm_config.AttentionBlockConfig
Configuration of the attention block.
- num_attention_heads
The number of attention heads.
- num_key_value_heads
The number of key/value heads.
- head_dim
The dimension of the query, key, and value heads.
- swa_enable
The flag to turn on sliding window attention.
- swa_ratio
The ratio of SWA layers to global-attention layers.
- sliding_window
The size of the sliding window for SWA.
- attention_bias
Reserved for future use.
- attention_dropout
Reserved for future use.
- query_pre_attn_scalar
Reserved for future use.
- num_attention_heads: int = 0
- num_key_value_heads: int = 0
- head_dim: int = 0
- swa_enable: bool = False
- swa_ratio: int = 0
- sliding_window: int = 0
- attention_bias: bool = False
- attention_dropout: float = 0.0
- query_pre_attn_scalar: int = 0
- property q_size: int
- property kv_size: int
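The two properties presumably encode the usual multi-head/GQA projection widths; a hedged sketch with illustrative values:

    # Common GQA arithmetic: query width scales with all heads, key/value
    # width with the (smaller) number of key/value heads.
    num_attention_heads, num_key_value_heads, head_dim = 32, 8, 128
    q_size = num_attention_heads * head_dim    # 4096
    kv_size = num_key_value_heads * head_dim   # 1024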
- class sima_utils.transformer.vlm_config.MlpBlockConfig
Configuration of the MLP block.
- intermediate_size
The dimension of the MLP layer.
- act
The type of activation.
- num_layers
The number of layers in the MLP.
- mlp_bias
Reserved for future use.
- intermediate_size: int = 0
- act: str = 'silu'
- num_layers: int = 3
- mlp_bias: bool = False
- class sima_utils.transformer.vlm_config.LanguageModelConfig
Configuration of the LLM.
- model_type
The type of the LLM.
- data_type
The data type.
- arch
The architecture of the LLM.
- gen
The generation version of the LLM.
- size
The model size designation in billions of parameters (e.g. '7b').
- token_cfg
The settings of the tokenizer.
- rope_cfg
The settings of RoPE.
- attn_cfg
The settings of the attention block.
- mlp_cfg
The settings of the MLP block.
- hidden_size
The dimension of the embedding.
- num_hidden_layers
The number of transformer blocks.
- max_position_embeddings
The context length.
- rms_norm_eps
The small value in RMS norm to prevent division by zero.
- layer_norms
Types of RMS norm used in a transformer block.
- attn_logit_softcapping
Gemma 2 attention logit soft capping.
- final_logit_softcapping
Gemma 2 final logit soft capping.
- model_type: str = ''
- data_type: LlmDataType
- arch: LlmArchType
- gen: LlmArchVersion
- size: str = '7b'
- token_cfg: TokenEmbedConfig
- rope_cfg: RoPEConfig
- attn_cfg: AttentionBlockConfig
- mlp_cfg: MlpBlockConfig
- hidden_size: int = 0
- num_hidden_layers: int = 0
- max_position_embeddings: int = 0
- rms_norm_eps: float = 1e-05
- layer_norms: list[str] = []
- attn_logit_softcapping: float | None = None
- final_logit_softcapping: float | None = None
- lm_head_num_splits: int = 1
- lm_head_split_dim: int = 0
- static load(cfg: dict) → LanguageModelConfig
- update(cfg: dict)
- class sima_utils.transformer.vlm_config.PipelineConfig
Configuration of the VLM pipeline.
- system_prompt
The system prompt.
- max_num_tokens
The maximum number of tokens, including both input and generated tokens.
- input_token_group_size
The group size of input tokens.
- input_token_group_offsets
The group offsets of input tokens.
- system_prompt: str | None = None
- max_num_tokens: int = 1024
- input_token_group_size: int = 1
- input_token_group_offsets: list[int] | None = None
- future_token_mask_size: int = 1
- set_system_prompt(prompt: str | None)
- set_max_num_tokens(max_num_tokens: int)
- set_group_size(size: int)
- set_group_offsets(offsets: list[int])
- set_future_token_mask_size(mask_size: int)
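A hedged usage sketch of the setters; the values are illustrative:

    from sima_utils.transformer.vlm_config import PipelineConfig

    pipeline_cfg = PipelineConfig()
    pipeline_cfg.set_system_prompt("You are a helpful assistant.")
    pipeline_cfg.set_max_num_tokens(2048)  # input + generated tokens combined
    pipeline_cfg.set_group_size(4)         # hypothetical input-token group size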
- class sima_utils.transformer.vlm_config.VlmConfig
Configuration of a vision language model.
- model_name
The name of the model.
- Type:
str
- model_type
The type of the model.
- Type:
VlmArchType | None
- vm_cfg
The settings of the vision model.
- Type:
VisionModelConfig | None
- mm_cfg
The settings of the multi-modal connection.
- Type:
MMConnectionConfig | None
- lm_cfg
The settings of the language model.
- Type:
LanguageModelConfig
- pipeline_cfg
The settings of the application pipeline.
- Type:
PipelineConfig
- model_name: str = ''
- model_type: VlmArchType | None = None
- vm_cfg: VisionModelConfig | None = None
- mm_cfg: MMConnectionConfig | None = None
- lm_cfg: LanguageModelConfig
- pipeline_cfg: PipelineConfig
- set_default_config(dtype: LlmDataType, vm_arch: VisionArchType | None, lm_arch: LlmArchType, gen: LlmArchVersion, b_size: str)
- set_tokenizer_path(tokenizer_path: pathlib.Path)
- static from_hf_config(model_path: pathlib.Path, model_cfg: dict) → VlmConfig
Generate SiMa's configuration for a VLM from a HuggingFace config dict and MLA constraints.
- Parameters:
model_path – The path of the source model.
model_cfg – The config dict of the source model.
- Returns:
VlmConfig for the model.
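For example, a hedged sketch of building a VlmConfig from a HuggingFace-style checkpoint directory; the path is hypothetical, and the checkpoint is assumed to carry a config.json:

    import json
    import pathlib

    from sima_utils.transformer.vlm_config import VlmConfig

    model_path = pathlib.Path("/models/llava-1.5-7b")  # hypothetical checkpoint
    model_cfg = json.loads((model_path / "config.json").read_text())
    vlm_cfg = VlmConfig.from_hf_config(model_path, model_cfg)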
- property is_multimodal
- update_special_tokens(cfg: dict)
- update_vision_model_params(cfg: dict)
- update_mm_connection_params(cfg: dict)
- update_language_model_params(cfg: dict)
- config_pipeline(system_prompt: str | None, max_num_tokens: int, tokenizer: sima_utils.transformer.llm_tokenizer.LlmTokenizer, estimated_max_num_query_tokens: int = 100)
- class sima_utils.transformer.vlm_config.VlmHelper(vlm_cfg: VlmConfig, system_prompt: str | None = None)
VLM helper class with processors.
- prompt_formatter: sima_utils.transformer.prompt_template.PromptFormatter
- image_preprocessor: sima_utils.transformer.vision_preprocessor.ImageProcessor | None
- preprocess(query: str, image: pathlib.Path | str | numpy.ndarray | None) → tuple[str, numpy.ndarray, numpy.ndarray | None]
Preprocess the input query and the image.
- Parameters:
query – Input query string.
image – Path to the image, or the loaded image as a numpy array. Set to None if there is no image.
- Returns:
Tuple of the formatted prompt, the tokenized input query, and the preprocessed image.
- postprocess(output_tokens: numpy.ndarray | list[int]) → str
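Putting the helper together, a hedged end-to-end sketch reusing vlm_cfg from the from_hf_config sketch above; the image path is hypothetical, and generation itself is stubbed out:

    import pathlib

    from sima_utils.transformer.vlm_config import VlmHelper

    helper = VlmHelper(vlm_cfg, system_prompt="You are a helpful assistant.")
    prompt, input_tokens, pixels = helper.preprocess(
        "What is in this picture?", image=pathlib.Path("cat.jpg"))
    # ... run the model on input_tokens / pixels to produce output_tokens ...
    # text = helper.postprocess(output_tokens)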
- sima_utils.transformer.vlm_config.get_model_arch_gen(model_arch: str, model_type: str, text_type: str) → tuple[VisionArchType | None, LlmArchType, LlmArchVersion | None]
Derive the VLM architecture and version from the model arch and type.
The model_arch will contain 'ForConditionalGeneration' for a VLM and 'ForCausalLM' for an LM. The model_type may contain version information.
- Parameters:
model_arch – The architecture of a model.
model_type – The type of a model.
text_type – The type of the language model.
- Returns:
Tuple of the vision architecture, LLM architecture, and version.
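An illustrative call; the argument strings follow HuggingFace config conventions, and the exact strings accepted are assumptions:

    from sima_utils.transformer.vlm_config import get_model_arch_gen

    vm_arch, lm_arch, gen = get_model_arch_gen(
        model_arch="LlavaForConditionalGeneration",  # a VLM per the rule above
        model_type="llava",
        text_type="llama")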
- sima_utils.transformer.vlm_config.get_model_version_from_intermediate_size(arch: LlmArchType, cfg: dict) → LlmArchVersion
Get the version of a model by matching intermediate_size.
- Parameters:
arch – The architecture type of a model.
cfg – The configuration dictionary of the model.
- Returns:
The version of the model in its architecture family.
Get the size of a model by matching hidden_size and number of layers.
- Parameters:
arch – The architecture type of a model.
gen – The version of the model within the architecture family.
cfg – The configuration dictionary of the model.
- Returns:
The size of the model as a string in billion-parameter designation.
- Raises:
ValueError – If no matching hidden_size and number of layers are found.
- sima_utils.transformer.vlm_config.llm_parameter_count(arch: LlmArchType, gen: LlmArchVersion, size_b: int, cfg: dict) → int
Calculate the parameter count of an LLM model.
- Parameters:
arch (LlmArchType) – The architecture of the model.
gen (LlmArchVersion) – The version of the model within the architecture family.
size_b (int) – The billion-parameter size designation of the model.
cfg (dict) – The configuration dictionary of the model.
- Returns:
The size of the model in terms of the parameter count.
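A hedged sketch of counting parameters for a hypothetical Llama-2-7B-style config; the exact keys the function reads are not documented here, so this cfg is an assumption based on common HuggingFace config fields:

    from sima_utils.transformer.vlm_config import (
        LlmArchType, LlmArchVersion, llm_parameter_count)

    cfg = {
        "hidden_size": 4096,         # assumed HF-style keys
        "intermediate_size": 11008,
        "num_hidden_layers": 32,
        "num_attention_heads": 32,
        "vocab_size": 32000,
    }
    n_params = llm_parameter_count(LlmArchType.LLAMA, LlmArchVersion.GEN_2, 7, cfg)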
- sima_utils.transformer.vlm_config.apply_mla_constraint(vlm_cfg: VlmConfig) → None
Apply MLA constraints.
TODO: modify RoPE if context_length (max_position_embeddings) is changed.
- Parameters:
vlm_cfg (VlmConfig) – The configuration of the VLM model.
- Returns:
None. Changes are made in place.
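Usage is a single in-place call, e.g. on the vlm_cfg produced by from_hf_config in the sketch above:

    from sima_utils.transformer.vlm_config import apply_mla_constraint

    apply_mla_constraint(vlm_cfg)  # mutates vlm_cfg in place; returns None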
- sima_utils.transformer.vlm_config.input_path