sima_utils.transformer.vlm_config
=================================

.. py:module:: sima_utils.transformer.vlm_config


Attributes
----------

.. autoapisummary::

   sima_utils.transformer.vlm_config.MLA_CONSTRAINTS
   sima_utils.transformer.vlm_config.SPECIAL_TOKENS
   sima_utils.transformer.vlm_config.input_path


Classes
-------

.. autoapisummary::

   sima_utils.transformer.vlm_config.ExtendedEnum
   sima_utils.transformer.vlm_config.VlmArchType
   sima_utils.transformer.vlm_config.VisionArchType
   sima_utils.transformer.vlm_config.LlmArchType
   sima_utils.transformer.vlm_config.LlmArchVersion
   sima_utils.transformer.vlm_config.LlmDataType
   sima_utils.transformer.vlm_config.GgufDataType
   sima_utils.transformer.vlm_config.BaseConfig
   sima_utils.transformer.vlm_config.VisionModelConfig
   sima_utils.transformer.vlm_config.MMConnectionConfig
   sima_utils.transformer.vlm_config.TokenEmbedConfig
   sima_utils.transformer.vlm_config.RopeScalingConfig
   sima_utils.transformer.vlm_config.RoPEConfig
   sima_utils.transformer.vlm_config.AttentionBlockConfig
   sima_utils.transformer.vlm_config.MlpBlockConfig
   sima_utils.transformer.vlm_config.LanguageModelConfig
   sima_utils.transformer.vlm_config.PipelineConfig
   sima_utils.transformer.vlm_config.VlmConfig
   sima_utils.transformer.vlm_config.VlmHelper


Functions
---------

.. autoapisummary::

   sima_utils.transformer.vlm_config.get_model_arch_gen
   sima_utils.transformer.vlm_config.get_model_version_from_intermediate_size
   sima_utils.transformer.vlm_config.get_model_size_from_hidden_parameters
   sima_utils.transformer.vlm_config.llm_parameter_count
   sima_utils.transformer.vlm_config.apply_mla_constraint


Module Contents
---------------

.. py:class:: ExtendedEnum

   Add class methods to Enum.

   .. py:method:: values()
      :classmethod:

   .. py:method:: names()
      :classmethod:


.. py:class:: VlmArchType

   VLM architecture type.

   .. py:attribute:: VLM_LLAVA
      :value: 'vlm-llava'

   .. py:attribute:: VLM_PALIGEMMA
      :value: 'vlm-paligemma'

   .. py:attribute:: VLM_GEMMA3
      :value: 'vlm-gemma3'

   .. py:attribute:: VLM_CUSTOM
      :value: 'vlm-custom'

   .. py:attribute:: LLM_LLAMA2
      :value: 'llm-llama2'

   .. py:attribute:: LLM_LLAMA3_1
      :value: 'llm-llama3.1'

   .. py:attribute:: LLM_LLAMA3_2
      :value: 'llm-llama3.2'

   .. py:attribute:: LLM_GEMMA1
      :value: 'llm-gemma1'

   .. py:attribute:: LLM_GEMMA2
      :value: 'llm-gemma2'

   .. py:attribute:: LLM_GEMMA3
      :value: 'llm-gemma3'

   .. py:attribute:: LLM_PHI3_5
      :value: 'llm-phi3.5'


.. py:class:: VisionArchType

   Vision architecture type.

   .. py:attribute:: CLIP
      :value: 'clip'

   .. py:attribute:: SIGLIP
      :value: 'siglip'


.. py:class:: LlmArchType

   LLM architecture type.

   .. py:attribute:: LLAMA
      :value: 'llama'

   .. py:attribute:: GEMMA
      :value: 'gemma'

   .. py:attribute:: PHI
      :value: 'phi'


.. py:class:: LlmArchVersion

   LLM architecture version.

   .. py:attribute:: GEN_1
      :value: '1'

   .. py:attribute:: GEN_2
      :value: '2'

   .. py:attribute:: GEN_3
      :value: '3'

   .. py:attribute:: GEN_3_1
      :value: '3.1'

   .. py:attribute:: GEN_3_2
      :value: '3.2'

   .. py:attribute:: GEN_3_5
      :value: '3.5'


.. py:class:: LlmDataType

   LLM data type.

   .. py:attribute:: F32
      :value: 'float32'

   .. py:attribute:: F16
      :value: 'float16'

   .. py:attribute:: BF16
      :value: 'bfloat16'


.. py:class:: GgufDataType

   LLM data type format by GGUF.

   .. py:attribute:: F32
      :value: 0

   .. py:attribute:: F16
      :value: 1


.. py:data:: MLA_CONSTRAINTS
   :type: dict


.. py:class:: BaseConfig

   Base configuration with an update method.

   .. py:method:: update(cfg)
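All of the enums in this module derive from ``ExtendedEnum``, so architecture
and data-type strings can be validated before a configuration is built. Below
is a minimal sketch of how ``values()`` and ``names()`` behave for such a
helper (the exact return types are not documented on this page), shown against
a trimmed-down ``VlmArchType``:

.. code-block:: python

   from enum import Enum


   class ExtendedEnum(Enum):
       """Sketch of the documented helpers: list member values and names."""

       @classmethod
       def values(cls):
           # e.g. ['vlm-llava', 'llm-llama3.1'] for the enum below
           return [member.value for member in cls]

       @classmethod
       def names(cls):
           # e.g. ['VLM_LLAVA', 'LLM_LLAMA3_1'] for the enum below
           return [member.name for member in cls]


   class VlmArchType(ExtendedEnum):
       VLM_LLAVA = 'vlm-llava'
       LLM_LLAMA3_1 = 'llm-llama3.1'


   # Validate a user-supplied architecture string before using it.
   arch = 'llm-llama3.1'
   assert arch in VlmArchType.values()
   print(VlmArchType(arch).name)  # -> LLM_LLAMA3_1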
.. py:class:: VisionModelConfig

   Configuration of the vision encoder.

   .. attribute:: model_type

      The type of the model.

   .. attribute:: image_size

      The resolution of input images.

   .. attribute:: patch_size

      The patch size used to divide an image.

   .. attribute:: hidden_size

      The dimension of the embedding.

   .. attribute:: intermediate_size

      The dimension of the MLP layer.

   .. attribute:: num_attention_heads

      The number of attention heads.

   .. attribute:: num_hidden_layers

      The number of transformer blocks.

   .. attribute:: hidden_act

      The type of activation in the MLP.

   .. attribute:: layer_norm_eps

      The small value added to prevent division by zero.

   .. py:attribute:: model_type
      :type: str
      :value: ''

   .. py:attribute:: arch
      :type: VisionArchType

   .. py:attribute:: image_size
      :type: int
      :value: 0

   .. py:attribute:: patch_size
      :type: int
      :value: 0

   .. py:attribute:: cls_embed
      :type: bool
      :value: False

   .. py:attribute:: hidden_size
      :type: int
      :value: 0

   .. py:attribute:: intermediate_size
      :type: int
      :value: 0

   .. py:attribute:: num_attention_heads
      :type: int
      :value: 0

   .. py:attribute:: num_hidden_layers
      :type: int
      :value: 0

   .. py:attribute:: hidden_act
      :type: str
      :value: 'gelu_pytorch_tanh'

   .. py:attribute:: layer_norm_eps
      :type: float
      :value: 1e-06

   .. py:method:: set_default_config(arch: VisionArchType)

   .. py:property:: num_patches
      :type: int

   .. py:property:: seq_len
      :type: int
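The ``num_patches`` and ``seq_len`` properties follow from ``image_size``,
``patch_size``, and ``cls_embed``. The sketch below assumes the standard ViT
layout (a square image tiled into non-overlapping square patches, plus one
class token when ``cls_embed`` is set); ``VisionPatchMath`` is a hypothetical
stand-in, not the actual class:

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class VisionPatchMath:
       """Hypothetical stand-in for the patch-count logic."""

       image_size: int = 384
       patch_size: int = 16
       cls_embed: bool = False

       @property
       def num_patches(self) -> int:
           # A square image is tiled into (image_size / patch_size)^2 patches.
           return (self.image_size // self.patch_size) ** 2

       @property
       def seq_len(self) -> int:
           # CLIP-style encoders prepend a class token; SigLIP-style ones do not.
           return self.num_patches + (1 if self.cls_embed else 0)


   cfg = VisionPatchMath(image_size=384, patch_size=16, cls_embed=False)
   print(cfg.num_patches, cfg.seq_len)  # 576 576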
.. py:class:: MMConnectionConfig

   Configuration of the multi-modal connection.

   The MM connection consists of 1 or 2 linear layers.

   .. attribute:: num_layers

      The number of linear layers in the connection.

   .. attribute:: hidden_act

      The type of activation used when num_layers is 2.

   .. attribute:: mm_tokens_per_image

      The number of tokens projected for each image. If mm_tokens_per_image
      is less than num_patches, an AvgPool layer is inserted.

   .. attribute:: proj_dim

      The projection dimension for image tokens.

   .. py:attribute:: num_layers
      :type: int
      :value: 2

   .. py:attribute:: hidden_act
      :type: str
      :value: 'gelu'

   .. py:attribute:: mm_tokens_per_image
      :type: int
      :value: 0

   .. py:attribute:: proj_dim
      :type: int
      :value: 0

   .. py:method:: set_default_config(vm_arch: VisionArchType, num_patches: int, lm_hidden_size: int)


.. py:data:: SPECIAL_TOKENS
   :value: ['ignore_index', 'image_token_index', 'boi_token_index', 'eoi_token_index', 'bos_token_id',...


.. py:class:: TokenEmbedConfig

   Configuration of the tokenizer and embedding.

   .. attribute:: tokenizer_type

      The type of tokenizer.

   .. attribute:: tokenizer_path

      The path of the tokenizer model.

   .. attribute:: vocab_size

      The vocabulary size of the tokenizer.

   .. attribute:: special_tokens

      The dict of special tokens.

   .. py:attribute:: tokenizer_type
      :type: str
      :value: ''

   .. py:attribute:: tokenizer_path
      :type: str
      :value: ''

   .. py:attribute:: vocab_size
      :type: int
      :value: 0

   .. py:attribute:: special_tokens
      :type: dict

   .. py:method:: add_special_token(name, value)


.. py:class:: RopeScalingConfig

   Configuration of RoPE scaling.

   .. attribute:: factor

      The scaling factor.

   .. attribute:: low_freq_factor

      The low frequency factor (llama3).

   .. attribute:: high_freq_factor

      The high frequency factor (llama3).

   .. attribute:: original_max_position_embeddings

      The original context length used in model training with the given
      RoPE settings.

   .. attribute:: long_factor

      List of scaling factors for long context (longrope).

   .. attribute:: short_factor

      List of scaling factors for short context (longrope).

   .. attribute:: rope_type

      The type of RoPE scaling method. Supported types are "linear" or
      "default", "llama3", and "longrope".

   .. py:attribute:: factor
      :type: float
      :value: 1.0

   .. py:attribute:: low_freq_factor
      :type: float
      :value: 0

   .. py:attribute:: high_freq_factor
      :type: float
      :value: 0

   .. py:attribute:: original_max_position_embeddings
      :type: int
      :value: 0

   .. py:attribute:: long_factor
      :type: list[float] | None
      :value: None

   .. py:attribute:: short_factor
      :type: list[float] | None
      :value: None

   .. py:attribute:: rope_type
      :type: str
      :value: 'default'


.. py:class:: RoPEConfig

   Configuration of Rotary Position Embedding.

   .. attribute:: rope_theta

      The theta for RoPE.

   .. attribute:: rope_local_base_freq

      The local base frequency.

   .. attribute:: rope_scaling

      The settings for RoPE scaling.

   .. py:attribute:: rope_theta
      :type: float
      :value: 10000

   .. py:attribute:: rope_local_base_freq
      :type: float
      :value: 10000

   .. py:attribute:: rope_scaling
      :type: RopeScalingConfig


.. py:class:: AttentionBlockConfig

   Configuration of the attention block.

   .. attribute:: num_attention_heads

      The number of attention heads.

   .. attribute:: num_key_value_heads

      The number of key/value heads.

   .. attribute:: head_dim

      The dimension of query, key, and value heads.

   .. attribute:: swa_enable

      The flag to turn on sliding window attention.

   .. attribute:: swa_ratio

      The ratio of SWA layers to global attention layers.

   .. attribute:: sliding_window

      The size of the sliding window for SWA.

   .. attribute:: attention_bias

      Reserved for future use.

   .. attribute:: attention_dropout

      Reserved for future use.

   .. attribute:: query_pre_attn_scalar

      Reserved for future use.

   .. py:attribute:: num_attention_heads
      :type: int
      :value: 0

   .. py:attribute:: num_key_value_heads
      :type: int
      :value: 0

   .. py:attribute:: head_dim
      :type: int
      :value: 0

   .. py:attribute:: swa_enable
      :type: bool
      :value: False

   .. py:attribute:: swa_ratio
      :type: int
      :value: 0

   .. py:attribute:: sliding_window
      :type: int
      :value: 0

   .. py:attribute:: attention_bias
      :type: bool
      :value: False

   .. py:attribute:: attention_dropout
      :type: float
      :value: 0.0

   .. py:attribute:: query_pre_attn_scalar
      :type: int
      :value: 0

   .. py:property:: q_size
      :type: int

   .. py:property:: kv_size
      :type: int


.. py:class:: MlpBlockConfig

   Configuration of the MLP block.

   .. attribute:: intermediate_size

      The dimension of the MLP layer.

   .. attribute:: act

      The type of activation.

   .. attribute:: num_layers

      The number of layers in the MLP.

   .. attribute:: mlp_bias

      Reserved for future use.

   .. py:attribute:: intermediate_size
      :type: int
      :value: 0

   .. py:attribute:: act
      :type: str
      :value: 'silu'

   .. py:attribute:: num_layers
      :type: int
      :value: 3

   .. py:attribute:: mlp_bias
      :type: bool
      :value: False


.. py:class:: LanguageModelConfig

   Configuration of the LLM.

   .. attribute:: model_type

      The type of the LLM.

   .. attribute:: data_type

      The data type.

   .. attribute:: arch

      The architecture of the LLM.

   .. attribute:: gen

      The generation version of the LLM.

   .. attribute:: size

      The billion-parameter size designation of the LLM (e.g. '7b').

   .. attribute:: token_cfg

      The settings of the tokenizer.

   .. attribute:: rope_cfg

      The settings of RoPE.

   .. attribute:: attn_cfg

      The settings of the attention block.

   .. attribute:: mlp_cfg

      The settings of the MLP block.

   .. attribute:: hidden_size

      The dimension of the embedding.

   .. attribute:: num_hidden_layers

      The number of transformer blocks.

   .. attribute:: max_position_embeddings

      The context length.

   .. attribute:: rms_norm_eps

      The small value in RMS norm to prevent division by zero.

   .. attribute:: layer_norms

      Types of RMS norm used in a transformer block.

   .. attribute:: attn_logit_softcapping

      Gemma 2 attention logit soft capping.

   .. attribute:: final_logit_softcapping

      Gemma 2 final logit soft capping.

   .. py:attribute:: model_type
      :type: str
      :value: ''

   .. py:attribute:: data_type
      :type: LlmDataType

   .. py:attribute:: arch
      :type: LlmArchType

   .. py:attribute:: gen
      :type: LlmArchVersion

   .. py:attribute:: size
      :type: str
      :value: '7b'

   .. py:attribute:: token_cfg
      :type: TokenEmbedConfig

   .. py:attribute:: rope_cfg
      :type: RoPEConfig

   .. py:attribute:: attn_cfg
      :type: AttentionBlockConfig

   .. py:attribute:: mlp_cfg
      :type: MlpBlockConfig

   .. py:attribute:: hidden_size
      :type: int
      :value: 0

   .. py:attribute:: num_hidden_layers
      :type: int
      :value: 0

   .. py:attribute:: max_position_embeddings
      :type: int
      :value: 0

   .. py:attribute:: rms_norm_eps
      :type: float
      :value: 1e-05

   .. py:attribute:: layer_norms
      :type: list[str]
      :value: []

   .. py:attribute:: attn_logit_softcapping
      :type: float | None
      :value: None

   .. py:attribute:: final_logit_softcapping
      :type: float | None
      :value: None

   .. py:attribute:: lm_head_num_splits
      :type: int
      :value: 1

   .. py:attribute:: lm_head_split_dim
      :type: int
      :value: 0

   .. py:method:: load(cfg: dict) -> LanguageModelConfig
      :staticmethod:

   .. py:method:: update(cfg: dict)
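``LanguageModelConfig.attn_cfg`` above refers to an ``AttentionBlockConfig``,
whose ``q_size`` and ``kv_size`` properties describe the widths of the query
and key/value projections. Under the usual grouped-query-attention layout
these are simply head counts multiplied by ``head_dim``; the arithmetic below
is an illustrative assumption, not the actual implementation, using the
publicly known Llama-3.1-8B geometry:

.. code-block:: python

   # Llama-3.1-8B attention geometry (public reference values).
   num_attention_heads = 32
   num_key_value_heads = 8
   head_dim = 128

   # Assumed definitions of the two properties:
   q_size = num_attention_heads * head_dim   # 4096: width of the Q projection
   kv_size = num_key_value_heads * head_dim  # 1024: width of each of K and V

   print(q_size, kv_size)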
.. py:class:: PipelineConfig

   Configuration of the VLM pipeline.

   .. attribute:: system_prompt

      The system prompt.

   .. attribute:: max_num_tokens

      The maximum number of tokens, including both input and generated tokens.

   .. attribute:: input_token_group_size

      The group size of input tokens.

   .. attribute:: input_token_group_offsets

      The group offsets of input tokens.

   .. py:attribute:: system_prompt
      :type: str | None
      :value: None

   .. py:attribute:: max_num_tokens
      :type: int
      :value: 1024

   .. py:attribute:: input_token_group_size
      :type: int
      :value: 1

   .. py:attribute:: input_token_group_offsets
      :type: list[int] | None
      :value: None

   .. py:attribute:: future_token_mask_size
      :type: int
      :value: 1

   .. py:method:: set_system_prompt(prompt: str | None)

   .. py:method:: set_max_num_tokens(max_num_tokens: int)

   .. py:method:: set_group_size(size: int)

   .. py:method:: set_group_offsets(offsets: list[int])

   .. py:method:: set_future_token_mask_size(mask_size: int)


.. py:class:: VlmConfig

   Configuration of a Vision Language Model.

   .. attribute:: model_name

      The name of the model.

      :type: str

   .. attribute:: model_type

      The type of the model.

      :type: str

   .. attribute:: vm_cfg

      The settings of the vision model.

      :type: VisionModelConfig | None

   .. attribute:: mm_cfg

      The settings of the multi-modal connection.

      :type: MMConnectionConfig | None

   .. attribute:: lm_cfg

      The settings of the language model.

      :type: LanguageModelConfig

   .. attribute:: pipeline_cfg

      The settings of the application pipeline.

      :type: PipelineConfig

   .. py:attribute:: model_name
      :type: str
      :value: ''

   .. py:attribute:: model_type
      :type: VlmArchType | None
      :value: None

   .. py:attribute:: vm_cfg
      :type: VisionModelConfig | None
      :value: None

   .. py:attribute:: mm_cfg
      :type: MMConnectionConfig | None
      :value: None

   .. py:attribute:: lm_cfg
      :type: LanguageModelConfig

   .. py:attribute:: pipeline_cfg
      :type: PipelineConfig

   .. py:method:: load(vlm_cfg: dict) -> VlmConfig
      :staticmethod:

   .. py:method:: set_default_config(dtype: LlmDataType, vm_arch: VisionArchType | None, lm_arch: LlmArchType, gen: LlmArchVersion, b_size: str)

   .. py:method:: set_tokenizer_path(tokenizer_path: pathlib.Path)

   .. py:method:: from_hf_config(model_path: pathlib.Path, model_cfg: dict) -> VlmConfig
      :staticmethod:

      Generate SiMa's configuration for a VLM from a HuggingFace config dict
      and the MLA constraints.

      :param model_path: The path of the source model.
      :param model_cfg: The config dict of the source model.

      :returns: VlmConfig for the model.

   .. py:property:: is_multimodal

   .. py:method:: update_special_tokens(cfg: dict)

   .. py:method:: update_vision_model_params(cfg: dict)

   .. py:method:: update_mm_connection_params(cfg: dict)

   .. py:method:: update_language_model_params(cfg: dict)

   .. py:method:: config_pipeline(system_prompt: str | None, max_num_tokens: int, tokenizer: sima_utils.transformer.llm_tokenizer.LlmTokenizer, estimated_max_num_query_tokens: int = 100)
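A typical entry point is ``VlmConfig.from_hf_config``, which builds the
SiMa-side configuration from a HuggingFace ``config.json``. The sketch below
uses a hypothetical model directory; only ``from_hf_config``,
``is_multimodal``, and the ``pipeline_cfg`` setters documented above are
assumed:

.. code-block:: python

   import json
   import pathlib

   from sima_utils.transformer.vlm_config import VlmConfig

   # Hypothetical model directory containing a HuggingFace config.json.
   model_path = pathlib.Path("/models/llava-1.5-7b-hf")
   model_cfg = json.loads((model_path / "config.json").read_text())

   vlm_cfg = VlmConfig.from_hf_config(model_path, model_cfg)
   print(vlm_cfg.is_multimodal)

   # Adjust the application pipeline via the documented setters.
   vlm_cfg.pipeline_cfg.set_system_prompt("You are a helpful assistant.")
   vlm_cfg.pipeline_cfg.set_max_num_tokens(2048)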
.. py:class:: VlmHelper(vlm_cfg: VlmConfig, system_prompt: str | None = None)

   VLM helper class with processors.

   .. py:attribute:: tokenizer
      :type: sima_utils.transformer.llm_tokenizer.LlmTokenizer

   .. py:attribute:: prompt_formatter
      :type: sima_utils.transformer.prompt_template.PromptFormatter

   .. py:attribute:: image_preprocessor
      :type: sima_utils.transformer.vision_preprocessor.ImageProcessor | None

   .. py:method:: preprocess(query: str, image: pathlib.Path | str | numpy.ndarray | None) -> tuple[str, numpy.ndarray, numpy.ndarray | None]

      Preprocess the input query and the image.

      :param query: Input query string.
      :param image: Path to the image, or the loaded image as a numpy array.
                    Set to None if there is no image.

      :returns: Tuple of the formatted prompt, the tokenized input query,
                and the preprocessed image.

   .. py:method:: postprocess(output_tokens: numpy.ndarray | list[int]) -> str


.. py:function:: get_model_arch_gen(model_arch: str, model_type: str, text_type: str) -> tuple[VisionArchType | None, LlmArchType, LlmArchVersion | None]

   Derive the VLM architecture and version from the model arch and type.

   The model_arch contains "ForConditionalGeneration" for a VLM and
   "ForCausalLM" for an LM. The model_type may contain version information.

   :param model_arch: The architecture of a model.
   :param model_type: The type of a model.
   :param text_type: The type of language model.

   :returns: Tuple of vision architecture, LLM architecture, and version.


.. py:function:: get_model_version_from_intermediate_size(arch: LlmArchType, cfg: dict) -> LlmArchVersion

   Get the version of a model by matching intermediate_size.

   :param arch: The architecture type of a model.
   :param cfg: The configuration dictionary of the model.

   :returns: The version of the model in its architecture family.


.. py:function:: get_model_size_from_hidden_parameters(arch: LlmArchType, gen: LlmArchVersion, cfg: dict) -> str

   Get the size of a model by matching hidden_size and the number of layers.

   :param arch: The architecture type of a model.
   :param gen: The version of the model within the architecture family.
   :param cfg: The configuration dictionary of the model.

   :returns: The size of the model as a string in billion-designation.

   :raises ValueError: If no matching hidden_size and number of layers are found.


.. py:function:: llm_parameter_count(arch: LlmArchType, gen: LlmArchVersion, size_b: int, cfg: dict) -> int

   Calculate the parameter count of an LLM model.

   :param arch: The architecture type of a model.
   :type arch: LlmArchType
   :param gen: The version of the model within the architecture family.
   :type gen: LlmArchVersion
   :param size_b: The billion-size designation of the model.
   :type size_b: int
   :param cfg: The configuration dictionary of the model.
   :type cfg: dict

   :returns: The size of the model in terms of the parameter count.


.. py:function:: apply_mla_constraint(vlm_cfg: VlmConfig) -> None

   Apply the MLA constraints.

   TODO: modify RoPE if context_length (max_position_embeddings) is changed.

   :param vlm_cfg: The configuration of the VLM model.
   :type vlm_cfg: VlmConfig

   :returns: None. The change is made in place.


.. py:data:: input_path
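End to end, ``VlmHelper`` ties the processors together. The sketch below
continues from the hypothetical ``vlm_cfg`` built in the previous example;
the model-execution step is elided because it is outside this module, and the
placeholder token ids are for illustration only:

.. code-block:: python

   import pathlib

   from sima_utils.transformer.vlm_config import VlmHelper

   helper = VlmHelper(vlm_cfg, system_prompt="You are a helpful assistant.")

   prompt, input_tokens, pixel_values = helper.preprocess(
       query="What is in this picture?",
       image=pathlib.Path("/data/cat.jpg"),  # a path, a numpy array, or None
   )

   # ... run the model on (input_tokens, pixel_values) to get output ids ...
   output_tokens = [1, 2, 3]  # placeholder ids for illustration
   answer = helper.postprocess(output_tokens)
   print(prompt, answer)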