sima_utils.transformer.vlm_config
=================================

.. py:module:: sima_utils.transformer.vlm_config


Attributes
----------

.. autoapisummary::

   sima_utils.transformer.vlm_config.MLA_CONSTRAINTS
   sima_utils.transformer.vlm_config.SPECIAL_TOKENS
   sima_utils.transformer.vlm_config.input_path


Classes
-------

.. autoapisummary::

   sima_utils.transformer.vlm_config.ExtendedEnum
   sima_utils.transformer.vlm_config.VlmArchType
   sima_utils.transformer.vlm_config.VisionArchType
   sima_utils.transformer.vlm_config.LlmArchType
   sima_utils.transformer.vlm_config.LlmArchVersion
   sima_utils.transformer.vlm_config.LlmDataType
   sima_utils.transformer.vlm_config.GgufDataType
   sima_utils.transformer.vlm_config.BaseConfig
   sima_utils.transformer.vlm_config.VisionModelConfig
   sima_utils.transformer.vlm_config.MMConnectionConfig
   sima_utils.transformer.vlm_config.TokenEmbedConfig
   sima_utils.transformer.vlm_config.RopeScalingConfig
   sima_utils.transformer.vlm_config.RoPEConfig
   sima_utils.transformer.vlm_config.AttentionBlockConfig
   sima_utils.transformer.vlm_config.MlpBlockConfig
   sima_utils.transformer.vlm_config.LanguageModelConfig
   sima_utils.transformer.vlm_config.PipelineConfig
   sima_utils.transformer.vlm_config.VlmConfig
   sima_utils.transformer.vlm_config.VlmHelper


Functions
---------

.. autoapisummary::

   sima_utils.transformer.vlm_config.get_model_arch_gen
   sima_utils.transformer.vlm_config.get_model_version_from_intermediate_size
   sima_utils.transformer.vlm_config.get_model_size_from_hidden_parameters
   sima_utils.transformer.vlm_config.llm_parameter_count
   sima_utils.transformer.vlm_config.apply_mla_constraint


Module Contents
---------------

.. py:class:: ExtendedEnum

   Add class methods to Enum.

   .. py:method:: values()
      :classmethod:

   .. py:method:: names()
      :classmethod:


.. py:class:: VlmArchType

   VLM architecture type.

   .. py:attribute:: VLM_LLAVA
      :value: 'vlm-llava'

   .. py:attribute:: VLM_PALIGEMMA
      :value: 'vlm-paligemma'

   .. py:attribute:: VLM_GEMMA3
      :value: 'vlm-gemma3'

   .. py:attribute:: VLM_CUSTOM
      :value: 'vlm-custom'

   .. py:attribute:: LLM_LLAMA2
      :value: 'llm-llama2'

   .. py:attribute:: LLM_LLAMA3_1
      :value: 'llm-llama3.1'

   .. py:attribute:: LLM_LLAMA3_2
      :value: 'llm-llama3.2'

   .. py:attribute:: LLM_GEMMA1
      :value: 'llm-gemma1'

   .. py:attribute:: LLM_GEMMA2
      :value: 'llm-gemma2'

   .. py:attribute:: LLM_GEMMA3
      :value: 'llm-gemma3'

   .. py:attribute:: LLM_PHI3_5
      :value: 'llm-phi3.5'


.. py:class:: VisionArchType

   Vision architecture type.

   .. py:attribute:: CLIP
      :value: 'clip'

   .. py:attribute:: SIGLIP
      :value: 'siglip'


.. py:class:: LlmArchType

   LLM architecture type.

   .. py:attribute:: LLAMA
      :value: 'llama'

   .. py:attribute:: GEMMA
      :value: 'gemma'

   .. py:attribute:: PHI
      :value: 'phi'


.. py:class:: LlmArchVersion

   LLM architecture version.

   .. py:attribute:: GEN_1
      :value: '1'

   .. py:attribute:: GEN_2
      :value: '2'

   .. py:attribute:: GEN_3
      :value: '3'

   .. py:attribute:: GEN_3_1
      :value: '3.1'

   .. py:attribute:: GEN_3_2
      :value: '3.2'

   .. py:attribute:: GEN_3_5
      :value: '3.5'


.. py:class:: LlmDataType

   LLM data type.

   .. py:attribute:: F32
      :value: 'float32'

   .. py:attribute:: F16
      :value: 'float16'

   .. py:attribute:: BF16
      :value: 'bfloat16'


.. py:class:: GgufDataType

   LLM data type format by GGUF.

   .. py:attribute:: F32
      :value: 0

   .. py:attribute:: F16
      :value: 1


.. py:data:: MLA_CONSTRAINTS
   :type: dict


.. py:class:: BaseConfig

   Base configuration with an update method.

   .. py:method:: update(cfg)
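All of the enums in this module derive from ``ExtendedEnum``, so architecture
and data-type strings can be validated before a configuration is built. Below
is a minimal sketch of how ``values()`` and ``names()`` behave for such a
helper (the exact return types are not documented on this page), shown against
a trimmed-down ``VlmArchType``:

.. code-block:: python

   from enum import Enum


   class ExtendedEnum(Enum):
       """Sketch of the documented helpers: list member values and names."""

       @classmethod
       def values(cls):
           # e.g. ['vlm-llava', 'llm-llama3.1'] for the enum below
           return [member.value for member in cls]

       @classmethod
       def names(cls):
           # e.g. ['VLM_LLAVA', 'LLM_LLAMA3_1'] for the enum below
           return [member.name for member in cls]


   class VlmArchType(ExtendedEnum):
       VLM_LLAVA = 'vlm-llava'
       LLM_LLAMA3_1 = 'llm-llama3.1'


   # Validate a user-supplied architecture string before using it.
   arch = 'llm-llama3.1'
   assert arch in VlmArchType.values()
   print(VlmArchType(arch).name)  # -> LLM_LLAMA3_1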
.. py:class:: VisionModelConfig

   Configuration of the vision encoder.

   .. attribute:: model_type

      The type of the model.

   .. attribute:: image_size

      The resolution of input images.

   .. attribute:: patch_size

      The patch size used to divide an image.

   .. attribute:: hidden_size

      The dimension of the embedding.

   .. attribute:: intermediate_size

      The dimension of the MLP layer.

   .. attribute:: num_attention_heads

      The number of attention heads.

   .. attribute:: num_hidden_layers

      The number of transformer blocks.

   .. attribute:: hidden_act

      The type of activation in the MLP.

   .. attribute:: layer_norm_eps

      The small value added to prevent division by zero.

   .. py:attribute:: model_type
      :type: str
      :value: ''

   .. py:attribute:: arch
      :type: VisionArchType

   .. py:attribute:: image_size
      :type: int
      :value: 0

   .. py:attribute:: patch_size
      :type: int
      :value: 0

   .. py:attribute:: cls_embed
      :type: bool
      :value: False

   .. py:attribute:: hidden_size
      :type: int
      :value: 0

   .. py:attribute:: intermediate_size
      :type: int
      :value: 0

   .. py:attribute:: num_attention_heads
      :type: int
      :value: 0

   .. py:attribute:: num_hidden_layers
      :type: int
      :value: 0

   .. py:attribute:: hidden_act
      :type: str
      :value: 'gelu_pytorch_tanh'

   .. py:attribute:: layer_norm_eps
      :type: float
      :value: 1e-06

   .. py:method:: set_default_config(arch: VisionArchType)

   .. py:property:: num_patches
      :type: int

   .. py:property:: seq_len
      :type: int
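The ``num_patches`` and ``seq_len`` properties follow from ``image_size``,
``patch_size``, and ``cls_embed``. The sketch below assumes the standard ViT
layout (a square image tiled into non-overlapping square patches, plus one
class token when ``cls_embed`` is set); ``VisionPatchMath`` is a hypothetical
stand-in, not the actual class:

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class VisionPatchMath:
       """Hypothetical stand-in for the patch-count logic."""

       image_size: int = 384
       patch_size: int = 16
       cls_embed: bool = False

       @property
       def num_patches(self) -> int:
           # A square image is tiled into (image_size / patch_size)^2 patches.
           return (self.image_size // self.patch_size) ** 2

       @property
       def seq_len(self) -> int:
           # CLIP-style encoders prepend a class token; SigLIP-style ones do not.
           return self.num_patches + (1 if self.cls_embed else 0)


   cfg = VisionPatchMath(image_size=384, patch_size=16, cls_embed=False)
   print(cfg.num_patches, cfg.seq_len)  # 576 576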
.. py:class:: MMConnectionConfig

   Configuration of the multi-modal connection.

   The MM connection consists of 1 or 2 linear layers.

   .. attribute:: num_layers

      The number of linear layers in the connection.

   .. attribute:: hidden_act

      The type of activation used when num_layers is 2.

   .. attribute:: mm_tokens_per_image

      The number of tokens projected for each image. If mm_tokens_per_image
      is less than num_patches, an AvgPool layer is inserted.

   .. attribute:: proj_dim

      The projection dimension for image tokens.

   .. py:attribute:: num_layers
      :type: int
      :value: 2

   .. py:attribute:: hidden_act
      :type: str
      :value: 'gelu'

   .. py:attribute:: mm_tokens_per_image
      :type: int
      :value: 0

   .. py:attribute:: proj_dim
      :type: int
      :value: 0

   .. py:method:: set_default_config(vm_arch: VisionArchType, num_patches: int, lm_hidden_size: int)


.. py:data:: SPECIAL_TOKENS
   :value: ['ignore_index', 'image_token_index', 'boi_token_index', 'eoi_token_index', 'bos_token_id',...


.. py:class:: TokenEmbedConfig

   Configuration of the tokenizer and embedding.

   .. attribute:: tokenizer_type

      The type of tokenizer.

   .. attribute:: tokenizer_path

      The path of the tokenizer model.

   .. attribute:: vocab_size

      The vocabulary size of the tokenizer.

   .. attribute:: special_tokens

      The dict of special tokens.

   .. py:attribute:: tokenizer_type
      :type: str
      :value: ''

   .. py:attribute:: tokenizer_path
      :type: str
      :value: ''

   .. py:attribute:: vocab_size
      :type: int
      :value: 0

   .. py:attribute:: special_tokens
      :type: dict

   .. py:method:: add_special_token(name, value)


.. py:class:: RopeScalingConfig

   Configuration of RoPE scaling.

   .. attribute:: factor

      The scaling factor.

   .. attribute:: low_freq_factor

      The low frequency factor (llama3).

   .. attribute:: high_freq_factor

      The high frequency factor (llama3).

   .. attribute:: original_max_position_embeddings

      The original context length used in model training with the given
      RoPE settings.

   .. attribute:: long_factor

      List of scaling factors for long context (longrope).

   .. attribute:: short_factor

      List of scaling factors for short context (longrope).

   .. attribute:: rope_type

      The type of RoPE scaling method. Supported types are "linear" or
      "default", "llama3", and "longrope".

   .. py:attribute:: factor
      :type: float
      :value: 1.0

   .. py:attribute:: low_freq_factor
      :type: float
      :value: 0

   .. py:attribute:: high_freq_factor
      :type: float
      :value: 0

   .. py:attribute:: original_max_position_embeddings
      :type: int
      :value: 0

   .. py:attribute:: long_factor
      :type: list[float] | None
      :value: None

   .. py:attribute:: short_factor
      :type: list[float] | None
      :value: None

   .. py:attribute:: rope_type
      :type: str
      :value: 'default'


.. py:class:: RoPEConfig

   Configuration of Rotary Position Embedding.

   .. attribute:: rope_theta

      The theta for RoPE.

   .. attribute:: rope_local_base_freq

      The local base frequency.

   .. attribute:: rope_scaling

      The settings for RoPE scaling.

   .. py:attribute:: rope_theta
      :type: float
      :value: 10000

   .. py:attribute:: rope_local_base_freq
      :type: float
      :value: 10000

   .. py:attribute:: rope_scaling
      :type: RopeScalingConfig


.. py:class:: AttentionBlockConfig

   Configuration of the attention block.

   .. attribute:: num_attention_heads

      The number of attention heads.

   .. attribute:: num_key_value_heads

      The number of key/value heads.

   .. attribute:: head_dim

      The dimension of query, key, and value heads.

   .. attribute:: swa_enable

      The flag to turn on sliding window attention.

   .. attribute:: swa_ratio

      The ratio of SWA layers to global attention layers.

   .. attribute:: sliding_window

      The size of the sliding window for SWA.

   .. attribute:: attention_bias

      Reserved for future use.

   .. attribute:: attention_dropout

      Reserved for future use.

   .. attribute:: query_pre_attn_scalar

      Reserved for future use.

   .. py:attribute:: num_attention_heads
      :type: int
      :value: 0

   .. py:attribute:: num_key_value_heads
      :type: int
      :value: 0

   .. py:attribute:: head_dim
      :type: int
      :value: 0

   .. py:attribute:: swa_enable
      :type: bool
      :value: False

   .. py:attribute:: swa_ratio
      :type: int
      :value: 0

   .. py:attribute:: sliding_window
      :type: int
      :value: 0

   .. py:attribute:: attention_bias
      :type: bool
      :value: False

   .. py:attribute:: attention_dropout
      :type: float
      :value: 0.0

   .. py:attribute:: query_pre_attn_scalar
      :type: int
      :value: 0

   .. py:property:: q_size
      :type: int

   .. py:property:: kv_size
      :type: int


.. py:class:: MlpBlockConfig

   Configuration of the MLP block.

   .. attribute:: intermediate_size

      The dimension of the MLP layer.

   .. attribute:: act

      The type of activation.

   .. attribute:: num_layers

      The number of layers in the MLP.

   .. attribute:: mlp_bias

      Reserved for future use.

   .. py:attribute:: intermediate_size
      :type: int
      :value: 0

   .. py:attribute:: act
      :type: str
      :value: 'silu'

   .. py:attribute:: num_layers
      :type: int
      :value: 3

   .. py:attribute:: mlp_bias
      :type: bool
      :value: False


.. py:class:: LanguageModelConfig

   Configuration of the LLM.

   .. attribute:: model_type

      The type of the LLM.

   .. attribute:: data_type

      The data type.

   .. attribute:: arch

      The architecture of the LLM.

   .. attribute:: gen

      The generation version of the LLM.

   .. attribute:: size

      The billion-parameter size designation of the LLM (e.g. '7b').

   .. attribute:: token_cfg

      The settings of the tokenizer.

   .. attribute:: rope_cfg

      The settings of RoPE.

   .. attribute:: attn_cfg

      The settings of the attention block.

   .. attribute:: mlp_cfg

      The settings of the MLP block.

   .. attribute:: hidden_size

      The dimension of the embedding.

   .. attribute:: num_hidden_layers

      The number of transformer blocks.

   .. attribute:: max_position_embeddings

      The context length.

   .. attribute:: rms_norm_eps

      The small value in RMS norm to prevent division by zero.

   .. attribute:: layer_norms

      Types of RMS norm used in a transformer block.

   .. attribute:: attn_logit_softcapping

      Gemma 2 attention logit soft capping.

   .. attribute:: final_logit_softcapping

      Gemma 2 final logit soft capping.

   .. py:attribute:: model_type
      :type: str
      :value: ''

   .. py:attribute:: data_type
      :type: LlmDataType

   .. py:attribute:: arch
      :type: LlmArchType

   .. py:attribute:: gen
      :type: LlmArchVersion

   .. py:attribute:: size
      :type: str
      :value: '7b'

   .. py:attribute:: token_cfg
      :type: TokenEmbedConfig

   .. py:attribute:: rope_cfg
      :type: RoPEConfig

   .. py:attribute:: attn_cfg
      :type: AttentionBlockConfig

   .. py:attribute:: mlp_cfg
      :type: MlpBlockConfig

   .. py:attribute:: hidden_size
      :type: int
      :value: 0

   .. py:attribute:: num_hidden_layers
      :type: int
      :value: 0

   .. py:attribute:: max_position_embeddings
      :type: int
      :value: 0

   .. py:attribute:: rms_norm_eps
      :type: float
      :value: 1e-05

   .. py:attribute:: layer_norms
      :type: list[str]
      :value: []

   .. py:attribute:: attn_logit_softcapping
      :type: float | None
      :value: None

   .. py:attribute:: final_logit_softcapping
      :type: float | None
      :value: None

   .. py:attribute:: lm_head_num_splits
      :type: int
      :value: 1

   .. py:attribute:: lm_head_split_dim
      :type: int
      :value: 0

   .. py:method:: load(cfg: dict) -> LanguageModelConfig
      :staticmethod:

   .. py:method:: update(cfg: dict)
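``LanguageModelConfig.attn_cfg`` above refers to an ``AttentionBlockConfig``,
whose ``q_size`` and ``kv_size`` properties describe the widths of the query
and key/value projections. Under the usual grouped-query-attention layout
these are simply head counts multiplied by ``head_dim``; the arithmetic below
is an illustrative assumption, not the actual implementation, using the
publicly known Llama-3.1-8B geometry:

.. code-block:: python

   # Llama-3.1-8B attention geometry (public reference values).
   num_attention_heads = 32
   num_key_value_heads = 8
   head_dim = 128

   # Assumed definitions of the two properties:
   q_size = num_attention_heads * head_dim   # 4096: width of the Q projection
   kv_size = num_key_value_heads * head_dim  # 1024: width of each of K and V

   print(q_size, kv_size)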
.. py:class:: PipelineConfig

   Configuration of the VLM pipeline.

   .. attribute:: system_prompt

      The system prompt.

   .. attribute:: max_num_tokens

      The maximum number of tokens, including both input and generated tokens.

   .. attribute:: input_token_group_size

      The group size of input tokens.

   .. attribute:: input_token_group_offsets

      The group offsets of input tokens.

   .. py:attribute:: system_prompt
      :type: str | None
      :value: None

   .. py:attribute:: max_num_tokens
      :type: int
      :value: 1024

   .. py:attribute:: input_token_group_size
      :type: int
      :value: 1

   .. py:attribute:: input_token_group_offsets
      :type: list[int] | None
      :value: None

   .. py:attribute:: future_token_mask_size
      :type: int
      :value: 1

   .. py:method:: set_system_prompt(prompt: str | None)

   .. py:method:: set_max_num_tokens(max_num_tokens: int)

   .. py:method:: set_group_size(size: int)

   .. py:method:: set_group_offsets(offsets: list[int])

   .. py:method:: set_future_token_mask_size(mask_size: int)


.. py:class:: VlmConfig

   Configuration of a Vision Language Model.

   .. attribute:: model_name

      The name of the model.

      :type: str

   .. attribute:: model_type

      The type of the model.

      :type: str

   .. attribute:: vm_cfg

      The settings of the vision model.

      :type: VisionModelConfig | None

   .. attribute:: mm_cfg

      The settings of the multi-modal connection.

      :type: MMConnectionConfig | None

   .. attribute:: lm_cfg

      The settings of the language model.

      :type: LanguageModelConfig

   .. attribute:: pipeline_cfg

      The settings of the application pipeline.

      :type: PipelineConfig

   .. py:attribute:: model_name
      :type: str
      :value: ''

   .. py:attribute:: model_type
      :type: VlmArchType | None
      :value: None

   .. py:attribute:: vm_cfg
      :type: VisionModelConfig | None
      :value: None

   .. py:attribute:: mm_cfg
      :type: MMConnectionConfig | None
      :value: None

   .. py:attribute:: lm_cfg
      :type: LanguageModelConfig

   .. py:attribute:: pipeline_cfg
      :type: PipelineConfig

   .. py:method:: load(vlm_cfg: dict) -> VlmConfig
      :staticmethod:

   .. py:method:: set_default_config(dtype: LlmDataType, vm_arch: VisionArchType | None, lm_arch: LlmArchType, gen: LlmArchVersion, b_size: str)

   .. py:method:: set_tokenizer_path(tokenizer_path: pathlib.Path)

   .. py:method:: from_hf_config(model_path: pathlib.Path, model_cfg: dict) -> VlmConfig
      :staticmethod:

      Generate SiMa's configuration for a VLM from a HuggingFace config dict
      and the MLA constraints.

      :param model_path: The path of the source model.
      :param model_cfg: The config dict of the source model.

      :returns: VlmConfig for the model.

   .. py:property:: is_multimodal

   .. py:method:: update_special_tokens(cfg: dict)

   .. py:method:: update_vision_model_params(cfg: dict)

   .. py:method:: update_mm_connection_params(cfg: dict)

   .. py:method:: update_language_model_params(cfg: dict)

   .. py:method:: config_pipeline(system_prompt: str | None, max_num_tokens: int, tokenizer: sima_utils.transformer.llm_tokenizer.LlmTokenizer, estimated_max_num_query_tokens: int = 100)
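A typical entry point is ``VlmConfig.from_hf_config``, which builds the
SiMa-side configuration from a HuggingFace ``config.json``. The sketch below
uses a hypothetical model directory; only ``from_hf_config``,
``is_multimodal``, and the ``pipeline_cfg`` setters documented above are
assumed:

.. code-block:: python

   import json
   import pathlib

   from sima_utils.transformer.vlm_config import VlmConfig

   # Hypothetical model directory containing a HuggingFace config.json.
   model_path = pathlib.Path("/models/llava-1.5-7b-hf")
   model_cfg = json.loads((model_path / "config.json").read_text())

   vlm_cfg = VlmConfig.from_hf_config(model_path, model_cfg)
   print(vlm_cfg.is_multimodal)

   # Adjust the application pipeline via the documented setters.
   vlm_cfg.pipeline_cfg.set_system_prompt("You are a helpful assistant.")
   vlm_cfg.pipeline_cfg.set_max_num_tokens(2048)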
.. py:class:: VlmHelper(vlm_cfg: VlmConfig, system_prompt: str | None = None)

   VLM helper class with processors.

   .. py:attribute:: tokenizer
      :type: sima_utils.transformer.llm_tokenizer.LlmTokenizer

   .. py:attribute:: prompt_formatter
      :type: sima_utils.transformer.prompt_template.PromptFormatter

   .. py:attribute:: image_preprocessor
      :type: sima_utils.transformer.vision_preprocessor.ImageProcessor | None

   .. py:method:: preprocess(query: str, image: pathlib.Path | str | numpy.ndarray | None) -> tuple[str, numpy.ndarray, numpy.ndarray | None]

      Preprocess the input query and the image.

      :param query: Input query string.
      :param image: Path to the image, or the loaded image as a numpy array.
                    Set to None if there is no image.

      :returns: Tuple of the formatted prompt, the tokenized input query,
                and the preprocessed image.

   .. py:method:: postprocess(output_tokens: numpy.ndarray | list[int]) -> str


.. py:function:: get_model_arch_gen(model_arch: str, model_type: str, text_type: str) -> tuple[VisionArchType | None, LlmArchType, LlmArchVersion | None]

   Derive the VLM architecture and version from the model arch and type.

   The model_arch contains "ForConditionalGeneration" for a VLM and
   "ForCausalLM" for an LM. The model_type may contain version information.

   :param model_arch: The architecture of a model.
   :param model_type: The type of a model.
   :param text_type: The type of language model.

   :returns: Tuple of vision architecture, LLM architecture, and version.


.. py:function:: get_model_version_from_intermediate_size(arch: LlmArchType, cfg: dict) -> LlmArchVersion

   Get the version of a model by matching intermediate_size.

   :param arch: The architecture type of a model.
   :param cfg: The configuration dictionary of the model.

   :returns: The version of the model in its architecture family.


.. py:function:: get_model_size_from_hidden_parameters(arch: LlmArchType, gen: LlmArchVersion, cfg: dict) -> str

   Get the size of a model by matching hidden_size and the number of layers.

   :param arch: The architecture type of a model.
   :param gen: The version of the model within the architecture family.
   :param cfg: The configuration dictionary of the model.

   :returns: The size of the model as a string in billion-designation.

   :raises ValueError: If no matching hidden_size and number of layers are found.


.. py:function:: llm_parameter_count(arch: LlmArchType, gen: LlmArchVersion, size_b: int, cfg: dict) -> int

   Calculate the parameter count of an LLM model.

   :param arch: The architecture type of a model.
   :type arch: LlmArchType
   :param gen: The version of the model within the architecture family.
   :type gen: LlmArchVersion
   :param size_b: The billion-size designation of the model.
   :type size_b: int
   :param cfg: The configuration dictionary of the model.
   :type cfg: dict

   :returns: The size of the model in terms of the parameter count.


.. py:function:: apply_mla_constraint(vlm_cfg: VlmConfig) -> None

   Apply the MLA constraints.

   TODO: modify RoPE if context_length (max_position_embeddings) is changed.

   :param vlm_cfg: The configuration of the VLM model.
   :type vlm_cfg: VlmConfig

   :returns: None. The change is made in place.


.. py:data:: input_path
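End to end, ``VlmHelper`` ties the processors together. The sketch below
continues from the hypothetical ``vlm_cfg`` built in the previous example;
the model-execution step is elided because it is outside this module, and the
placeholder token ids are for illustration only:

.. code-block:: python

   import pathlib

   from sima_utils.transformer.vlm_config import VlmHelper

   helper = VlmHelper(vlm_cfg, system_prompt="You are a helpful assistant.")

   prompt, input_tokens, pixel_values = helper.preprocess(
       query="What is in this picture?",
       image=pathlib.Path("/data/cat.jpg"),  # a path, a numpy array, or None
   )

   # ... run the model on (input_tokens, pixel_values) to get output ids ...
   output_tokens = [1, 2, 3]  # placeholder ids for illustration
   answer = helper.postprocess(output_tokens)
   print(prompt, answer)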