sima_utils.transformer

Submodules

Classes

VlmArchType

VLM architecture type.

VlmConfig

Configuration of a Vision Language Model.

VlmHelper

VLM helper class with processors.

Package Contents

class sima_utils.transformer.VlmArchType

VLM architecture type.

VLM_LLAVA = 'vlm-llava'
VLM_PALIGEMMA = 'vlm-paligemma'
VLM_GEMMA3 = 'vlm-gemma3'
VLM_CUSTOM = 'vlm-custom'
LLM_LLAMA2 = 'llm-llama2'
LLM_LLAMA3_1 = 'llm-llama3.1'
LLM_LLAMA3_2 = 'llm-llama3.2'
LLM_GEMMA1 = 'llm-gemma1'
LLM_GEMMA2 = 'llm-gemma2'
LLM_GEMMA3 = 'llm-gemma3'
LLM_PHI3_5 = 'llm-phi3.5'
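The enum's string values could be used, for example, to select an architecture from a HuggingFace `model_type` field. The sketch below is illustrative only: the mapping keys and the fallback to `vlm-custom` are assumptions; only the string values themselves come from the enum above.

```python
# Hypothetical mapping from HuggingFace ``model_type`` strings to the
# VlmArchType string values listed above. The keys and the fallback are
# assumptions for illustration; only the values are documented.
HF_TO_VLM_ARCH = {
    "llava": "vlm-llava",
    "paligemma": "vlm-paligemma",
    "gemma3": "vlm-gemma3",
    "llama": "llm-llama2",
}

def resolve_arch(model_type: str) -> str:
    # Fall back to the custom architecture when the model type is unknown.
    return HF_TO_VLM_ARCH.get(model_type, "vlm-custom")

print(resolve_arch("paligemma"))  # vlm-paligemma
print(resolve_arch("mystery-model"))  # vlm-custom
```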
class sima_utils.transformer.VlmConfig

Configuration of a Vision Language Model.

model_name

The name of the model.

Type:

str

model_type

The type of the model.

Type:

VlmArchType | None

vm_cfg

The settings of the vision model.

Type:

VisionModelConfig | None

mm_cfg

The settings of the multi-modal connection.

Type:

MMConnectionConfig | None

lm_cfg

The settings of the language model.

Type:

LanguageModelConfig

pipeline_cfg

The settings of the application pipeline.

Type:

PipelineConfig

model_name: str = ''
model_type: VlmArchType | None = None
vm_cfg: VisionModelConfig | None = None
mm_cfg: MMConnectionConfig | None = None
lm_cfg: LanguageModelConfig
pipeline_cfg: PipelineConfig
static load(vlm_cfg: dict) → VlmConfig
set_default_config(dtype: LlmDataType, vm_arch: VisionArchType | None, lm_arch: LlmArchType, gen: LlmArchVersion, b_size: str)
set_tokenizer_path(tokenizer_path: pathlib.Path)
static from_hf_config(model_path: pathlib.Path, model_cfg: dict) → VlmConfig

Generate SiMa’s configuration for a VLM from a HuggingFace config dict and MLA constraints.

Parameters:
  • model_path – The path of the source model.

  • model_cfg – The config dict of the source model.

Returns:

VlmConfig for the model.
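As a sketch of the input `from_hf_config` expects, the snippet below parses a HuggingFace-style config dict with the standard library. The field names shown are common HF config keys, not a documented SiMa schema, and the commented call at the end only restates the signature documented above.

```python
import json

# Minimal HuggingFace-style config, as found in a model's ``config.json``.
# The keys shown are typical HF fields (an assumption, not a SiMa schema).
config_json = """
{
  "model_type": "paligemma",
  "vision_config": {"hidden_size": 1152},
  "text_config": {"hidden_size": 2048}
}
"""
model_cfg = json.loads(config_json)

# With sima_utils installed, the documented entry point would then be:
#   from sima_utils.transformer import VlmConfig
#   vlm_cfg = VlmConfig.from_hf_config(model_path, model_cfg)
print(model_cfg["model_type"])  # paligemma
```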

property is_multimodal
update_special_tokens(cfg: dict)
update_vision_model_params(cfg: dict)
update_mm_connection_params(cfg: dict)
update_language_model_params(cfg: dict)
config_pipeline(system_prompt: str | None, max_num_tokens: int, tokenizer: sima_utils.transformer.llm_tokenizer.LlmTokenizer, estimated_max_num_query_tokens: int = 100)
class sima_utils.transformer.VlmHelper(vlm_cfg: VlmConfig, system_prompt: str | None = None)

VLM helper class with processors.

tokenizer: sima_utils.transformer.llm_tokenizer.LlmTokenizer
prompt_formatter: sima_utils.transformer.prompt_template.PromptFormatter
image_preprocessor: sima_utils.transformer.vision_preprocessor.ImageProcessor | None
preprocess(query: str, image: pathlib.Path | str | numpy.ndarray | None) → tuple[str, numpy.ndarray, numpy.ndarray | None]

Preprocess the input query and the image.

Parameters:
  • query – Input query string.

  • image – Path to the image, or the image already loaded as a numpy array. Set to None if there is no image.

Returns:

Tuple of the formatted prompt, the tokenized input query, and the preprocessed image (None if no image was given).

postprocess(output_tokens: numpy.ndarray | list[int]) → str

Decode the model’s output tokens into a response string.
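The preprocess/postprocess pair suggests the following round trip. This is a sketch built only from the signatures documented above: the helper instance and the inference step in the middle are placeholders, not SiMa APIs.

```python
def run_vlm(helper, query: str, image) -> str:
    """Round-trip a query through a VlmHelper-like object.

    The ``preprocess``/``postprocess`` signatures follow this page;
    the inference step between them is a placeholder (assumption).
    """
    prompt, query_tokens, image_array = helper.preprocess(query, image)
    # A real pipeline would run the model on ``query_tokens`` (and
    # ``image_array`` when present); stubbed here as an empty token list.
    output_tokens: list[int] = []
    return helper.postprocess(output_tokens)
```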