GenAI Model Compilation
Overview
The ModelSDK container provides the command-line tool llima-compile, which compiles models directly from Hugging Face checkpoints or GGUF files:
llima-compile [options] <model_path>
When you run this command, the tool handles the entire compilation pipeline including calibration, quantization, and code generation. The pipeline consists of several stages that differ slightly depending on the input format:
For HuggingFace Models:
DEVKIT - Generate runtime orchestration scripts
ONNX - Convert model to ONNX intermediate representation
QUANTIZE - Quantize model weights and calibrate
COMPILE - Compile to Modalix machine code
For GGUF Models:
DEVKIT - Generate runtime orchestration scripts
MODEL_SDK_DIRECT - Convert GGUF directly to Model SDK format (quantization already applied)
COMPILE - Compile to Modalix machine code
You can run individual stages using the --onnx, --quantize, --model_sdk, --compile, or --devkit flags if needed.
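If you script a partial run, the sketch below shows one way to call a single stage from Python. It assumes the stage flag is simply added to a normal invocation alongside the model path and -o output directory (verify the exact contract with llima-compile --help); the model and directory names are taken from the examples later in this section.

import subprocess

# Hypothetical partial run: re-run only the quantization stage, assuming the
# stage flag combines with the usual model path and -o arguments.
subprocess.run(
    ["llima-compile", "--quantize", "Llama-3.2-3B-Instruct", "-o", "Llama-3.2-3B-Instruct_out"],
    check=True,
)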
The compilation process generates the following directory structure in your output directory:
output_directory/
├── onnx_files/              # ONNX intermediate files (HF models only)
│   └── ...
└── sima_files/              # Compiled model files
    ├── devkit/              # Python runtime orchestration files
    │   ├── tokenizer.json
    │   ├── vlm_config.json
    │   └── ...
    ├── mpk/                 # MPK archives with compiled binaries
    │   ├── layer_0.tar.gz
    │   └── ...
    └── ...
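After a build finishes, a quick way to confirm the key artifacts are in place is to check for the files listed above. The snippet below is a minimal sketch assuming the default layout; replace output_directory with the directory you passed via -o, and note that onnx_files/ exists only for Hugging Face inputs and vlm_config.json may be present only for vision-language models.

from pathlib import Path

# Replace with the directory passed via -o (or the default model-name directory).
out = Path("output_directory")

# Key artifacts from the layout above. onnx_files/ exists only for Hugging Face
# inputs, and vlm_config.json may be present only for vision-language models.
expected = [
    out / "sima_files" / "devkit" / "tokenizer.json",
    out / "sima_files" / "devkit" / "vlm_config.json",
    out / "sima_files" / "mpk",
]
for path in expected:
    print(f"{path}: {'found' if path.exists() else 'MISSING'}")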
Command-Line Arguments
The llima-compile tool accepts various arguments to customize the compilation process. The following tables describe the available options:
| Argument | Description |
|---|---|
| model_path | Input model path (GGUF file or Hugging Face directory). |
| -o | Output directory for compiled files. Defaults to the model name. |
| -c | Python script to configure precision per layer (e.g., for mixed-precision). |
| --max_num_tokens | Max context length. Default: 1024. |
|  | Resume interrupted builds by skipping existing files. |
|  | Number of parallel compilation jobs. Default: number of physical CPU cores. |
|  | Logging level (DEBUG, INFO, WARNING, ERROR). Default: WARNING. |
| Advanced Argument | Description |
|---|---|
| --language_group_size | Batch size for parallel token processing during prefill. Larger values (e.g., 256) can improve TTFT for large input prompts, but can worsen TTFT for smaller input prompts. Default: 128. |
|  | Mask size for reusing compiled models across token positions. Larger values reduce the number of compiled binary files, but may reduce TPS. Default: 128. |
Configuration File
The configuration file allows customizing compilation on a per-layer basis, enabling mixed-precision compilation and selective layer compilation.
LLM inference consists of two distinct phases, and the compiler generates optimized models for each:
Prefill (Group models): Processes the input prompt in batches using language_group_size (e.g., 128 tokens at once). This phase determines TTFT (Time To First Token) and is optimized for throughput.
Decode (Single-token models): Generates output tokens one at a time autoregressively. This phase determines TPS (Tokens Per Second) and is optimized for low-latency generation.
Because these phases have different performance characteristics, you can apply different quantization strategies to each using the is_group flag in the configuration function.
Input Parameters
The get_layer_configuration function is called for each layer and receives:
model_properties: Dictionary with {"num_hidden_layers": int}
layer: Dictionary with:
  "part": Layer type - "PRE", "CACHE", "POST", or "VISION"
  "is_group": True for batch processing layers, False for single-token layers
  "index": Layer index (0 to num_hidden_layers - 1)
Return Values
The function returns a dictionary with:
"precision": Quantization level (required)"BF16": Full precision - best quality, largest size, slowest"A_BF16_W_INT8": Medium quantization - good quality, moderate size"A_BF16_W_INT4": High quantization - acceptable quality, smallest size, fastest
"compile": Set toFalseto skip compiling this layer (optional, default:True)
Note
Best Practice: Use INT8 (A_BF16_W_INT8) for group layers to maintain quality during prefill, INT4 (A_BF16_W_INT4) for single-token layers for fast generation, and BF16 for vision encoders to preserve image understanding quality. For most models, this configuration provides the optimal balance between model accuracy, throughput, and memory usage.
Examples
Example 1: Compiling a Simple LLM
Compile a Llama model, downloaded from Hugging Face, with default settings:
sima-user@docker-image-id:/home/docker$ hf download meta-llama/Llama-3.2-3B --local-dir Llama-3.2-3B-Instruct
sima-user@docker-image-id:/home/docker$ llima-compile Llama-3.2-3B-Instruct -o Llama-3.2-3B-Instruct_out
This will:
Use default BF16 precision for all layers
Set context length to 1024 tokens
Output to the Llama-3.2-3B-Instruct_out directory
Example 2: Compiling with Custom Context Length
sima-user@docker-image-id:/home/docker$ hf download meta-llama/Llama-3.2-3B --local-dir Llama-3.2-3B-Instruct
sima-user@docker-image-id:/home/docker$ llima-compile --max_num_tokens 4096 Llama-3.2-3B-Instruct -o Llama-3.2-3B-Instruct_out
This will:
Use default BF16 precision for all layers
Set context length to 4096 tokens
Output to the Llama-3.2-3B-Instruct_out directory
Example 3: Compiling Gemma 3 VLM with Mixed Precision
For complex models like Gemma 3 VLM, you may need to specify different precisions for different layers (e.g., keeping the vision encoder in BF16).
Download the model:
sima-user@docker-image-id:/home/docker$ hf download simaai/gemma3-siglip448 --local-dir gemma-3-model
Create a configuration file (e.g., config.py):

def get_layer_configuration(model_properties, layer):
    # Keep vision encoder in full precision
    if layer["part"] == "VISION":
        precision = "BF16"
    # Use INT8 for batch processing layers (better quality)
    elif layer["is_group"]:
        precision = "A_BF16_W_INT8"
    # Use INT4 for single-token layers (smaller size)
    else:
        precision = "A_BF16_W_INT4"
    return {"precision": precision}
Run the compiler:
sima-user@docker-image-id:/home/docker$ llima-compile -c config.py --max_num_tokens 2048 gemma-3-model -o gemma-3-model_out
Example 4: Advanced Configuration
Mixed precision with layer-specific control:
def get_layer_configuration(model_properties, layer):
    # Skip compiling certain cache layers
    if layer["part"] == "CACHE" and layer["index"] > 20:
        return {"compile": False}
    # Higher precision for early layers
    if layer["index"] < 4:
        return {"precision": "BF16"}
    # Standard quantization for the remaining layers
    return {"precision": "A_BF16_W_INT8"}