Introduction to LLiMa
Overview
The GenAI Model Compilation feature streamlines the process of compiling GenAI models from three input formats: Hugging Face safetensors, GGUF files, and pre-quantized compressed-tensor models (e.g. GPTQ/AWQ models created with llm-compressor). For a wide range of model families from Hugging Face, such as Llama, Gemma, Phi, Qwen, Mistral, and LFM, the SDK automatically generates all required binary/ELF files along with a Python orchestration script, enabling direct execution on the SiMa.ai Modalix platform.
SiMa has precompiled several popular models and published them on Hugging Face. LLiMa is not installed by default. To get started, create the LLiMa directory structure and install it globally on your Modalix device:
modalix:~$ cd /media/nvme && mkdir llima && cd llima
modalix:/media/nvme/llima$ sima-cli install -v 2.1.0 tools/llima -t full
This creates the required directories (including /media/nvme/llima/models for model storage) and makes the llima CLI available system-wide.
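To confirm the install, you can check that the model storage directory exists (the exact set of sibling directories may vary by release; only /media/nvme/llima/models is referenced on this page):
modalix:/media/nvme/llima$ ls -d /media/nvme/llima/models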
Model Manager
LLiMa includes a model manager accessible via the llima CLI. It lets you search, download, and run precompiled models directly from the command line. Models are stored under /media/nvme/llima/models by default; this path can be overridden by setting the LLIMA_MODELS_PATH environment variable.
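For example, to point the model manager at an alternate storage location (the path below is illustrative), export the variable before invoking the CLI:
modalix:~$ export LLIMA_MODELS_PATH=/media/nvme/my-models
modalix:~$ llima list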
Browse available models:
modalix:~$ llima search
modalix:~$ llima search qwen
Download a model by name (without the simaai/ org prefix):
modalix:~$ llima pull Qwen3-VL-8B-Instruct-a16w4
List and remove locally installed models:
modalix:~$ llima list
modalix:~$ llima rm Qwen3-VL-8B-Instruct-a16w4
Run a model directly in CLI or web mode:
modalix:~$ llima run Qwen3-VL-8B-Instruct-a16w4
modalix:~$ llima run Qwen3-VL-8B-Instruct-a16w4 --mode web
More details on the full set of llima run options can be found in the Runtime & Orchestration section.
GenAI Demo
For the full GenAI demo experience — including the web frontend and speech-to-text/text-to-speech support — use the run.sh script instead:
modalix:/media/nvme/llima$ ./run.sh
This prompts you to select a precompiled model and launches the complete demo application. More information can be found in the LLiMa demo application.
Supported Models
The following table shows the supported model architectures and their capabilities:
Model Architecture | Type | Supported Sizes
---|---|---
 | LLM |
 | LLM |
 | LLM | 1b, 3b
 | LLM | 2b, 7b
 | LLM | 2b, 9b
 | LLM |
 | LLM |
 | LLM |
 | LLM |
 | LLM |
 | LLM | 350m, 1.2b, 2.6b
 | VLM |
 | VLM |
 | VLM |
 | VLM |
 | VLM |
 | VLM |
Limitations
Limitation Type | Description
---|---
Model Architecture | Only models based on the architectures listed above are supported.
Model Parameters | Only models with a parameter count of less than 10B are supported.
HF Models | Models must be downloaded from Hugging Face and contain the required model files.
GGUF Models | GGUF format is supported for LLMs only. VLMs must be compiled from the Hugging Face safetensors format. Note that performance may decrease compared to Hugging Face safetensors compilation.
Compressed Tensor Models | Pre-quantized safetensors models (GPTQ/AWQ) created with llm-compressor are supported for LLMs only. The model must use symmetric quantization. These models can achieve better accuracy than standard INT4 quantization while maintaining high performance.
Gemma3 VLM | Supported with a modified SigLIP 448 vision encoder.
Llama 3.2 Vision | Vision models are not supported.