Introduction to LLiMa

Overview

The GenAI Model Compilation feature streamlines compiling GenAI models from three input formats: Hugging Face safetensors, GGUF files, and pre-quantized compressed-tensor models (e.g., GPTQ/AWQ models created with llm-compressor). For a wide range of model families from Hugging Face, such as Llama, Gemma, Phi, Qwen, Mistral, and LFM, the SDK automatically generates all required binary/ELF files along with the Python orchestration script, enabling direct execution on the SiMa.ai Modalix platform.

SiMa has precompiled several popular models and published them on Hugging Face. LLiMa is not installed by default. To get started, create the LLiMa directory structure and install it globally on your Modalix device:

modalix:~$ cd /media/nvme && mkdir llima && cd llima
modalix:/media/nvme/llima$ sima-cli install -v 2.1.0 tools/llima -t full

This creates the required directories (including /media/nvme/llima/models for model storage) and makes the llima CLI available system-wide.
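
To sanity-check the installation, you can confirm the binary is on your PATH with a standard shell lookup and then invoke one of the documented subcommands; on a fresh install, llima list should simply report no models:

modalix:~$ which llima
modalix:~$ llima list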

Model Manager

LLiMa includes a model manager accessible via the llima CLI. It lets you search, download, and run precompiled models directly from the command line. Models are stored under /media/nvme/llima/models by default; this path can be overridden by setting the LLIMA_MODELS_PATH environment variable.
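
For example, to keep downloaded models on a different volume for the current shell session (the path below is purely illustrative):

modalix:~$ export LLIMA_MODELS_PATH=/media/usb/llima-models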

Browse available models:

modalix:~$ llima search
modalix:~$ llima search qwen

Download a model by name (without the simaai/ org prefix):

modalix:~$ llima pull Qwen3-VL-8B-Instruct-a16w4

List and remove locally installed models:

modalix:~$ llima list
modalix:~$ llima rm Qwen3-VL-8B-Instruct-a16w4

Run a model directly in CLI or web mode:

modalix:~$ llima run Qwen3-VL-8B-Instruct-a16w4
modalix:~$ llima run Qwen3-VL-8B-Instruct-a16w4 --mode web

More details on the full set of llima run options can be found in the Runtime & Orchestration section.
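
The documented subcommands also compose naturally in a shell script. The sketch below is hypothetical and assumes llima list prints installed model names one per line; it pulls a model only if it is not already installed, then launches it in web mode:

#!/bin/sh
# Hypothetical helper: install-on-demand, then run in web mode.
# Uses only the llima subcommands documented above.
MODEL="${1:-Qwen3-VL-8B-Instruct-a16w4}"
if ! llima list | grep -q "$MODEL"; then
    llima pull "$MODEL"
fi
llima run "$MODEL" --mode web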

GenAI Demo

For the full GenAI demo experience — including the web frontend and speech-to-text/text-to-speech support — use the run.sh script instead:

modalix:/media/nvme/llima$ ./run.sh

This prompts you to select a precompiled model and launches the complete demo application. More information can be found in the LLiMa demo application section.

Supported Models

The following table shows the supported model architectures and their capabilities:

Model Architecture    Type    Supported Sizes
Llama 2               LLM     7b
Llama 3.1             LLM     8b
Llama 3.2             LLM     1b, 3b
Gemma 1               LLM     2b, 7b
Gemma 2               LLM     2b, 9b
Gemma 3               LLM     1b, 4b
Phi 3.5 mini          LLM     3.8b
Qwen 2.5              LLM     0.5b, 1.5b, 3b, 7b
Qwen 3                LLM     0.6b, 1.7b, 4b, 8b
Mistral 1             LLM     7b
LFM 2                 LLM     350m, 1.2b, 2.6b
Llava 1.5             VLM     7b
PaliGemma             VLM     3b
Gemma 3               VLM     4b
Qwen 2.5 VL           VLM     3b, 7b
Qwen 3 VL             VLM     2b, 4b, 8b
LFM 2                 VLM     450m, 1.6b, 3b

Limitations

Model Architecture: Only models based on the architectures listed above are supported.

Model Parameters: Only models with fewer than 10B parameters are supported.

HF Models: Models must be downloaded from Hugging Face and contain config.json, tokenizer.json, tokenizer_config.json, generation_config.json, and weights in safetensors format.

GGUF Models: The GGUF format is supported for LLMs only; VLMs must be compiled from the Hugging Face safetensors format. Note that performance may decrease compared to Hugging Face safetensors compilation.

Compressed Tensor Models: Pre-quantized safetensors models (GPTQ/AWQ) created with llm-compressor are supported for LLMs only, and the model must use symmetric quantization. These models can achieve better accuracy than standard INT4 quantization while maintaining high performance (see the sketch after this list).

Gemma 3 VLM: Supported with a modified SigLIP 448 vision encoder.

Llama 3.2 Vision: The vision model variants are not supported.
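
For reference, producing a compatible symmetric GPTQ model with llm-compressor might look like the minimal Python sketch below. It follows the upstream llm-compressor quickstart; the model name, calibration dataset, and sample counts are illustrative assumptions, and you should verify that the chosen scheme is symmetric, per the limitation above.

# Minimal llm-compressor sketch (illustrative, not SiMa-specific):
# one-shot GPTQ quantization to a 4-bit weight / 16-bit activation scheme.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# W4A16 applies symmetric 4-bit weight quantization by default in
# compressed-tensors; confirm this against the symmetric-quantization
# requirement noted above.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",   # illustrative base model
    dataset="open_platypus",                    # illustrative calibration set
    recipe=recipe,
    output_dir="Llama-3.2-1B-Instruct-W4A16",   # pre-quantized safetensors output
    max_seq_length=2048,
    num_calibration_samples=512,
)

The output directory then serves as the pre-quantized safetensors input to the compilation flow described in the Overview.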