.. _Introduction to LLiMa:

Introduction to LLiMa
=====================

Overview
--------

The GenAI Model Compilation feature streamlines the compilation of GenAI models from three input formats: Hugging Face safetensors, GGUF files, and pre-quantized compressed tensor models (e.g. GPTQ/AWQ models created with llm-compressor). For a wide range of Hugging Face models such as ``Llama``, ``Gemma``, ``Phi``, ``Qwen``, ``Mistral`` or ``LFM``, the SDK automatically generates all required binary/ELF files along with the Python orchestration script, enabling direct execution on the SiMa.ai Modalix platform. SiMa has precompiled several popular models and published them on Hugging Face.

LLiMa is not installed by default. To get started, create the LLiMa directory structure and install it globally on your Modalix device:

.. code-block:: console

   modalix:~$ cd /media/nvme && mkdir llima && cd llima
   modalix:~$ sima-cli install -v 2.1.0 tools/llima -t full

This creates the required directories (including ``/media/nvme/llima/models`` for model storage) and makes the ``llima`` CLI available system-wide.

Model Manager
~~~~~~~~~~~~~

LLiMa includes a model manager accessible via the ``llima`` CLI. It lets you search, download, and run precompiled models directly from the command line. Models are stored under ``/media/nvme/llima/models`` by default; this path can be overridden by setting the ``LLIMA_MODELS_PATH`` environment variable.

Browse available models:

.. code-block:: console

   modalix:~$ llima search
   modalix:~$ llima search qwen

Download a model by name (without the ``simaai/`` org prefix):

.. code-block:: console

   modalix:~$ llima pull Qwen3-VL-8B-Instruct-a16w4

List and remove locally installed models:

.. code-block:: console

   modalix:~$ llima list
   modalix:~$ llima rm Qwen3-VL-8B-Instruct-a16w4

Run a model directly in CLI or web mode:

.. code-block:: console

   modalix:~$ llima run Qwen3-VL-8B-Instruct-a16w4
   modalix:~$ llima run Qwen3-VL-8B-Instruct-a16w4 --mode web

The full set of ``llima run`` options is described in the :ref:`Runtime & Orchestration` section.

GenAI Demo
~~~~~~~~~~

For the full GenAI demo experience, including the web frontend and speech-to-text/text-to-speech support, use the ``run.sh`` script instead:

.. code-block:: console

   modalix:/media/nvme/llima$ ./run.sh

This prompts you to select a precompiled model and launches the complete demo application. More information can be found in the `LLiMa demo application <../overview/hello_sima/run_demos.html#llm-demo>`_.

Supported Models
----------------

The following table lists the supported model architectures, their type (LLM or VLM), and the supported sizes:

.. list-table::
   :widths: 30 15 55
   :header-rows: 1

   * - Model Architecture
     - Type
     - Supported Sizes
   * - Llama 2
     - LLM
     - 7b
   * - Llama 3.1
     - LLM
     - 8b
   * - Llama 3.2
     - LLM
     - 1b, 3b
   * - Gemma 1
     - LLM
     - 2b, 7b
   * - Gemma 2
     - LLM
     - 2b, 9b
   * - Gemma 3
     - LLM
     - 1b, 4b
   * - Phi 3.5 mini
     - LLM
     - 3.8b
   * - Qwen 2.5
     - LLM
     - 0.5b, 1.5b, 3b, 7b
   * - Qwen 3
     - LLM
     - 0.6b, 1.7b, 4b, 8b
   * - Mistral 1
     - LLM
     - 7b
   * - LFM 2
     - LLM
     - 350m, 1.2b, 2.6b
   * - Llava 1.5
     - VLM
     - 7b
   * - PaliGemma
     - VLM
     - 3b
   * - Gemma 3
     - VLM
     - 4b
   * - Qwen 2.5 VL
     - VLM
     - 3b, 7b
   * - Qwen 3 VL
     - VLM
     - 2b, 4b, 8b
   * - LFM 2
     - VLM
     - 450m, 1.6b, 3b

Limitations
-----------

.. list-table::
   :widths: 30 70
   :header-rows: 1
   :class: wrapped-table

   * - Limitation Type
     - Description
   * - Model Architecture
     - Only models based on the architectures listed above are supported.
   * - Model Parameters
     - Only models with fewer than 10B parameters are supported.
   * - HF Models
     - Models must be downloaded from Hugging Face and contain ``config.json``, ``tokenizer.json``, ``tokenizer_config.json``, ``generation_config.json``, and weights in safetensors format.
   * - GGUF Models
     - GGUF format is supported for LLMs only. VLMs must be compiled from the Hugging Face safetensors format. Note that performance may be lower than when compiling from Hugging Face safetensors.
   * - Compressed Tensor Models
     - Pre-quantized safetensors models (GPTQ/AWQ) created with llm-compressor are supported for LLMs only. The model must use symmetric quantization. These models can achieve better accuracy than standard INT4 quantization while maintaining high performance.
   * - Gemma3 VLM
     - Supported with a modified SigLIP 448 vision encoder.
   * - Llama 3.2 Vision
     - Vision models are not supported.
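
The file requirements in the HF Models row can be checked before starting a compilation. The following is a minimal sketch, not part of the SDK: the helper name ``missing_hf_files`` is hypothetical, and it assumes the weights sit in the top level of the model directory as one or more ``*.safetensors`` files.

```python
from pathlib import Path

# File names taken from the "HF Models" row of the limitations table.
REQUIRED_FILES = (
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "generation_config.json",
)


def missing_hf_files(model_dir):
    """Return the required items missing from model_dir.

    Hypothetical helper for illustration; it is not an SDK function.
    """
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Assumption: weights are stored at the top level of the directory,
    # either as a single file or as shards, all ending in .safetensors.
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python check_model.py /path/to/downloaded/model
    problems = missing_hf_files(sys.argv[1])
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All required files are present.")
```

An empty result only means the files exist. It does not confirm that the architecture or parameter count falls within the supported set.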