Runtime & Orchestration
This chapter details the structure of the LLiMa application on the Modalix device, how to run models using the llima CLI or the GenAI demo script, and how to utilize the OpenAI-compatible API endpoint.
LLiMa Structure on Modalix
By default, the LLiMa application and its associated models are installed in the /media/nvme/llima directory. The structure separates the compiled model artifacts from the demo and orchestration logic.
The following tree illustrates the directory layout and the purpose of key files:
/media/nvme/llima/
├── models/                       # Pre-compiled model artifacts
│   ├── gemma3-siglip448-a16w4/   # Vision-Language Model (standard)
│   ├── gte-small-local/          # Local embedding model for RAG (Retrieval Augmented Generation)
│   └── whisper-small-a16w8/      # Speech-to-Text model artifacts
├── simaai-genai-demo/            # Main LLiMa application orchestration directory
│   ├── app.py                    # Main Python entry point for the frontend
│   ├── milvus.db                 # Vector database storing embeddings for RAG
│   ├── vectordb/                 # Storage configuration for the vector DB
│   └── *.log                     # Application logs (app.log, server.log)
├── run.sh                        # Shell script to launch the application with multiple modes
└── install.sh                    # Setup script to install dependencies and environment
Running LLiMa
There are two ways to run a model on the Modalix device:
- llima run: lightweight inference via the llima CLI. Starts the model in CLI or web mode without the demo frontend or Piper TTS. Use this for direct API access or interactive terminal sessions.
- GenAI Demo (run.sh): full demo stack. Wraps llima run and additionally starts the web frontend application and Piper TTS. Use this for the complete demo experience.
llima run
modalix:~$ llima run <model> [options]
| Argument | Description |
|---|---|
| <model> | Model ID or path (e.g., Qwen3-VL-4B-Instruct-a16w4). |
| --mode | Run mode: CLI (interactive terminal, the default) or web (exposes an OpenAI-compatible API). |
| --stt_model_path | Path to the ELF files for a Speech-to-Text model (optional). |
For all available options, run llima run -h.
Examples
Run a model in interactive CLI mode:
modalix:~$ llima run Qwen3-VL-4B-Instruct-a16w4
Run a model together with the Whisper speech-to-text model in web mode (exposes an OpenAI-compatible API on port 9998):
modalix:~$ llima run Qwen3-VL-4B-Instruct-a16w4 --stt_model_path /media/nvme/llima/models/whisper-small-a16w8 --mode web
GenAI Demo (run.sh)
The run.sh script launches the full GenAI demo stack, including the web frontend and Piper TTS. It is located in the /media/nvme/llima/ directory and internally uses llima run.
modalix:/media/nvme/llima$ ./run.sh [options]
| Argument | Description |
|---|---|
|  | Connects the application to an external RAG (Retrieval Augmented Generation) file processing server at the specified IP address. The script assumes the server is listening on port 7860. |
| --httponly | Starts the application server in HTTP-only mode, disabling Web UI specific features. |
| --api-only | Starts the OpenAI-compatible API endpoints without enabling the Web UI or TTS services. Ideal for headless integrations. |
|  | Allows the user to provide a text file containing a custom system prompt to override the default behavior. |
|  | Displays the help message and exits. |
Note
The run.sh script automatically scans the ../models/ directory for available model folders. If multiple valid models are detected, the script will prompt you to interactively select which model to load.
Examples
1. Default Mode (Secure Web App)
Running the script without arguments launches the full LLiMa experience. This includes the backend API, the web-based User Interface (accessible via HTTPS), and loads the Speech-to-Text (STT) and Text-to-Speech (TTS) models.
modalix:/media/nvme/llima$ ./run.sh
2. HTTP Only Mode (Insecure but Faster)
Use this flag to launch the full application (Endpoints + UI) over HTTP instead of HTTPS. This is often useful for local testing or if you are behind a proxy handling SSL termination.
modalix:/media/nvme/llima$ ./run.sh --httponly
3. Advanced Combinations
You can combine flags to tailor the runtime environment. For example, the following command launches the API endpoints over HTTP without loading the Web UI or TTS/STT services (lightweight inference server).
modalix:/media/nvme/llima$ ./run.sh --httponly --api-only
LoRA Switching
When a model has been compiled with LoRA support, adapters can be dynamically applied or removed at runtime without restarting the model. Switching a LoRA adapter clears the chat history.
CLI mode (llima run <model>)
Type the following commands directly into the interactive prompt:
>>> set lora <adapter_name>
>>> unset lora
Web mode (llima run <model> --mode web)
Use the following HTTP endpoints on port 9998:
| Endpoint | Description |
|---|---|
| POST /set_lora | Activates a LoRA adapter. Request body: {"name": "<adapter_name>"}. |
| POST /unset_lora | Deactivates the current LoRA adapter and reverts to the base model. No request body required. |
Example:
$ curl -X POST "http://<modalix-ip>:9998/set_lora" \
-H "Content-Type: application/json" \
-d '{"name": "my_adapter"}'
$ curl -X POST "http://<modalix-ip>:9998/unset_lora"
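The same switch can be scripted from Python. The snippet below is a minimal equivalent of the cURL calls above; the adapter name my_adapter is a placeholder.

import requests

MODALIX_IP = "192.168.1.20"  # Replace with your device IP
BASE = f"http://{MODALIX_IP}:9998"

# Activate a LoRA adapter (this clears the chat history, as noted above)
requests.post(f"{BASE}/set_lora", json={"name": "my_adapter"}, timeout=10)

# Revert to the base model
requests.post(f"{BASE}/unset_lora", timeout=10)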
API Endpoint Reference
The LLiMa application exposes a RESTful API for chat generation, audio processing, and system control. The following sections detail the available endpoints and their required parameters.
Note
Port Usage:
- Port 9998 (HTTP): inference server endpoints for chat/image generation, speech-to-text (/v1/audio/transcriptions), and LoRA switching. Available with both llima run --mode web and the GenAI Demo (run.sh).
- Port 5000 (HTTPS): GenAI Demo application layer for text-to-speech, voice selection, system control, and RAG endpoints. Only available when running the full demo via run.sh.
Chat & Generation (Port 9998)
The application supports two endpoints for text generation, allowing integration with clients expecting either OpenAI or Ollama formats. Both endpoints trigger the inference backend and support streaming.
Endpoints (served on port 9998):
- OpenAI Compatible: POST /v1/chat/completions
- Ollama Compatible: POST /v1/chat
- LoRA Activate: POST /set_lora
- LoRA Deactivate: POST /unset_lora
Parameters
| Parameter | Type | Description |
|---|---|---|
| messages | Array | Required. A list of message objects representing the conversation history. For text-only requests, use a simple string as the message content. For Vision (VLM) requests, use an array of content parts to send text and images together; images must be base64-encoded data URIs (see the example structure below). |
| stream | Boolean | If true, the response is streamed back as it is generated. |
| options | Object | (Ollama only) Optional configuration parameters passed to the backend. |
Example structure for a vision (VLM) message:
{
  "role": "user",
  "content": [
    {"type": "image", "image": "data:image/jpeg;base64,<base64_string>"},
    {"type": "text", "text": "Describe this image"}
  ]
}
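For clients that expect the Ollama chat format, the /v1/chat endpoint accepts the same messages and stream fields plus the optional options object from the table above. The sketch below is a minimal example; the specific option key shown (temperature) is an illustrative assumption, as the supported keys depend on the backend.

import requests

MODALIX_IP = "192.168.1.20"  # Replace with your device IP

payload = {
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "stream": True,
    # Optional backend configuration; the exact supported keys are backend-dependent
    "options": {"temperature": 0.7}
}
response = requests.post(f"http://{MODALIX_IP}:9998/v1/chat",
                         json=payload, stream=True, timeout=60)

# Print streamed response lines as they arrive
for line in response.iter_lines():
    if line:
        print(line.decode("utf-8"))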
Note
Using Custom System Prompts
To define a custom system behavior (System Prompt), include a message with "role": "system" as the first element in your messages array. To ensure the model retains this persona across interactions, include this system message at the start of every request. On the first request the system prompt is cached; on subsequent requests the cached tokens are reused automatically, so the system prompt is not re-processed.
Example Structure:
# Initial Request
messages = [
{"role": "system", "content": "You are a helpful assistant that speaks in rhymes."},
{"role": "user", "content": "Hi there!"}
]
# Subsequent Inference (Append new user query after system prompt)
messages = [
{"role": "system", "content": "You are a helpful assistant that speaks in rhymes."},
{"role": "user", "content": "What is the weather?"}
]
Speech-to-Text (Port 9998)
POST /v1/audio/transcriptions
Transcribes an audio file into text (Speech-to-Text).
| Parameter | Type | Description |
|---|---|---|
| file | File | Required. The audio file to transcribe (sent as multipart/form-data). |
| language | String | Language code of the audio content. Default is "en". |
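A minimal Python sketch for the transcription endpoint, assuming the multipart field names file and language from the table above; the audio file name sample.wav is a placeholder.

import requests

MODALIX_IP = "192.168.1.20"  # Replace with your device IP

# Send an audio file for transcription; field names follow the parameter table above
with open("sample.wav", "rb") as f:
    response = requests.post(
        f"http://{MODALIX_IP}:9998/v1/audio/transcriptions",
        files={"file": f},
        data={"language": "en"},
        timeout=120,
    )
print(response.text)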
Text-to-Speech & Voice (Port 5000, GenAI Demo only)
POST /v1/audio/speech
Generates audio from input text (Text-to-Speech).
| Parameter | Type | Description |
|---|---|---|
|  | String | Required. The text to generate audio for. |
|  | String | The ID of the specific voice model to use. |
|  | String | Language code (e.g., "en", "fr"). Default is "en". |
|  | String | Audio format (e.g., "wav"). Default is "wav". |
GET /voices
Retrieves a list of available TTS voices.
| Parameter | Type | Description |
|---|---|---|
|  | String | Filter voices by language code (e.g., "en"). |
POST /voices/select
Switches the active voice model for the system.
| Parameter | Type | Description |
|---|---|---|
|  | String | Required. The unique identifier of the voice to select (obtained from GET /voices). |
|  | String | The language context for the voice. Default is "en". |
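The sketch below lists and then selects a voice from Python. It is an assumption-heavy example: the query parameter language, the JSON field names voice_id and language in the select request, and the self-signed HTTPS certificate (hence verify=False) are all assumptions; check the GET /voices output for the actual identifiers and field names.

import requests

MODALIX_IP = "192.168.1.20"  # Replace with your device IP
BASE = f"https://{MODALIX_IP}:5000"

# List available TTS voices (verify=False assumes a self-signed demo certificate;
# the "language" filter name is an assumption)
voices = requests.get(f"{BASE}/voices", params={"language": "en"}, verify=False, timeout=10)
print(voices.text)

# Select a voice; the JSON field names below are assumptions, adjust to match /voices output
requests.post(f"{BASE}/voices/select",
              json={"voice_id": "<voice-id-from-/voices>", "language": "en"},
              verify=False, timeout=10)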
System & RAG Operations (Port 5000, GenAI Demo only)
POST /stop
Immediately halts any ongoing inference or processing tasks.
Parameters: None.
GET /raghealth
Checks the status of the Retrieval Augmented Generation (RAG) services.
Parameters: None.
Returns: JSON object with the status of the rag_db and rag_fps services.
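A minimal sketch for checking RAG health from Python; verify=False assumes the demo serves HTTPS with a self-signed certificate.

import requests

MODALIX_IP = "192.168.1.20"  # Replace with your device IP

# Query RAG service status; the response describes the rag_db and rag_fps services
health = requests.get(f"https://{MODALIX_IP}:5000/raghealth", verify=False, timeout=10)
print(health.text)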
POST /upload-to-rag
Uploads a document to the vector database for RAG context.
| Parameter | Type | Description |
|---|---|---|
|  | File | Required. The document to upload. Supported formats: PDF, TXT, MD, XML. Sent as multipart/form-data. |
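A Python sketch of a document upload. The multipart field name file and the self-signed HTTPS certificate (verify=False) are assumptions; the file name manual.pdf is a placeholder.

import requests

MODALIX_IP = "192.168.1.20"  # Replace with your device IP

# Upload a document into the RAG vector database
# The multipart field name "file" is an assumption; adjust if your build expects another name
with open("manual.pdf", "rb") as f:
    response = requests.post(f"https://{MODALIX_IP}:5000/upload-to-rag",
                             files={"file": f}, verify=False, timeout=120)
print(response.status_code, response.text)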
POST /import-rag-db
Replaces the existing vector database with a provided DB file.
| Parameter | Type | Description |
|---|---|---|
|  | File | Required. The DB file that will replace the existing vector database. |
Endpoint Examples
This section provides practical examples of how to interact with the APIs exposed by LLiMa using both command-line tools and Python. The examples cover the /v1/chat/completions and /stop endpoints.
The following examples demonstrate how to send a request to the OpenAI-compatible /v1/chat/completions endpoint.
cURL (Terminal)
Use the -N flag to enable immediate streaming output. Replace <modalix-ip> with your device's IP address.
1. Text Generation
$ curl -N -X POST "http://<modalix-ip>:9998/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "user", "content": "Why is the sky blue?" }
],
"stream": true
}'
2. Visual Language Model (Text + Image)
To send an image, encode it as a base64 data URI and pass it within the content array.
$ curl -N -X POST "http://<modalix-ip>:9998/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"stream": true,
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"image": "data:image/jpeg;base64,<YOUR_BASE64_STRING_HERE>"
},
{
"type": "text",
"text": "Describe this image"
}
]
}
]
}'
Python
The following Python snippet handles both text-only and multimodal (image + text) requests. The image path image.jpg and the device IP are placeholders.
import base64
import requests

MODALIX_IP = "192.168.1.20"   # Replace with your device IP
USER_PROMPT = "Describe this image"

# Message format for a text-only request
messages = [{"role": "user", "content": USER_PROMPT}]

# Message format for an image + text (VLM) request: the image is sent as a base64 data URI
with open("image.jpg", "rb") as f:
    encoded_img = base64.b64encode(f.read()).decode("utf-8")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": f"data:image/jpeg;base64,{encoded_img}"},
        {"type": "text", "text": USER_PROMPT}
    ]
}]

payload = {"messages": messages, "stream": True}
response = requests.post(f"http://{MODALIX_IP}:9998/v1/chat/completions", json=payload, stream=True, timeout=60)

# Print streamed response lines as they arrive
for line in response.iter_lines():
    if line:
        print(line.decode("utf-8"))
If a generation is running too long or needs to be cancelled, use the /stop endpoint on the inference server.
cURL (Terminal)
$ curl -X POST http://<modalix-ip>:9998/stop
Python
import requests
MODALIX_IP = "192.168.1.20" # Replace with your device IP
URL = f"http://{MODALIX_IP}:9998/stop"
try:
response = requests.post(URL, timeout=10)
if response.status_code == 200:
print("Inference stopped successfully.")
else:
print(f"Failed to stop. Status: {response.status_code}")
except Exception as e:
print(f"Error: {e}")
Note
To stop the full application (including audio and RAG services), use the application-layer endpoint instead: POST https://<modalix-ip>:5000/stop
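A minimal Python sketch for that application-layer stop; verify=False assumes the demo serves HTTPS with a self-signed certificate.

import requests

MODALIX_IP = "192.168.1.20"  # Replace with your device IP

# Stop the full GenAI Demo stack (inference, audio, and RAG services)
response = requests.post(f"https://{MODALIX_IP}:5000/stop", verify=False, timeout=10)
print(response.status_code)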