Runtime & Orchestration
This chapter details the structure of the LLiMa application on the Modalix device, how to execute the application using the run script, and how to utilize it as an OpenAI-compatible endpoint.
LLiMa Structure on Modalix
By default, the LLiMa application and its associated models are installed in the llima directory. The structure separates the compiled model artifacts from the application logic that runs the demo.
The following tree illustrates the directory layout and the purpose of key files:
/media/nvme/llima/
├── gte-small-local/                    # Local embedding model for RAG (Retrieval Augmented Generation)
├── llama-3.2-3B-Instruct-a16w4-102/    # Pre-compiled LLM artifacts (weights and configuration); can be installed directly from the SiMa Hugging Face repo
├── whisper-small-a16w8/                # Pre-compiled Speech-to-Text model artifacts
└── simaai-genai-demo/                  # Main LLiMa application orchestration directory
    ├── app.py                          # Main Python entry point for the frontend
    ├── run.sh                          # Shell script to launch the application in multiple modes
    ├── install.sh                      # Setup script that installs dependencies and the environment; run as part of the LLiMa install
    ├── milvus.db                       # Vector database storing embeddings for RAG, pre-populated with the Modalix Brief documentation for testing and demos
    ├── vectordb/                       # Storage configuration for the vector DB
    └── *.log                           # Application logs (app.log, run.log) and server/console logs
Running LLiMa
The primary method for launching the application is the run.sh script located in the simaai-genai-demo directory. This script orchestrates the environment, handles model selection, and starts the necessary backend and frontend services.
Command Syntax
./run.sh [options]
Arguments
The script supports several flags to customize the deployment mode (e.g., headless API, UI-only) or integrate with external services like RAG servers.
| Argument | Description |
|---|---|
|  | Connects the application to an external RAG (Retrieval Augmented Generation) file processing server at the specified IP address. The script assumes the server is listening on port 7860. |
| --httponly | Starts the application server in HTTP-only mode, disabling Web UI specific features. |
|  | Launches only the … |
| --api-only | Starts the OpenAI-compatible API endpoints without enabling the Web UI or TTS services. Ideal for headless integrations. |
|  | Allows the user to provide a text file containing a custom system prompt to override the default behavior. |
| --cli | Starts the backend in interactive CLI mode rather than as a background process. |
|  | Displays the help message and exits. |
Note
The run.sh script automatically scans the parent directory (../) for available model folders. If multiple valid models are detected, the script will prompt you to interactively select which model to load.
Runtime Examples
The run.sh script is versatile and can be configured for different deployment needs, from a full secure web application to isolated component testing.
1. Default Mode (Secure Web App)
Running the script without arguments launches the full LLiMa experience. This includes the backend API, the web-based User Interface (accessible via HTTPS), and loads the Speech-to-Text (STT) and Text-to-Speech (TTS) models.
modalix:~/llima/simaai-genai-demo$ ./run.sh
2. HTTP Only Mode (Insecure but Faster)
Use this flag to launch the full application (Endpoints + UI) over HTTP instead of HTTPS. This is often useful for local testing or if you are behind a proxy handling SSL termination.
modalix:~/llima/simaai-genai-demo$ ./run.sh --httponly
3. CLI Mode
If you prefer working directly in the terminal without any web interface or background web server, use the CLI mode.
modalix:~/llima/simaai-genai-demo$ ./run.sh --cli
4. Backend Only (Inference Server)
This mode launches only the LLM inference backend server. This is ideal when you need the processing power of the LLiMa engine but intend to connect it to a custom frontend or external application.
modalix:~/llima/simaai-genai-demo$ ./run.sh --backend-only
5. Advanced Combinations
You can combine flags to tailor the runtime environment. For example, the following command launches the API endpoints over HTTP without loading the Web UI or TTS/STT services (lightweight inference server).
modalix:~/llima/simaai-genai-demo$ ./run.sh --httponly --api-only
API Endpoint Reference
The LLiMa application exposes a RESTful API for chat generation, audio processing, and system control. The following sections detail the available endpoints and their required parameters.
Chat & Generation
The application supports two endpoints for text generation, allowing integration with clients expecting either OpenAI or Ollama formats. Both endpoints trigger the inference backend and support streaming.
Endpoints
OpenAI Compatible: POST /v1/chat/completions
Ollama Compatible: POST /v1/chat
Parameters
| Parameter | Type | Description |
|---|---|---|
| messages | Array | Required. A list of message objects representing the conversation history. For text-only requests, use a simple string content. For vision (VLM) requests, use an array of content parts to send text and images together; images must be base64-encoded data URIs. Example structure: {"role": "user", "content": [{"type": "image", "image": "data:image/jpeg;base64,<base64_string>"}, {"type": "text", "text": "Describe this image"}]} |
| stream | Boolean | If true, the response is streamed back as it is generated. |
| options | Object | (Ollama only) Optional configuration parameters passed to the backend. |
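A minimal Python sketch of a streamed request to the Ollama-compatible endpoint. The "options" key follows the Ollama convention and the value shown is illustrative; the exact option keys supported by the backend are not listed here.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address

payload = {
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"}
    ],
    "stream": True,
    # "options" follows the Ollama convention; the key shown is illustrative.
    "options": {"temperature": 0.7}
}

# verify=False accepts the device's self-signed certificate
response = requests.post(f"https://{MODALIX_IP}:5000/v1/chat",
                         json=payload, stream=True, verify=False, timeout=60)
for line in response.iter_lines(decode_unicode=True):
    if line:
        print(line)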
Note
Using Custom System Prompts
To define a custom system behavior (system prompt), include a message with "role": "system" as the first element of your messages array. To ensure the model retains this persona across interactions, include this system message at the start of every request. The system prompt is cached on the first request; subsequent requests reuse the cached tokens automatically, so the system prompt does not need to be re-processed.
Example Structure:
# Initial Request
messages = [
{"role": "system", "content": "You are a helpful assistant that speaks in rhymes."},
{"role": "user", "content": "Hi there!"}
]
# Subsequent Inference (Append new user query after system prompt)
messages = [
{"role": "system", "content": "You are a helpful assistant that speaks in rhymes."},
{"role": "user", "content": "What is the weather?"}
]
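The following sketch shows this pattern against the OpenAI-compatible endpoint: the same system message leads every request, and after the first call the backend serves its tokens from the cache. The endpoint URL and verify=False handling follow the cURL and Python examples later in this chapter.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address
SYSTEM_MSG = {"role": "system",
              "content": "You are a helpful assistant that speaks in rhymes."}

def chat(user_text):
    # The system message leads every request; after the first call its tokens
    # are served from the cache, so they are not re-processed by the backend.
    payload = {"messages": [SYSTEM_MSG, {"role": "user", "content": user_text}],
               "stream": True}
    response = requests.post(f"https://{MODALIX_IP}:5000/v1/chat/completions",
                             json=payload, stream=True, verify=False, timeout=60)
    for line in response.iter_lines(decode_unicode=True):
        if line:
            print(line)

chat("Hi there!")             # initial request (system prompt gets cached)
chat("What is the weather?")  # subsequent request reuses the cached tokens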
Audio & Voice
POST /v1/audio/speech
Generates audio from input text (Text-to-Speech).
| Parameter | Type | Description |
|---|---|---|
|  | String | Required. The text to generate audio for. |
|  | String | The ID of the specific voice model to use. |
|  | String | Language code (e.g., 'en', 'fr'). Default is 'en'. |
|  | String | Audio format (e.g., 'wav'). Default is 'wav'. |
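A hedged Python sketch of a text-to-speech request. The JSON field names ("input", "voice", "language", "format") are assumptions mapped onto the parameter table above, which does not spell out the exact keys; adjust them if the application expects different names.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address

# NOTE: the field names "input", "voice", "language", and "format" are
# assumptions; adjust them if the application expects different keys.
payload = {
    "input": "Hello from the Modalix board.",
    "voice": "<voice-id>",   # an ID returned by GET /voices
    "language": "en",
    "format": "wav"
}

response = requests.post(f"https://{MODALIX_IP}:5000/v1/audio/speech",
                         json=payload, verify=False, timeout=60)
with open("speech.wav", "wb") as f:
    f.write(response.content)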
POST /v1/audio/transcriptions
Transcribes an audio file into text (Speech-to-Text).
| Parameter | Type | Description |
|---|---|---|
|  | File | Required. The audio file to transcribe (sent as multipart/form-data). |
|  | String | Language code of the audio content. Default is 'en'. |
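A hedged Python sketch of a transcription request. The multipart field names ("file", "language") are assumptions based on the table above.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address

# NOTE: the multipart field names "file" and "language" are assumptions.
with open("recording.wav", "rb") as audio:
    response = requests.post(f"https://{MODALIX_IP}:5000/v1/audio/transcriptions",
                             files={"file": audio},
                             data={"language": "en"},
                             verify=False, timeout=120)
print(response.text)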
GET /voices
Retrieves a list of available TTS voices.
| Parameter | Type | Description |
|---|---|---|
|  | String | Filter voices by language code (e.g., 'en'). |
POST /voices/select
Switches the active voice model for the system.
| Parameter | Type | Description |
|---|---|---|
|  | String | Required. The unique identifier of the voice to select (obtained from GET /voices). |
|  | String | The language context for the voice. Default is 'en'. |
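A hedged Python sketch for listing voices and switching the active voice. The "language" query parameter and the "voice_id"/"language" body fields are assumed names derived from the tables above.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address
BASE = f"https://{MODALIX_IP}:5000"

# List available TTS voices (the "language" query parameter name is assumed).
voices = requests.get(f"{BASE}/voices", params={"language": "en"},
                      verify=False, timeout=30)
print(voices.text)

# Switch the active voice (the "voice_id" and "language" field names are
# assumed; use an identifier returned by GET /voices).
select = requests.post(f"{BASE}/voices/select",
                       json={"voice_id": "<voice-id>", "language": "en"},
                       verify=False, timeout=30)
print(select.status_code, select.text)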
System & RAG Operations
POST /stop
Immediately halts any ongoing inference or processing tasks.
Parameters: None.
GET /raghealth
Checks the status of the Retrieval Augmented Generation (RAG) services.
Parameters: None.
Returns: JSON object with the status of the rag_db and rag_fps services.
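A quick Python check of RAG health; the response reports the status of the rag_db and rag_fps services.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address

# Check the RAG services; the JSON response reports the status of the
# rag_db and rag_fps services.
health = requests.get(f"https://{MODALIX_IP}:5000/raghealth",
                      verify=False, timeout=10)
print(health.text)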
POST /upload-to-rag
Uploads a document to the vector database for RAG context.
| Parameter | Type | Description |
|---|---|---|
|  | File | Required. The document to upload. Supported formats: PDF, TXT, MD, XML. Sent as multipart/form-data. |
POST /import-rag-db
Replaces the existing vector database with a provided DB file.
| Parameter | Type | Description |
|---|---|---|
|  | File | Required. The database (.db) file that replaces the existing vector database. |
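A hedged Python sketch for the two RAG file operations. The multipart field name "file" and the example filenames are assumptions; the upload accepts PDF, TXT, MD, or XML documents, and the import replaces the current vector database with the provided .db file.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address
BASE = f"https://{MODALIX_IP}:5000"

# Upload a document (PDF, TXT, MD, or XML) into the RAG vector database.
# The multipart field name "file" and the filenames are assumptions.
with open("modalix_brief.pdf", "rb") as doc:
    upload = requests.post(f"{BASE}/upload-to-rag", files={"file": doc},
                           verify=False, timeout=300)
print(upload.status_code, upload.text)

# Replace the existing vector database with a prepared .db file.
with open("milvus.db", "rb") as db:
    imported = requests.post(f"{BASE}/import-rag-db", files={"file": db},
                             verify=False, timeout=300)
print(imported.status_code, imported.text)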
Endpoint Examples
This section provides practical examples of how to interact with the APIs exposed by the LLiMa endpoint, using both command-line tools and Python. The examples cover the /v1/chat/completions and /stop endpoints.
The following examples demonstrate how to send a request to the OpenAI-compatible /v1/chat/completions endpoint.
cURL (Terminal)
Use the -N flag to enable immediate streaming output and -k (insecure) if using the device's self-signed certificate. Replace <modalix-ip> with your device's IP address.
1. Text Generation
$ curl -N -k -X POST "https://<modalix-ip>:5000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "user", "content": "Why is the sky blue?" }
],
"stream": true
}'
2. Visual Language Model (Text + Image)
To send an image, encode it as a base64 data URI and pass it within the content array.
$ curl -N -k -X POST "https://<modalix-ip>:5000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"stream": true,
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"image": "data:image/jpeg;base64,<YOUR_BASE64_STRING_HERE>"
},
{
"type": "text",
"text": "Describe this image"
}
]
}
]
}'
Python
This Python snippet handles both text-only and multimodal (image + text) requests.
import base64
import requests

MODALIX_IP = "<modalix-ip>"   # replace with your device's IP address
USER_PROMPT = "Describe this image"

# Message format for text only
messages = [{"role": "user", "content": USER_PROMPT}]

# Message format for image + text (VLM): base64-encode the image first
with open("image.jpg", "rb") as f:
    encoded_img = base64.b64encode(f.read()).decode("utf-8")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": f"data:image/jpeg;base64,{encoded_img}"},
        {"type": "text", "text": USER_PROMPT}
    ]
}]

payload = {"messages": messages, "stream": True}
# verify=False accepts the device's self-signed certificate
response = requests.post(f"https://{MODALIX_IP}:5000/v1/chat/completions",
                         json=payload, stream=True, verify=False, timeout=60)

# Streamed response handling: print chunks as they arrive
for line in response.iter_lines(decode_unicode=True):
    if line:
        print(line)
Warning
Advanced: Direct Inference Port (9998)
For advanced debugging or raw performance monitoring, it is possible to bypass the application layer and hit the inference server directly on port 9998. These requests must be issued directly on the Modalix board.
URL: http://localhost:9998
Payload: {"text": "Your prompt here"}
Python Example:
import requests
url = "http://localhost:9998"
payload = {"text": "Describe the future of AI"}
# This returns raw text output, often including internal tokens or TPS metrics.
# Use with care as it bypasses standard formatting and safety checks.
response = requests.post(url, json=payload, timeout=30)
print(response.text)
If a generation runs too long or needs to be cancelled immediately (depending on the logic you implement), use the /stop endpoint.
cURL (Terminal)
$ curl -k -X POST https://<modalix-ip>:5000/stop
Python
import requests
MODALIX_IP = "192.168.1.20" # Replace with your device IP
URL = f"https://{MODALIX_IP}:5000/stop"
try:
    # Using verify=False to handle self-signed certs on the device
    response = requests.post(URL, verify=False, timeout=10)
    if response.status_code == 200:
        print("Inference stopped successfully.")
    else:
        print(f"Failed to stop. Status: {response.status_code}")
except Exception as e:
    print(f"Error: {e}")
Warning
Advanced: Direct Inference Port (9998)
For advanced debugging or raw performance monitoring, it is possible to bypass the application layer and stop the running inference directly on port 9998. These requests must be issued directly on the Modalix board.
URL: http://localhost:9998/stop
Python Example:
import requests
requests.post("http://localhost:9998/stop")