Runtime & Orchestration
This chapter details the structure of the LLiMa application on the Modalix device, how to execute the application using the run script, and how to utilize it as an OpenAI-compatible endpoint.
LLiMa Structure on Modalix
By default, the LLiMa application and its associated models are installed in the llima directory. The structure separates the compiled model artifacts from the application logic that runs the demo.
The following tree illustrates the directory layout and the purpose of key files:
/media/nvme/llima/
├── gte-small-local/                    # Local embedding model for RAG (Retrieval Augmented Generation)
├── llama-3.2-3B-Instruct-a16w4-102/    # Pre-compiled LLM artifacts (weights and configuration); can be installed directly from the SiMa Hugging Face repo
├── whisper-small-a16w8/                # Pre-compiled Speech-to-Text model artifacts
└── simaai-genai-demo/                  # Main LLiMa application orchestration directory
    ├── app.py                          # Main Python entry point for the frontend
    ├── run.sh                          # Shell script to launch the application in multiple modes
    ├── install.sh                      # Setup script that installs dependencies and the environment; run as part of the LLiMa install
    ├── milvus.db                       # Vector database storing embeddings for RAG, pre-populated with the Modalix Brief documentation for testing and demos
    ├── vectordb/                       # Storage configuration for the vector DB
    └── *.log                           # Application logs (app.log, run.log) and server/console logs
Running LLiMa
The primary method for launching the application is the run.sh script located in the simaai-genai-demo directory. This script orchestrates the environment, handles model selection, and starts the necessary backend and frontend services.
Command Syntax
./run.sh [options]
Arguments
The script supports several flags to customize the deployment mode (e.g., headless API, UI-only) or integrate with external services like RAG servers.
| Argument | Description |
|---|---|
|  | Connects the application to an external RAG (Retrieval Augmented Generation) file processing server at the specified IP address. The script assumes the server is listening on port 7860. |
| --httponly | Starts the application server in HTTP-only mode, disabling Web UI specific features. |
|  | Launches only the … |
| --api-only | Starts the OpenAI-compatible API endpoints without enabling the Web UI or TTS services. Ideal for headless integrations. |
|  | Allows the user to provide a text file containing a custom system prompt to override the default behavior. |
| --cli | Starts the backend in interactive CLI mode rather than as a background process. |
|  | Displays the help message and exits. |
Note
The run.sh script automatically scans the parent directory (../) for available model folders. If multiple valid models are detected, the script will prompt you to interactively select which model to load.
Runtime Examples
The run.sh script is versatile and can be configured for different deployment needs, from a full secure web application to isolated component testing.
1. Default Mode (Secure Web App)
Running the script without arguments launches the full LLiMa experience. This includes the backend API, the web-based User Interface (accessible via HTTPS), and loads the Speech-to-Text (STT) and Text-to-Speech (TTS) models.
modalix:~/llima/simaai-genai-demo$ ./run.sh
2. HTTP Only Mode (Insecure but Faster)
Use this flag to launch the full application (Endpoints + UI) over HTTP instead of HTTPS. This is often useful for local testing or if you are behind a proxy handling SSL termination.
modalix:~/llima/simaai-genai-demo$ ./run.sh --httponly
3. CLI Mode
If you prefer working directly in the terminal without any web interface or background web server, use the CLI mode.
modalix:~/llima/simaai-genai-demo$ ./run.sh --cli
4. Backend Only (Inference Server)
This mode launches only the LLM inference backend server. This is ideal when you need the processing power of the LLiMa engine but intend to connect it to a custom frontend or external application.
modalix:~/llima/simaai-genai-demo$ ./run.sh --backend-only
5. Advanced Combinations
You can combine flags to tailor the runtime environment. For example, the following command launches the API endpoints over HTTP without loading the Web UI or TTS/STT services (lightweight inference server).
modalix:~/llima/simaai-genai-demo$ ./run.sh --httponly --api-only
API Endpoint Reference
The LLiMa application exposes a RESTful API for chat generation, audio processing, and system control. The following sections detail the available endpoints and their required parameters.
Chat & Generation
The application supports two endpoints for text generation, allowing integration with clients expecting either OpenAI or Ollama formats. Both endpoints trigger the inference backend and support streaming.
Endpoints
OpenAI Compatible: POST /v1/chat/completions
Ollama Compatible: POST /v1/chat
Parameters
| Parameter | Type | Description |
|---|---|---|
| messages | Array | Required. A list of message objects representing the conversation history. For text-only requests, use a simple string content. For vision (VLM) requests, use an array of content parts to send text and images together; images must be base64-encoded data URIs. Example structure: {"role": "user", "content": [{"type": "image", "image": "data:image/jpeg;base64,<base64_string>"}, {"type": "text", "text": "Describe this image"}]} |
| stream | Boolean | If true, the response is streamed back as it is generated. |
| options | Object | (Ollama only) Optional configuration parameters passed to the backend. |
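A minimal Python sketch of a streamed request to the Ollama-compatible endpoint. The "options" key follows the Ollama convention and the value shown is illustrative; the exact option keys supported by the backend are not listed here.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address

payload = {
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"}
    ],
    "stream": True,
    # "options" follows the Ollama convention; the key shown is illustrative.
    "options": {"temperature": 0.7}
}

# verify=False accepts the device's self-signed certificate
response = requests.post(f"https://{MODALIX_IP}:5000/v1/chat",
                         json=payload, stream=True, verify=False, timeout=60)
for line in response.iter_lines(decode_unicode=True):
    if line:
        print(line)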
Note
Using Custom System Prompts
To define a custom system behavior (system prompt), include a message with "role": "system" as the first element of your messages array. To ensure the model retains this persona across interactions, include this system message at the start of every request. The system prompt is cached on the first request; subsequent requests reuse the cached tokens automatically, so the system prompt does not need to be re-processed.
Example Structure:
# Initial Request
messages = [
{"role": "system", "content": "You are a helpful assistant that speaks in rhymes."},
{"role": "user", "content": "Hi there!"}
]
# Subsequent Inference (Append new user query after system prompt)
messages = [
{"role": "system", "content": "You are a helpful assistant that speaks in rhymes."},
{"role": "user", "content": "What is the weather?"}
]
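The following sketch shows this pattern against the OpenAI-compatible endpoint: the same system message leads every request, and after the first call the backend serves its tokens from the cache. The endpoint URL and verify=False handling follow the cURL and Python examples later in this chapter.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address
SYSTEM_MSG = {"role": "system",
              "content": "You are a helpful assistant that speaks in rhymes."}

def chat(user_text):
    # The system message leads every request; after the first call its tokens
    # are served from the cache, so they are not re-processed by the backend.
    payload = {"messages": [SYSTEM_MSG, {"role": "user", "content": user_text}],
               "stream": True}
    response = requests.post(f"https://{MODALIX_IP}:5000/v1/chat/completions",
                             json=payload, stream=True, verify=False, timeout=60)
    for line in response.iter_lines(decode_unicode=True):
        if line:
            print(line)

chat("Hi there!")             # initial request (system prompt gets cached)
chat("What is the weather?")  # subsequent request reuses the cached tokens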
Audio & Voice
POST /v1/audio/speech
Generates audio from input text (Text-to-Speech).
| Parameter | Type | Description |
|---|---|---|
|  | String | Required. The text to generate audio for. |
|  | String | The ID of the specific voice model to use. |
|  | String | Language code (e.g., 'en', 'fr'). Default is 'en'. |
|  | String | Audio format (e.g., 'wav'). Default is 'wav'. |
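A hedged Python sketch of a text-to-speech request. The JSON field names ("input", "voice", "language", "format") are assumptions mapped onto the parameter table above, which does not spell out the exact keys; adjust them if the application expects different names.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address

# NOTE: the field names "input", "voice", "language", and "format" are
# assumptions; adjust them if the application expects different keys.
payload = {
    "input": "Hello from the Modalix board.",
    "voice": "<voice-id>",   # an ID returned by GET /voices
    "language": "en",
    "format": "wav"
}

response = requests.post(f"https://{MODALIX_IP}:5000/v1/audio/speech",
                         json=payload, verify=False, timeout=60)
with open("speech.wav", "wb") as f:
    f.write(response.content)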
POST /v1/audio/transcriptions
Transcribes an audio file into text (Speech-to-Text).
| Parameter | Type | Description |
|---|---|---|
|  | File | Required. The audio file to transcribe (sent as multipart/form-data). |
|  | String | Language code of the audio content. Default is 'en'. |
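A hedged Python sketch of a transcription request. The multipart field names ("file", "language") are assumptions based on the table above.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address

# NOTE: the multipart field names "file" and "language" are assumptions.
with open("recording.wav", "rb") as audio:
    response = requests.post(f"https://{MODALIX_IP}:5000/v1/audio/transcriptions",
                             files={"file": audio},
                             data={"language": "en"},
                             verify=False, timeout=120)
print(response.text)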
GET /voices
Retrieves a list of available TTS voices.
| Parameter | Type | Description |
|---|---|---|
|  | String | Filter voices by language code (e.g., 'en'). |
POST /voices/select
Switches the active voice model for the system.
| Parameter | Type | Description |
|---|---|---|
|  | String | Required. The unique identifier of the voice to select (obtained from GET /voices). |
|  | String | The language context for the voice. Default is 'en'. |
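A hedged Python sketch for listing voices and switching the active voice. The "language" query parameter and the "voice_id"/"language" body fields are assumed names derived from the tables above.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address
BASE = f"https://{MODALIX_IP}:5000"

# List available TTS voices (the "language" query parameter name is assumed).
voices = requests.get(f"{BASE}/voices", params={"language": "en"},
                      verify=False, timeout=30)
print(voices.text)

# Switch the active voice (the "voice_id" and "language" field names are
# assumed; use an identifier returned by GET /voices).
select = requests.post(f"{BASE}/voices/select",
                       json={"voice_id": "<voice-id>", "language": "en"},
                       verify=False, timeout=30)
print(select.status_code, select.text)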
System & RAG Operations
POST /stop
Immediately halts any ongoing inference or processing tasks.
Parameters: None.
GET /raghealth
Checks the status of the Retrieval Augmented Generation (RAG) services.
Parameters: None.
Returns: JSON object with the status of the rag_db and rag_fps services.
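A quick Python check of RAG health; the response reports the status of the rag_db and rag_fps services.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address

# Check the RAG services; the JSON response reports the status of the
# rag_db and rag_fps services.
health = requests.get(f"https://{MODALIX_IP}:5000/raghealth",
                      verify=False, timeout=10)
print(health.text)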
POST /upload-to-rag
Uploads a document to the vector database for RAG context.
| Parameter | Type | Description |
|---|---|---|
|  | File | Required. The document to upload. Supported formats: PDF, TXT, MD, XML. Sent as multipart/form-data. |
POST /import-rag-db
Replaces the existing vector database with a provided DB file.
| Parameter | Type | Description |
|---|---|---|
|  | File | Required. The database (.db) file that replaces the existing vector database. |
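A hedged Python sketch for the two RAG file operations. The multipart field name "file" and the example filenames are assumptions; the upload accepts PDF, TXT, MD, or XML documents, and the import replaces the current vector database with the provided .db file.
import requests

MODALIX_IP = "<modalix-ip>"  # replace with your device's IP address
BASE = f"https://{MODALIX_IP}:5000"

# Upload a document (PDF, TXT, MD, or XML) into the RAG vector database.
# The multipart field name "file" and the filenames are assumptions.
with open("modalix_brief.pdf", "rb") as doc:
    upload = requests.post(f"{BASE}/upload-to-rag", files={"file": doc},
                           verify=False, timeout=300)
print(upload.status_code, upload.text)

# Replace the existing vector database with a prepared .db file.
with open("milvus.db", "rb") as db:
    imported = requests.post(f"{BASE}/import-rag-db", files={"file": db},
                             verify=False, timeout=300)
print(imported.status_code, imported.text)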
Endpoint Examples
This section provides practical examples of how to interact with the APIs exposed by the LLiMa endpoint, using both command-line tools and Python. The examples cover the /v1/chat/completions and /stop endpoints.
The following examples demonstrate how to send a request to the OpenAI-compatible /v1/chat/completions endpoint.
cURL (Terminal)
Use the -N flag to enable immediate streaming output and -k (insecure) if using the device's self-signed certificate. Replace <modalix-ip> with your device's IP address.
1. Text Generation
$ curl -N -k -X POST "https://<modalix-ip>:5000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "user", "content": "Why is the sky blue?" }
],
"stream": true
}'
2. Visual Language Model (Text + Image)
To send an image, encode it as a base64 data URI and pass it within the content array.
$ curl -N -k -X POST "https://<modalix-ip>:5000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"stream": true,
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"image": "data:image/jpeg;base64,<YOUR_BASE64_STRING_HERE>"
},
{
"type": "text",
"text": "Describe this image"
}
]
}
]
}'
Python
This Python snippet handles both text-only and multimodal (image + text) requests.
import base64
import requests

MODALIX_IP = "<modalix-ip>"   # replace with your device's IP address
USER_PROMPT = "Describe this image"

# Message format for text only
messages = [{"role": "user", "content": USER_PROMPT}]

# Message format for image + text (VLM): base64-encode the image first
with open("image.jpg", "rb") as f:
    encoded_img = base64.b64encode(f.read()).decode("utf-8")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": f"data:image/jpeg;base64,{encoded_img}"},
        {"type": "text", "text": USER_PROMPT}
    ]
}]

payload = {"messages": messages, "stream": True}
# verify=False accepts the device's self-signed certificate
response = requests.post(f"https://{MODALIX_IP}:5000/v1/chat/completions",
                         json=payload, stream=True, verify=False, timeout=60)

# Streamed response handling: print chunks as they arrive
for line in response.iter_lines(decode_unicode=True):
    if line:
        print(line)
Warning
Advanced: Direct Inference Port (9998)
For advanced debugging or raw performance monitoring, it is possible to bypass the application layer and hit the inference server directly on port 9998. These requests must be issued directly on the Modalix board.
URL: http://localhost:9998
Payload: {"text": "Your prompt here"}
Python Example:
import requests
url = "http://localhost:9998"
payload = {"text": "Describe the future of AI"}
# This returns raw text output, often including internal tokens or TPS metrics.
# Use with care as it bypasses standard formatting and safety checks.
response = requests.post(url, json=payload, timeout=30)
print(response.text)
If a generation runs too long or needs to be cancelled immediately (depending on the logic you implement), use the /stop endpoint.
cURL (Terminal)
$ curl -k -X POST https://<modalix-ip>:5000/stop
Python
import requests
MODALIX_IP = "192.168.1.20" # Replace with your device IP
URL = f"https://{MODALIX_IP}:5000/stop"
try:
    # Using verify=False to handle self-signed certs on the device
    response = requests.post(URL, verify=False, timeout=10)
    if response.status_code == 200:
        print("Inference stopped successfully.")
    else:
        print(f"Failed to stop. Status: {response.status_code}")
except Exception as e:
    print(f"Error: {e}")
Warning
Advanced: Direct Inference Port (9998)
For advanced debugging or raw performance monitoring, it is possible to bypass the application layer and stop the running inference directly on port 9998. These requests must be issued directly on the Modalix board.
URL: http://localhost:9998/stop
Python Example:
import requests
requests.post("http://localhost:9998/stop")