.. _Runtime & Orchestration:

Runtime & Orchestration
=======================

This chapter details the structure of the LLiMa application on the Modalix device, how to launch the application using the run script, and how to use it as an OpenAI-compatible endpoint.

LLiMa Structure on Modalix
~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, the LLiMa application and its associated models are installed in the ``llima`` directory. The structure separates the compiled model artifacts from the demo orchestration logic.

The following tree illustrates the directory layout and the purpose of key files:

.. code-block:: text

    /media/nvme/llima/
    ├── gte-small-local/                   # Local embedding model for RAG (Retrieval Augmented Generation)
    ├── llama-3.2-3B-Instruct-a16w4-102/   # Pre-compiled LLM artifacts (weights and configuration); can be installed directly from the SiMa HF repo
    ├── whisper-small-a16w8/               # Pre-compiled Speech-to-Text model artifacts
    └── simaai-genai-demo/                 # Main LLiMa application orchestration directory
        ├── app.py                         # Main Python entry point for the frontend
        ├── run.sh                         # Shell script to launch the application in its various modes
        ├── install.sh                     # Setup script that installs dependencies and the environment; run as part of the LLiMa install
        ├── milvus.db                      # Vector database storing RAG embeddings, pre-populated with brief Modalix documentation for testing and demos
        ├── vectordb/                      # Storage configuration for the vector DB
        └── *.log                          # Application logs (app.log, run.log) and server/console logs

Running LLiMa
~~~~~~~~~~~~~

The primary method for launching the application is the ``run.sh`` script located in the ``simaai-genai-demo`` directory. This script sets up the environment, handles model selection, and starts the necessary backend and frontend services.

**Command Syntax**

.. code-block:: console

    ./run.sh [options]

**Arguments**

The script supports several flags to customize the deployment mode (e.g., headless API, UI-only) or integrate with external services such as RAG servers.

.. list-table::
   :widths: 30 70
   :header-rows: 1

   * - Argument
     - Description
   * - ``--ragfps <IP_ADDRESS>``
     - Connects the application to an external RAG (Retrieval Augmented Generation) file processing server at the specified IP address. The script assumes the server is listening on port 7860.
   * - ``--httponly``
     - Starts the application server in HTTP-only mode, serving over HTTP instead of HTTPS.
   * - ``--backend-only``
     - Launches only the ``sima_lmm`` service (the LLiMa inference engine) without the Web UI.
   * - ``--api-only``
     - Starts the OpenAI-compatible API endpoints without enabling the Web UI or TTS services. Ideal for headless integrations.
   * - ``--system-prompt-file <FILE_PATH>``
     - Provides a text file containing a custom system prompt that overrides the default behavior.
   * - ``--cli``
     - Starts the backend in interactive CLI mode rather than as a background process.
   * - ``-h, --help``
     - Displays the help message and exits.

.. note::

   The ``run.sh`` script automatically scans the parent directory (``../``) for available model folders. If multiple valid models are detected, the script prompts you to interactively select which model to load.

Runtime Examples
~~~~~~~~~~~~~~~~

The ``run.sh`` script is versatile and can be configured for different deployment needs, from a full secure web application to isolated component testing.

**1. Default Mode (Secure Web App)**

Running the script without arguments launches the full LLiMa experience. This includes the backend API and the web-based User Interface (accessible via HTTPS), and loads the Speech-to-Text (STT) and Text-to-Speech (TTS) models.

.. code-block:: console

    modalix:~/llima/simaai-genai-demo$ ./run.sh

**2. HTTP-Only Mode (Insecure but Faster)**

Use this flag to launch the full application (endpoints + UI) over HTTP instead of HTTPS. This is useful for local testing or when a proxy in front of the device handles SSL termination.

.. code-block:: console

    modalix:~/llima/simaai-genai-demo$ ./run.sh --httponly

**3. CLI Mode**

If you prefer working directly in the terminal, without a web interface or background web server, use CLI mode.

.. code-block:: console

    modalix:~/llima/simaai-genai-demo$ ./run.sh --cli

**4. Backend Only (Inference Server)**

This mode launches *only* the LLM inference backend server. It is ideal when you need the processing power of the LLiMa engine but intend to connect it to a custom frontend or an external application.

.. code-block:: console

    modalix:~/llima/simaai-genai-demo$ ./run.sh --backend-only

**5. Advanced Combinations**

You can combine flags to tailor the runtime environment. For example, the following command launches the API endpoints over HTTP without loading the Web UI or TTS/STT services (a lightweight inference server).

.. code-block:: console

    modalix:~/llima/simaai-genai-demo$ ./run.sh --httponly --api-only
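After a headless launch such as ``--httponly --api-only``, a quick smoke test against the chat endpoint confirms that both the API layer and the inference backend are up. The sketch below is a minimal example under stated assumptions: the device IP is a placeholder, the server is assumed to answer over plain HTTP on port 5000 in ``--httponly`` mode, and the endpoint is assumed to accept ``"stream": false``. The endpoint and its parameters are documented in the next section.

.. code-block:: python

   import requests

   # Hypothetical values: replace with your Modalix device IP. Port 5000 matches the
   # endpoint examples later in this chapter; with --httponly the server is assumed
   # to answer over plain HTTP on the same port.
   MODALIX_IP = "192.168.1.20"
   URL = f"http://{MODALIX_IP}:5000/v1/chat/completions"

   payload = {
       "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
       "stream": False,  # assumes the endpoint honours a non-streaming request
   }

   # A 200 status code with a short completion means the API layer and the
   # inference backend are both reachable.
   response = requests.post(URL, json=payload, timeout=120)
   print(response.status_code)
   print(response.text)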
API Endpoint Reference
~~~~~~~~~~~~~~~~~~~~~~

The LLiMa application exposes a RESTful API for chat generation, audio processing, and system control. The following sections detail the available endpoints and their parameters.

**Chat & Generation**

The application supports two endpoints for text generation, allowing integration with clients that expect either the OpenAI or the Ollama format. Both endpoints trigger the inference backend and support streaming.

**Endpoints**

* **OpenAI compatible:** ``POST /v1/chat/completions``
* **Ollama compatible:** ``POST /v1/chat``

**Parameters**

.. list-table::
   :widths: 20 20 60
   :header-rows: 1

   * - Parameter
     - Type
     - Description
   * - ``messages``
     - Array
     - **Required**. A list of message objects representing the conversation history.

       For **Text-only**: use a simple string content: ``{"role": "user", "content": "Hello"}``

       For **Vision (VLM)**: use an array of content parts to send text and images together. Images must be base64-encoded data URIs. Example structure:

       .. code-block:: json

          {
            "role": "user",
            "content": [
              {"type": "image", "image": "data:image/jpeg;base64,<BASE64_STRING>"},
              {"type": "text", "text": "Describe this image"}
            ]
          }
   * - ``stream``
     - Boolean
     - If ``true``, the response is streamed. Default is ``true`` for OpenAI and ``false`` for Ollama.
   * - ``options``
     - Object
     - *(Ollama only)* Optional configuration parameters passed to the backend.

.. note::

   **Using Custom System Prompts**

   To define custom system behavior (a system prompt), include a message with ``"role": "system"`` as the *first* element of your ``messages`` array. To ensure the model retains this persona across interactions, include this system message at the start of every request. The first time it is sent, the system prompt is cached; on subsequent requests the cached tokens are reused automatically, so the system prompt tokens do not need to be re-processed.

   **Example Structure:**

   .. code-block:: python

      # Initial request
      messages = [
          {"role": "system", "content": "You are a helpful assistant that speaks in rhymes."},
          {"role": "user", "content": "Hi there!"}
      ]

      # Subsequent inference (append the new user query after the system prompt)
      messages = [
          {"role": "system", "content": "You are a helpful assistant that speaks in rhymes."},
          {"role": "user", "content": "What is the weather?"}
      ]
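Both chat endpoints support streaming responses. The exact chunk format is not specified in this reference, so the following is only a minimal consumer sketch assuming OpenAI-style server-sent events (``data: {...}`` lines ending with ``data: [DONE]``), which is the usual convention for OpenAI-compatible servers; adjust the parsing if your build streams a different format.

.. code-block:: python

   import json
   import requests

   MODALIX_IP = "192.168.1.20"  # Hypothetical device IP; replace with yours
   URL = f"https://{MODALIX_IP}:5000/v1/chat/completions"

   payload = {
       "messages": [{"role": "user", "content": "Why is the sky blue?"}],
       "stream": True,
   }

   # verify=False accepts the device's self-signed certificate.
   with requests.post(URL, json=payload, stream=True, verify=False, timeout=120) as response:
       for line in response.iter_lines(decode_unicode=True):
           if not line or not line.startswith("data:"):
               continue  # skip keep-alive blanks and non-data lines
           chunk = line[len("data:"):].strip()
           if chunk == "[DONE]":
               break  # assumed end-of-stream sentinel (OpenAI convention)
           # Assumed OpenAI-style delta payload; adjust the keys if the format differs.
           delta = json.loads(chunk)["choices"][0]["delta"]
           print(delta.get("content", ""), end="", flush=True)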
----

**Audio & Voice**

**POST** ``/v1/audio/speech``

Generates audio from input text (Text-to-Speech).

.. list-table::
   :widths: 20 20 60
   :header-rows: 1

   * - Parameter
     - Type
     - Description
   * - ``input``
     - String
     - **Required**. The text to generate audio for.
   * - ``voice``
     - String
     - The ID of the specific voice model to use.
   * - ``language``
     - String
     - Language code (e.g., "en", "fr"). Default is "en".
   * - ``response_format``
     - String
     - Audio format (e.g., "wav"). Default is "wav".

**POST** ``/v1/audio/transcriptions``

Transcribes an audio file into text (Speech-to-Text).

.. list-table::
   :widths: 20 20 60
   :header-rows: 1

   * - Parameter
     - Type
     - Description
   * - ``file``
     - File
     - **Required**. The audio file to transcribe (sent as ``multipart/form-data``).
   * - ``language``
     - String
     - Language code of the audio content. Default is "en".

**GET** ``/voices``

Retrieves a list of available TTS voices.

.. list-table::
   :widths: 20 20 60
   :header-rows: 1

   * - Parameter
     - Type
     - Description
   * - ``lang``
     - String
     - Filter voices by language code (e.g., ``?lang=en``). Default is "en".

**POST** ``/voices/select``

Switches the active voice model for the system.

.. list-table::
   :widths: 20 20 60
   :header-rows: 1

   * - Parameter
     - Type
     - Description
   * - ``voiceId``
     - String
     - **Required**. The unique identifier of the voice to select (obtained from ``/voices``).
   * - ``lang``
     - String
     - The language context for the voice. Default is "en".
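As a usage sketch for the two audio endpoints above, the snippet below synthesizes a short sentence and then sends the resulting file back for transcription. It relies on several assumptions not guaranteed by this reference: placeholder device IP and port 5000, a self-signed certificate (``verify=False``), a JSON request body for ``/v1/audio/speech``, and a WAV payload in its response.

.. code-block:: python

   import requests

   MODALIX_IP = "192.168.1.20"  # Hypothetical device IP; replace with yours
   BASE = f"https://{MODALIX_IP}:5000"

   # Text-to-Speech: request a WAV file for a short sentence (default voice, English).
   tts = requests.post(
       f"{BASE}/v1/audio/speech",
       json={"input": "Hello from Modalix.", "language": "en", "response_format": "wav"},
       verify=False,
       timeout=120,
   )
   with open("hello.wav", "wb") as f:
       f.write(tts.content)

   # Speech-to-Text: send the same WAV back as multipart/form-data for transcription.
   with open("hello.wav", "rb") as f:
       stt = requests.post(
           f"{BASE}/v1/audio/transcriptions",
           files={"file": ("hello.wav", f, "audio/wav")},
           data={"language": "en"},
           verify=False,
           timeout=120,
       )
   print(stt.text)  # response body format (JSON vs. plain text) depends on the server build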
----

**System & RAG Operations**

**POST** ``/stop``

Immediately halts any ongoing inference or processing tasks.

* **Parameters:** None.

**GET** ``/raghealth``

Checks the status of the Retrieval Augmented Generation (RAG) services.

* **Parameters:** None.
* **Returns:** JSON object with the status of the ``rag_db`` and ``rag_fps`` services.

**POST** ``/upload-to-rag``

Uploads a document to the vector database for RAG context.

.. list-table::
   :widths: 20 20 60
   :header-rows: 1

   * - Parameter
     - Type
     - Description
   * - ``file``
     - File
     - **Required**. The document to upload. Supported formats: PDF, TXT, MD, XML. Sent as ``multipart/form-data``.

**POST** ``/import-rag-db``

Replaces the existing vector database with a provided DB file.

.. list-table::
   :widths: 20 20 60
   :header-rows: 1

   * - Parameter
     - Type
     - Description
   * - ``dbfile``
     - File
     - **Required**. The ``.db`` file to restore.
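As a quick illustration of the RAG operations above, the sketch below checks ``/raghealth`` and then uploads a Markdown document through ``/upload-to-rag``. The device IP, port 5000, and the ``notes.md`` file are placeholders; only the endpoint paths and the ``file`` form field come from the reference above.

.. code-block:: python

   import requests

   MODALIX_IP = "192.168.1.20"  # Hypothetical device IP; replace with yours
   BASE = f"https://{MODALIX_IP}:5000"

   # 1. Confirm that the RAG services are up before uploading anything.
   health = requests.get(f"{BASE}/raghealth", verify=False, timeout=30)
   print(health.json())  # expected to report the rag_db and rag_fps status

   # 2. Upload a document (PDF, TXT, MD, or XML) as multipart/form-data.
   with open("notes.md", "rb") as f:
       upload = requests.post(
           f"{BASE}/upload-to-rag",
           files={"file": ("notes.md", f, "text/markdown")},
           verify=False,
           timeout=300,
       )
   print(upload.status_code, upload.text)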
Endpoint Examples
~~~~~~~~~~~~~~~~~

This section provides practical examples of how to interact with the APIs exposed by the LLiMa endpoint, using both command-line tools and Python. The examples cover the ``/v1/chat/completions`` and ``/stop`` endpoints.

.. tabs::

   .. tab:: Chat Completions

      The following examples demonstrate how to send a request to the OpenAI-compatible ``/v1/chat/completions`` endpoint.

      **cURL (Terminal)**

      Use the ``-N`` flag to enable immediate streaming output and ``-k`` (insecure) if using the device's self-signed certificate. Replace ``<MODALIX_IP>`` with your device's IP address.

      **1. Text Generation**

      .. code-block:: bash

         $ curl -N -k -X POST "https://<MODALIX_IP>:5000/v1/chat/completions" \
             -H "Content-Type: application/json" \
             -d '{
               "messages": [
                 {
                   "role": "user",
                   "content": "Why is the sky blue?"
                 }
               ],
               "stream": true
             }'

      **2. Visual Language Model (Text + Image)**

      To send an image, encode it as a base64 data URI and pass it within the ``content`` array.

      .. code-block:: bash

         $ curl -N -k -X POST "https://<MODALIX_IP>:5000/v1/chat/completions" \
             -H "Content-Type: application/json" \
             -d '{
               "stream": true,
               "messages": [
                 {
                   "role": "user",
                   "content": [
                     {
                       "type": "image",
                       "image": "data:image/jpeg;base64,<BASE64_STRING>"
                     },
                     {
                       "type": "text",
                       "text": "Describe this image"
                     }
                   ]
                 }
               ]
             }'

      **Python**

      The following snippet, excerpted from the full example Python script, handles both text-only and multimodal (image + text) requests.

      .. code-block:: python

         # ... initializations and imports ...

         # Message format for text only
         messages = [{"role": "user", "content": USER_PROMPT}]

         # Message format for image + text
         messages = [{
             "role": "user",
             "content": [
                 {"type": "image", "image": f"data:image/jpeg;base64,{encoded_img}"},
                 {"type": "text", "text": USER_PROMPT}
             ]
         }]

         payload = {"messages": messages, "stream": True}
         response = requests.post("https://<MODALIX_IP>:5000/v1/chat/completions",
                                  json=payload, stream=True, verify=False, timeout=60)

         # ... streamed response handling ...

      .. warning::

         **Advanced: Direct Inference Port (9998)**

         For advanced debugging or raw performance monitoring, it is possible to bypass the application layer and hit the inference server directly on port ``9998``. This must be run entirely from the Modalix board.

         * **URL:** ``http://localhost:9998``
         * **Payload:** ``{"text": "Your prompt here"}``

         **Python Example:**

         .. code-block:: python

            import requests

            url = "http://localhost:9998"
            payload = {"text": "Describe the future of AI"}

            # This returns raw text output, often including internal tokens or TPS metrics.
            # Use with care, as it bypasses standard formatting and safety checks.
            response = requests.post(url, json=payload, timeout=30)
            print(response.text)

   .. tab:: Stopping Inference

      If a generation is running too long, or needs to be cancelled immediately by logic you implement, use the ``/stop`` endpoint.

      **cURL (Terminal)**

      .. code-block:: bash

         $ curl -k -X POST https://<MODALIX_IP>:5000/stop

      **Python**

      .. code-block:: python

         import requests

         MODALIX_IP = "192.168.1.20"  # Replace with your device IP
         URL = f"https://{MODALIX_IP}:5000/stop"

         try:
             # Using verify=False to handle self-signed certs on the device
             response = requests.post(URL, verify=False, timeout=10)

             if response.status_code == 200:
                 print("✅ Inference stopped successfully.")
             else:
                 print(f"⚠️ Failed to stop. Status: {response.status_code}")
         except Exception as e:
             print(f"❌ Error: {e}")

      .. warning::

         **Advanced: Direct Inference Port (9998)**

         For advanced debugging or raw performance monitoring, it is possible to bypass the application layer and stop a running inference directly on port ``9998``. This must be run entirely from the Modalix board.

         * **URL:** ``http://localhost:9998/stop``

         **Python Example:**

         .. code-block:: python

            import requests

            requests.post("http://localhost:9998/stop")
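Because ``/v1/chat/completions`` follows the OpenAI request format, OpenAI-style client libraries can, in principle, be pointed directly at the device. The sketch below is an assumption-based example, not part of the shipped demo: it uses the official ``openai`` Python package with a custom ``base_url``, a placeholder API key and model name (the model is selected by ``run.sh`` at launch, so the value may be ignored), and an ``httpx`` client that accepts the self-signed certificate.

.. code-block:: python

   import httpx
   from openai import OpenAI

   MODALIX_IP = "192.168.1.20"  # Hypothetical device IP; replace with yours

   # Point the client at the device; the API key is a placeholder since the demo
   # endpoint is not expected to enforce authentication.
   client = OpenAI(
       base_url=f"https://{MODALIX_IP}:5000/v1",
       api_key="not-needed",
       http_client=httpx.Client(verify=False),  # accept the self-signed certificate
   )

   # The model name is a placeholder; the loaded model is chosen by run.sh on the device.
   stream = client.chat.completions.create(
       model="llama-3.2-3B-Instruct-a16w4-102",
       messages=[{"role": "user", "content": "Why is the sky blue?"}],
       stream=True,
   )

   for chunk in stream:
       delta = chunk.choices[0].delta
       if delta.content:
           print(delta.content, end="", flush=True)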