.. _model_executor:

ModelExecutor
#############

``ModelExecutor`` is an on-device neural network inference API for the SiMa.ai MLSoC. It abstracts GStreamer pipeline setup, model loading, input preprocessing, and output retrieval behind a simple C++ and Python interface.

.. contents:: On this page
   :local:
   :depth: 2

Overview
--------

Given a model packaged as a ``.tar.gz`` archive, ``ModelExecutor``:

1. Extracts and validates the archive
2. Reads the pipeline, preprocessing, and quantization configuration
3. Constructs and launches a GStreamer pipeline (``appsrc → preprocessing → MLA → postprocessing → appsink``)
4. Handles frame injection, output retrieval, and memory management

Both **synchronous** (blocking) and **asynchronous** (callback-based) inference modes are supported, along with an integrated profiling mode for KPI measurement.

Architecture
------------

.. code-block:: text

   ┌──────────────────────────────────────────────────────────────────┐
   │                          ModelExecutor                           │
   │                                                                  │
   │  init(model.tar.gz, options)                                     │
   │  ├─ Extract archive → /var/tmp/modelExecutor/models/             │
   │  ├─ Parse pipeline_sequence.json, preprocessing.json             │
   │  └─ Launch GStreamer pipeline:                                   │
   │       appsrc → [preproc] → MLA → [postproc] → appsink            │
   │                                                                  │
   │  runSynchronous(inputs) ──────────────────────► outputs          │
   │                                                                  │
   │  runAsynchronous(inputs, metadata, callback)                     │
   │  ├─ Pending queue (max 8)                                        │
   │  ├─ Buffer prep worker                                           │
   │  ├─ Async producer → appsrc                                      │
   │  ├─ Appsink consumer                                             │
   │  ├─ Copy workers (×4)                                            │
   │  └─ Callback worker ──────────────────────► callback(output)    │
   └──────────────────────────────────────────────────────────────────┘

Installation
------------

``ModelExecutor`` and its Python bindings come **preinstalled** on both eLxr and Yocto images. No separate installation step is needed.

Example applications are available on the device at ``/usr/local/simaai/examples/model_executor/``.

Threading Model
---------------
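
The callback guarantee described in this section (outputs may be produced by several internal threads, but user callbacks always arrive on a single callback worker thread, one at a time) can be sketched with Python's standard library. Everything below is illustrative; the queue, worker, and producer names are stand-ins and are **not** part of the ModelExecutor API:

.. code-block:: python

   import queue
   import threading

   # Illustrative sketch only: mimics the single callback worker draining
   # a queue of finished outputs. Because exactly one thread delivers
   # results, user callbacks never overlap, even when many producer
   # threads (like the copy workers) enqueue outputs concurrently.
   results = queue.Queue()
   delivered = []

   def callback_worker():
       while True:
           item = results.get()
           if item is None:          # sentinel, as stop() would send
               break
           delivered.append(item)    # user callback runs here, serialized

   worker = threading.Thread(target=callback_worker)
   worker.start()

   # Four producer threads enqueue outputs concurrently.
   producers = [threading.Thread(target=results.put, args=(i,)) for i in range(4)]
   for p in producers:
       p.start()
   for p in producers:
       p.join()

   results.put(None)   # signal shutdown and join the worker
   worker.join()

   print(sorted(delivered))   # every output delivered exactly once

Because ``delivered`` is only ever touched by the single worker thread, no locking is needed inside the callback itself; the same reasoning applies to user callbacks in asynchronous mode.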
.. list-table::
   :widths: 40 60
   :header-rows: 1

   * - **Operation**
     - **Thread Safety**
   * - ``init()`` / ``stop()``
     - Must be called from the **same thread**
   * - ``runSynchronous()``
     - Caller must **serialize** calls (single-threaded)
   * - ``runAsynchronous()``
     - **Thread-safe** — concurrent calls from multiple threads allowed
   * - Callbacks
     - Always invoked on the **callback worker thread**; serialized one at a time

Asynchronous mode spawns its internal worker threads on first use: a buffer-prep worker, an async producer, an appsink consumer, four copy workers, and a callback worker. All threads are joined on ``stop()``.

Model Archive Format
--------------------

The ``.tar.gz`` archive must contain:

.. code-block:: text

   {project_name}/
   ├── etc/
   │   ├── pipeline_sequence.json   GStreamer plugin chain definition
   │   ├── preprocessing.json       Input size, format, and normalization metadata
   │   └── quantization.json        Quantization parameters
   └── lib/
       ├── *.elf                    Model binary (EV74 kernel)
       └── *.so                     Model shared library (A65 kernel)

Archives are extracted to ``/var/tmp/modelExecutor/models/`` on the first ``init()`` call.

Known Limitations
-----------------

- **EV74 kernel: maximum tensor dimension of 4096.** Use ``KernelType::A65`` for larger tensors.
- **Maximum of 15 pipeline segments per model.** ``init()`` fails if this limit is exceeded.
- **No mixed-precision models.** All layers must use a single uniform precision (INT8, INT16, or BF16).
- **Python** ``runSynchronous()`` **returns only the first output tensor.** Use ``runAsynchronous()`` or the C++ API to retrieve all outputs from multi-output models.
- **No re-initialization without** ``stop()``. Calling ``init()`` again without first calling ``stop()`` is not supported.
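
The archive layout documented under *Model Archive Format* can be sanity-checked off-device before the first ``init()``. The following is a minimal sketch using Python's ``tarfile``; the ``demo/`` project name and file contents are placeholders standing in for ``{project_name}`` and a real model, not part of any SiMa.ai tooling:

.. code-block:: python

   import io
   import tarfile

   # Required configuration entries per the documented archive layout;
   # "demo" is a placeholder project name.
   REQUIRED = [
       "demo/etc/pipeline_sequence.json",
       "demo/etc/preprocessing.json",
       "demo/etc/quantization.json",
   ]

   # Build an in-memory archive matching the layout (stand-in contents).
   buf = io.BytesIO()
   with tarfile.open(fileobj=buf, mode="w:gz") as tar:
       for name in REQUIRED + ["demo/lib/model.elf"]:
           data = b"{}" if name.endswith(".json") else b"\x00"
           info = tarfile.TarInfo(name)
           info.size = len(data)
           tar.addfile(info, io.BytesIO(data))

   # Re-open the archive and verify every required entry is present.
   buf.seek(0)
   with tarfile.open(fileobj=buf, mode="r:gz") as tar:
       names = set(tar.getnames())

   missing = [n for n in REQUIRED if n not in names]
   print("missing:", missing)   # an empty list means the layout checks out

A check like this catches a misnamed ``etc/`` file or a missing kernel binary at packaging time, rather than as an ``init()`` failure on the device.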