.. _MoLE - Modalix Language Model Evaluator:

MoLE - Modalix Language Model Evaluator
========================================

Overview
--------

MoLE (Modalix Language Model Evaluator) is a benchmarking tool for evaluating the accuracy and performance of LLMs running on the Modalix platform. It extends `EleutherAI's lm-evaluation-harness <https://github.com/EleutherAI/lm-evaluation-harness>`_ and supports two backends:

- **hf**: runs evaluation on the host using HuggingFace transformers (baseline reference)
- **modalix**: runs evaluation on a Modalix board via the ``llima benchmark-server``

Installation
------------

MoLE requires LLiMa to be installed on the Modalix device first. See :ref:`Introduction to LLiMa` for installation instructions.

Install MoLE on your Modalix device using ``sima-cli``:

.. code-block:: console

   host:~$ sima-cli install tools/mole

This installs MoLE into a virtual environment at ``~/sima-mole-venv``.

Usage
-----

First, activate the MoLE virtual environment:

.. code-block:: console

   host:~$ source ~/sima-mole-venv/bin/activate

MoLE is then invoked via the ``llima-benchmark`` CLI, which has two subcommands: ``accuracy`` and ``perf``. The ``<model_id>`` argument is always the HuggingFace model ID (e.g., ``meta-llama/Llama-3.2-3B-Instruct``).

Accuracy Benchmarking
~~~~~~~~~~~~~~~~~~~~~

Evaluates model quality against standard tasks:

.. code-block:: console

   (sima-mole-venv) host:~$ llima-benchmark accuracy <model_id> -b modalix \
       -t <tasks> \
       -o <output_dir> \
       --max_num_tokens <max_num_tokens> \
       --board_ip <board_ip> \
       --board_model <board_model>

.. list-table::
   :widths: 30 70
   :header-rows: 1

   * - Argument
     - Description
   * - ``model_id``
     - HuggingFace model ID (e.g., ``meta-llama/Llama-3.2-3B-Instruct``).
   * - ``-b``
     - Backend to use: ``modalix`` (run on board) or ``hf`` (run on host as reference baseline).
   * - ``-t``
     - **Required.** One or more evaluation tasks. Example tasks: ``hellaswag``, ``triviaqa``, ``piqa``, ``winogrande``, ``wikitext``. See the lm-evaluation-harness task list for all available tasks.
   * - ``-o``
     - Output directory for benchmark results.
   * - ``--board_ip``
     - IP address of the Modalix board. Required for ``-b modalix``.
   * - ``--board_model``
     - Path to the compiled model directory on the Modalix device (e.g., ``/media/nvme/llima/models/Llama-3.2-3B-Instruct-a16w4``). Required for ``-b modalix``.
   * - ``--max_num_tokens``
     - Maximum context length. Must be equal to or smaller than the value used during compilation.
   * - ``-n, --num_samples``
     - Number of samples to evaluate. Runs the full task set if not specified.
   * - ``--board_ssh_user``
     - SSH username for the Modalix board. Optional, default: ``sima``.
   * - ``--board_ssh_pass``
     - SSH password for the Modalix board. Optional. Set to enable non-interactive automated benchmarking.

.. important::

   Accuracy benchmarking with ``-b modalix`` requires the model to be compiled with the ``--return_logits`` flag (see :ref:`GenAI Model Compilation`). If the model was compiled without this flag, benchmarking will fail at runtime.

To use the HuggingFace backend as a reference baseline:

.. code-block:: console

   (sima-mole-venv) host:~$ llima-benchmark accuracy <model_id> -b hf -t <tasks> -o <output_dir>

For all available options, run ``llima-benchmark accuracy -h``.

Performance Benchmarking
~~~~~~~~~~~~~~~~~~~~~~~~

Measures Time To First Token (TTFT) and Tokens Per Second (TPS) on a Modalix board for different input lengths:

.. code-block:: console

   (sima-mole-venv) host:~$ llima-benchmark perf <model_id> \
       -o <output_dir> \
       --board_ip <board_ip> \
       --board_model <board_model> \
       --max_num_tokens <max_num_tokens> --max_new_tokens <max_new_tokens>

.. list-table::
   :widths: 30 70
   :header-rows: 1

   * - Argument
     - Description
   * - ``model_id``
     - HuggingFace model ID (e.g., ``meta-llama/Llama-3.2-3B-Instruct``).
   * - ``-o``
     - Output directory for benchmark results.
   * - ``--board_ip``
     - IP address of the Modalix board.
   * - ``--board_model``
     - Path to the compiled model directory on the Modalix device (e.g., ``/media/nvme/llima/models/Llama-3.2-3B-Instruct-a16w4``).
   * - ``--max_num_tokens``
     - Maximum context length. Must be equal to or smaller than the value used during compilation.
   * - ``--max_new_tokens``
     - Maximum number of tokens to generate in the output.
   * - ``--board_ssh_user``
     - SSH username for the Modalix board. Optional, default: ``sima``.
   * - ``--board_ssh_pass``
     - SSH password for the Modalix board. Optional. Set to enable non-interactive automated benchmarking.

For all available options, run ``llima-benchmark perf -h``.
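
When accuracy runs are scripted (for example, to compare the ``hf`` baseline against the ``modalix`` backend), it can help to assemble the ``llima-benchmark accuracy`` argument list in one place and check the backend-specific requirements before launching anything. The following is a minimal sketch and not part of MoLE: the helper name ``build_accuracy_command`` is hypothetical, the position of the model ID directly after the subcommand is an assumption, and the task value is passed through verbatim because this page does not specify how multiple tasks are separated. The IP address and output directory in the usage example are placeholders.

.. code-block:: python

   import shlex
   from typing import List, Optional


   def build_accuracy_command(
       model_id: str,
       tasks: str,
       output_dir: str,
       backend: str = "modalix",
       board_ip: Optional[str] = None,
       board_model: Optional[str] = None,
       max_num_tokens: Optional[int] = None,
       num_samples: Optional[int] = None,
   ) -> List[str]:
       """Assemble an argument list for ``llima-benchmark accuracy``.

       Hypothetical helper: only flags documented on this page are used.
       """
       if backend not in ("modalix", "hf"):
           raise ValueError(f"unknown backend: {backend!r}")
       # The argument table marks both board options as required for -b modalix.
       if backend == "modalix" and not (board_ip and board_model):
           raise ValueError("--board_ip and --board_model are required for -b modalix")

       # Assumption: the model ID is a positional argument after the subcommand.
       cmd = ["llima-benchmark", "accuracy", model_id,
              "-b", backend, "-t", tasks, "-o", output_dir]
       if backend == "modalix":
           cmd += ["--board_ip", board_ip, "--board_model", board_model]
       if max_num_tokens is not None:
           cmd += ["--max_num_tokens", str(max_num_tokens)]
       if num_samples is not None:
           cmd += ["-n", str(num_samples)]
       return cmd


   if __name__ == "__main__":
       cmd = build_accuracy_command(
           "meta-llama/Llama-3.2-3B-Instruct",
           "hellaswag",
           "results",                      # placeholder output directory
           board_ip="192.168.1.20",        # placeholder board address
           board_model="/media/nvme/llima/models/Llama-3.2-3B-Instruct-a16w4",
           num_samples=10,
       )
       # Print the command for review; pass ``cmd`` to subprocess.run to execute it.
       print(shlex.join(cmd))

Returning a list rather than a single string avoids shell-quoting problems when the list is later handed to ``subprocess.run``, and failing early on a missing ``--board_ip`` or ``--board_model`` avoids starting a run that cannot reach the board.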