MoLE - Modalix Language Model Evaluator
Overview
MoLE (Modalix Language Model Evaluator) is a benchmarking tool for evaluating the accuracy and performance of LLMs running on the Modalix platform.
It extends EleutherAI’s lm-evaluation-harness and supports two backends:
- hf — runs evaluation on the host using HuggingFace transformers (baseline reference)
- modalix — runs evaluation on a Modalix board via the llima benchmark-server
Installation
MoLE requires LLiMa to be installed on the Modalix device first. See Introduction to LLiMa for installation instructions.
Install MoLE on your host machine using sima-cli:
host:~$ sima-cli install tools/mole
This installs MoLE into a virtual environment at ~/sima-mole-venv.
Usage
First, activate the MoLE virtual environment:
host:~$ source ~/sima-mole-venv/bin/activate
MoLE is then invoked via the llima-benchmark CLI, which provides two subcommands: accuracy and perf. The <model_id> argument is always the HuggingFace model ID (e.g., meta-llama/Llama-3.2-3B-Instruct).
Accuracy Benchmarking
Evaluates model quality against standard tasks:
(sima-mole-venv) host:~$ llima-benchmark accuracy <model_id> -b modalix \
-t <task> \
-o <output_dir> \
--max_num_tokens <max_num_tokens> \
--board_ip <board_ip> \
--board_model <model_path_on_board>
| Argument | Description |
|---|---|
| <model_id> | HuggingFace model ID (e.g., meta-llama/Llama-3.2-3B-Instruct). |
| -b | Backend to use: hf or modalix. |
| -t | Required. One or more evaluation tasks. Example tasks: … |
| -o | Output directory for benchmark results. |
| --board_ip | IP address of the Modalix board. Required for the modalix backend. |
| --board_model | Path to the compiled model directory on the Modalix device (e.g., …). |
| --max_num_tokens | Maximum context length. Must be equal to or smaller than the value used during compilation. |
| … | Number of samples to evaluate. Runs the full task set if not specified. |
| … | SSH username for the Modalix board. Optional, default: … |
| … | SSH password for the Modalix board. Optional. Set to enable non-interactive automated benchmarking. |
Important
Accuracy benchmarking with -b modalix requires the model to be compiled with the --return_logits flag (see GenAI Model Compilation). If the model was compiled without this flag, benchmarking will fail at runtime.
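The logits requirement stems from how lm-evaluation-harness scores most accuracy tasks: it compares the log-likelihood the model assigns to each candidate answer, which requires per-step vocabulary logits rather than just sampled tokens. The following is a minimal, generic sketch of that scoring scheme with toy logits and token IDs — an illustration of the idea, not MoLE's actual code:

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits)
    z = math.log(sum(math.exp(x - m) for x in logits)) + m
    return [x - z for x in logits]

def continuation_loglik(step_logits, token_ids):
    """Sum the log-probability the model assigns to each token of a
    candidate continuation; step_logits[i] are the vocab logits at step i."""
    return sum(log_softmax(logits)[tok]
               for logits, tok in zip(step_logits, token_ids))

# Toy example: 4-token vocabulary, two single-token answer choices.
step_logits = [[2.0, 0.5, -1.0, 0.0]]   # logits for one generation step
choice_a = [0]                           # answer A is token 0
choice_b = [2]                           # answer B is token 2

scores = [continuation_loglik(step_logits, c) for c in (choice_a, choice_b)]
pred = max(range(len(scores)), key=scores.__getitem__)
print(pred)  # token 0 carries the highest logit, so choice A wins -> 0
```

Without logits returned from the board, these per-choice likelihoods cannot be computed, which is why the run fails at runtime.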
To use the HuggingFace backend as a reference baseline:
(sima-mole-venv) host:~$ llima-benchmark accuracy <model_id> -b hf -t <task> -o <output_dir>
For all available options, run llima-benchmark accuracy -h.
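After a run completes, the output directory contains JSON result files. As one way to read them, here is a sketch that assumes the lm-evaluation-harness convention of a top-level "results" mapping from task name to metrics; the exact files MoLE writes may differ, so inspect your output directory first:

```python
import json
import tempfile
from pathlib import Path

def summarize_results(output_dir):
    """Print numeric per-task metrics from every JSON file under output_dir.
    Assumes the lm-evaluation-harness layout:
    {"results": {task_name: {metric_name: value}}}."""
    for path in sorted(Path(output_dir).glob("**/*.json")):
        data = json.loads(path.read_text())
        for task, metrics in data.get("results", {}).items():
            for name, value in metrics.items():
                if isinstance(value, (int, float)):
                    print(f"{task:20s} {name:15s} {value:.4f}")

# Demo with a synthetic results file (stand-in for a real benchmark run):
out_dir = Path(tempfile.mkdtemp())
(out_dir / "results.json").write_text(
    json.dumps({"results": {"hellaswag": {"acc": 0.71, "acc_stderr": 0.01}}})
)
summarize_results(out_dir)
```

The task name "hellaswag" and the metric values above are placeholders for whatever tasks you actually ran.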
Performance Benchmarking
Measures Time To First Token (TTFT) and Tokens Per Second (TPS) on a Modalix board for different input lengths:
(sima-mole-venv) host:~$ llima-benchmark perf <model_id> \
-o <output_dir> \
--board_ip <board_ip> \
--board_model <model_path_on_board> \
--max_num_tokens <max_num_tokens> --max_new_tokens <max_new_tokens>
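The two metrics can be made precise with a small sketch using hypothetical timestamps. This follows one common convention (TTFT measured to the first generated token, TPS over the remaining decode steps); the exact definitions MoLE uses may differ:

```python
def ttft_and_tps(request_start, token_times):
    """Compute Time To First Token and Tokens Per Second from a request
    start time and per-token completion timestamps (in seconds).
    TTFT: latency until the first generated token appears.
    TPS: decode throughput over the tokens after the first."""
    ttft = token_times[0] - request_start
    n_decode = len(token_times) - 1
    tps = n_decode / (token_times[-1] - token_times[0]) if n_decode else 0.0
    return ttft, tps

# Hypothetical run: first token after 0.5 s, then one token every 25 ms.
start = 0.0
times = [0.5 + 0.025 * i for i in range(41)]  # 41 tokens total
ttft, tps = ttft_and_tps(start, times)
print(f"TTFT = {ttft:.3f} s, TPS = {tps:.1f} tok/s")  # TTFT = 0.500 s, TPS = 40.0 tok/s
```

Longer inputs mostly increase TTFT (prefill cost grows with context length), while TPS reflects steady-state decode speed, which is why the benchmark sweeps different input lengths.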
| Argument | Description |
|---|---|
| <model_id> | HuggingFace model ID (e.g., meta-llama/Llama-3.2-3B-Instruct). |
| -o | Output directory for benchmark results. |
| --board_ip | IP address of the Modalix board. |
| --board_model | Path to the compiled model directory on the Modalix device (e.g., …). |
| --max_num_tokens | Maximum context length. Must be equal to or smaller than the value used during compilation. |
| --max_new_tokens | Maximum number of tokens to generate in the output. |
| … | SSH username for the Modalix board. Optional, default: … |
| … | SSH password for the Modalix board. Optional. Set to enable non-interactive automated benchmarking. |
For all available options, run llima-benchmark perf -h.